{"id":30,"date":"2024-03-10T18:26:47","date_gmt":"2024-03-10T18:26:47","guid":{"rendered":"https:\/\/wordpress.joeltan.me\/?p=30"},"modified":"2026-01-10T00:08:16","modified_gmt":"2026-01-10T00:08:16","slug":"building-your-personal-data-powerhouse-self-hosting-apache-spark-on-kubernetes-part-1","status":"publish","type":"post","link":"https:\/\/joeltan.me\/?p=30","title":{"rendered":"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes &#8211; Part 1"},"content":{"rendered":"\n<p>Let me tell you about the time I wanted to analyze the NYC trip dataset but found my laptop gasping for air like it just ran a marathon \ud83d\ude05. We&#8217;ve all been there &#8211; you download what seems like a reasonable dataset, only to find your machine grinding to a halt when you try to open it in pandas.<\/p>\n\n\n\n<p>The NYC trip data is a perfect example of this challenge. It&#8217;s a fascinating dataset with billions of taxi and ride-sharing trips across New York City, but at several terabytes, it&#8217;s enough to make any personal computer wave the white flag of surrender. The <a href=\"https:\/\/www.nyc.gov\/site\/tlc\/about\/tlc-trip-record-data.page\">TLC website<\/a> doesn&#8217;t pull any punches about its size &#8211; this is serious big data territory.<\/p>\n\n\n\n<p>But here&#8217;s the thing &#8211; I didn&#8217;t want to shell out hundreds of dollars for cloud computing when I already had three perfectly good computers sitting at home, doing nothing most of the time. That&#8217;s when it hit me: what if I could combine their powers, Power Ranger style?<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Assembling Your Data Processing Avengers<\/h2>\n\n\n\n<p>The concept is simple but powerful: pool the computational resources of all your PCs to create your own private data processing cluster. 
It&#8217;s like when the Power Rangers combine their robots to form the Megazord to tackle bigger enemies!<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"478\" src=\"https:\/\/wordpress.joeltan.me\/wp-content\/uploads\/2026\/01\/power-rangers-combination-swords.gif\" alt=\"power-rangers-combination-swords.gif\" class=\"wp-image-34\"\/><\/figure>\n\n\n\n<p>At home, I&#8217;ve got three computers that individually would struggle with heavy data processing. But together? They form a respectable mini-cluster with 16 CPUs and 56GB RAM &#8211; enough firepower to handle substantial data analytics tasks. Here&#8217;s my humble home fleet:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"431\" height=\"231\" src=\"https:\/\/wordpress.joeltan.me\/wp-content\/uploads\/2026\/01\/cluster-pc.drawio.png\" alt=\"cluster-pc.drawio.png\" class=\"wp-image-33\" srcset=\"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/cluster-pc.drawio.png 431w, https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/cluster-pc.drawio-300x161.png 300w\" sizes=\"auto, (max-width: 431px) 100vw, 431px\" \/><\/figure>\n\n\n\n<p>I deliberately included a Windows host in this setup because, let&#8217;s be honest, most of us have at least one Windows PC at home. And surprisingly, I couldn&#8217;t find any comprehensive guide for creating a multi-OS, multi-host Kubernetes setup &#8211; so I decided to build one myself!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What We&#8217;re Building &amp; Why Spark Matters<\/h2>\n\n\n\n<p>This is the first in a series where I&#8217;ll show you how to:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Create a lightweight Kubernetes cluster using K3s (this article)<\/li>\n\n\n\n<li>Deploy Apache Spark to process data at scale<\/li>\n\n\n\n<li>Connect it all to big datasets like the NYC Trip data stored in AWS S3<\/li>\n<\/ol>\n\n\n\n<p>But why Spark? 
When you&#8217;re dealing with datasets that make your computer cry, Apache Spark is the superhero you need. It&#8217;s designed to distribute processing across multiple machines, turning what would be hours of processing into minutes. And the best part? You can write your analytics in Python using PySpark, so there&#8217;s no need to learn a completely new language.<\/p>\n\n\n\n<p>Self-hosting gives you:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Complete control over your infrastructure<\/li>\n\n\n\n<li>No surprise cloud bills at the end of the month<\/li>\n\n\n\n<li>A playground to learn valuable skills for your career<\/li>\n\n\n\n<li>The satisfaction of squeezing value from hardware you already own<\/li>\n<\/ul>\n\n\n\n<p>Let&#8217;s dive in and build something cool!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">In this article<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"#setup-server-node-and-worker-node-1-linux-pc\">Setup Server Node and Worker Node #1 (Linux PC)<\/a><\/li>\n\n\n\n<li><a href=\"#setup-worker-node-2-windows-laptop\">Setup Worker Node #2 (Windows Laptop)<\/a><\/li>\n\n\n\n<li><a href=\"#create-kubernetes-cluster-multi-nodes\">Create Kubernetes Cluster (Multi-Nodes)<\/a><\/li>\n\n\n\n<li><a href=\"#deploy-kubernetes-dashboard\">Deploy Kubernetes Dashboard<\/a><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Setup Server Node and Worker Node #1 (Linux PC)<\/h2>\n\n\n\n<p>Several approaches exist for installing Linux, each suited to different requirements and preferences.<\/p>\n\n\n\n<p>In my setup with Proxmox VE, I&#8217;ve configured two LXCs running Ubuntu 22.04, though detailing the creation process falls beyond the scope of this guide.<br>It required considerable effort to operationalize, so if you&#8217;re contemplating this route, be prepared for potential complexities involving Kernel modules,<br>OS configurations, and the use of unprivileged containers.<br>On the other hand, installing Ubuntu on a virtual machine or 
directly onto hardware (bare metal) is a more straightforward alternative,<br>with numerous tutorials available online.<\/p>\n\n\n\n<p>Whichever method you choose, make sure you have two Ubuntu 22.04 hosts ready before moving on to the next section of this guide.<br>For consistency and to prevent compatibility issues, I&#8217;ve standardized on the same Ubuntu 22.04 version across all hosts.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Setup Worker Node #2 (Windows Laptop)<\/h2>\n\n\n\n<p>I&#8217;m using an existing Windows laptop, and the steps below walk you through setting up a new virtual machine on a Windows system.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Prerequisites<\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ensure your processor supports virtualization technology.<\/li>\n\n\n\n<li>Enable Hyper-V or install VirtualBox. I was getting slow performance with VirtualBox and would recommend Hyper-V if you are using Windows 11 Pro.<\/li>\n<\/ul>\n\n\n\n<ol start=\"2\" class=\"wp-block-list\">\n<li>Spin up an Ubuntu VM using Multipass<\/li>\n<\/ol>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Visit the <a href=\"https:\/\/multipass.run\/docs\/installing-on-windows\">Multipass<\/a> install page and follow the steps.<\/li>\n\n\n\n<li>Launch a command window and issue the following commands. 
Use at least 8G of memory, and bridge the instance to your LAN interface so it gets an address on the same network as the other nodes.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>  REM First, find out the available network interfaces\n  &gt; multipass networks\n  Name                    Type      Description\n  Default Switch          switch    Virtual Switch with internal networking\n  Ethernet 8              ethernet  Realtek USB 2.5GbE Family Controller\n\n  REM Set the default bridged network to your LAN adapter (e.g. Ethernet 8)\n  &gt; multipass set local.bridged-network=\"Ethernet 8\"\n\n  REM Launch an instance with 4 CPUs, 8G RAM and a 50G disk\n  &gt; multipass launch --name vm-k3n1 --cpus 4 --disk 50G --memory 8G --bridged\n  Launched: vm-k3n1\n\n  REM Check the instance's status and note down its IPv4 address\n  &gt; multipass info vm-k3n1\n  Name:           vm-k3n1\n  State:          Running\n  (omitted)\n  IPv4:           172.29.199.195 (eth0)\n                 192.168.1.56 (eth1)\n  Release:        Ubuntu 22.04.4 LTS\n  (omitted)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Create Kubernetes Cluster (Multi-Nodes)<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Install K3s in server node<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   # Install K3s with traefik disabled\n   &gt; curl -sfL https:\/\/get.k3s.io | INSTALL_K3S_EXEC=\"--disable traefik\" sh -s\n   &#91;INFO]  Finding release for channel stable\n   &#91;INFO]  Using v1.28.6+k3s2 as release\n   ...\n   Created symlink \/etc\/systemd\/system\/multi-user.target.wants\/k3s.service \u2192 \/etc\/systemd\/system\/k3s.service.\n   &#91;INFO]  systemd: Starting k3s<\/code><\/pre>\n\n\n\n<ol start=\"2\" class=\"wp-block-list\">\n<li>Deploy NGINX ingress controller in server node<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   &gt; kubectl apply -f https:\/\/raw.githubusercontent.com\/kubernetes\/ingress-nginx\/controller-v1.8.2\/deploy\/static\/provider\/cloud\/deploy.yaml\n\n   namespace\/ingress-nginx 
created\n   serviceaccount\/ingress-nginx created\n   ...\n   validatingwebhookconfiguration.admissionregistration.k8s.io\/ingress-nginx-admission created<\/code><\/pre>\n\n\n\n<ol start=\"3\" class=\"wp-block-list\">\n<li>Get server token<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   &gt; cat \/var\/lib\/rancher\/k3s\/server\/token\n   eg. K10c625fa31bb83efca4409b3657c2bb04a5ccd5bfb8b2487e4baf899f2e0c9bc78::server:5e091abbe36d1cbad49e1de2bb774ee0<\/code><\/pre>\n\n\n\n<ol start=\"4\" class=\"wp-block-list\">\n<li>Install K3s in Node #1 and add the node to the cluster. Run the one-liner below, replacing the server URL with your own and mypassword with the token from the previous step.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   &gt; curl -sfL https:\/\/get.k3s.io | K3S_URL=https:\/\/192.168.1.233:6443 K3S_TOKEN=mypassword sh -s -<\/code><\/pre>\n\n\n\n<ol start=\"5\" class=\"wp-block-list\">\n<li>Install K3s in Node #2 and add the node to the cluster. The installation command mirrors that of Node #1, except that we explicitly set the node IP address and network interface. 
This step is crucial because K3s defaults to selecting the first network interface (eth0), which is not the LAN-facing one in this setup.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   # Update node-ip, iface, server url and token with your own values\n   &gt; curl -sfL https:\/\/get.k3s.io | INSTALL_K3S_EXEC=\"agent --node-ip=192.168.1.56 --flannel-iface=eth1\" K3S_TOKEN=mypassword sh -s - --server https:\/\/192.168.1.233:6443\n   &#91;INFO]  Finding release for channel stable\n   (omitted)\n   &#91;INFO]  systemd: Starting k3s-agent\n\n   # Reload and restart the service if needed, in case the node ip is not updated correctly\n   &gt; sudo systemctl daemon-reload\n   &gt; sudo systemctl restart k3s-agent<\/code><\/pre>\n\n\n\n<ol start=\"6\" class=\"wp-block-list\">\n<li>Check the installation. Back on the server node, run the command below to confirm that every node has the status &#8220;Ready&#8221; and that all INTERNAL-IP values are on the same network (in my case, 192.168.1.0\/24).<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   root@lxc-k3s$ kubectl get nodes -o wide\n   NAME      STATUS   ROLES                  AGE   VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE    (omitted)......\n   vm-k3n1   Ready    &lt;none&gt;                 37m   v1.28.6+k3s2   192.168.1.56    &lt;none&gt;        Ubuntu 22.04.4 LTS\n   lxc-k3s   Ready    control-plane,master   11d   v1.28.6+k3s2   192.168.1.233   &lt;none&gt;        Ubuntu 22.04.4 LTS\n   lxc-k3n   Ready    &lt;none&gt;                 11d   v1.28.6+k3s2   192.168.1.234   &lt;none&gt;        Ubuntu 22.04.4 LTS<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Deploy Kubernetes Dashboard<\/h2>\n\n\n\n<p>The Kubernetes Dashboard is a web UI that can be installed as an add-on to monitor the various components (pods, services, deployments, etc.) of your Kubernetes cluster.<br>We&#8217;ll modify the standard installation <a 
href=\"https:\/\/kubernetes.io\/docs\/tasks\/access-application-cluster\/web-ui-dashboard\/\">guide<\/a> slightly, tailoring it for use within a home laboratory setup<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Download the YAML deploy file<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>&gt; curl -sL https:\/\/raw.githubusercontent.com\/kubernetes\/dashboard\/v2.7.0\/aio\/deploy\/recommended.yaml &gt; dashboard.yaml<\/code><\/pre>\n\n\n\n<ol start=\"2\" class=\"wp-block-list\">\n<li>Modify the downloaded file with these alterations. Note: The login will be deactivated. If this configuration does not suit your environment, consider adhering to the original guide and opt for a login token.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code># In the Deployment section, change the container settings to look like this\n....\ncontainers:\n        - name: kubernetes-dashboard\n          image: kubernetesui\/dashboard:v2.7.0\n          imagePullPolicy: Always\n          ports:\n            - containerPort: 9090\n              protocol: TCP\n          args:\n            - '--namespace=kubernetes-dashboard'\n            - '--enable-skip-login'\n            - '--disable-settings-authorizer'\n          volumeMounts:\n            - name: kubernetes-dashboard-certs\n              mountPath: \/certs\n              # Create on-disk volume to store exec logs\n            - mountPath: \/tmp\n              name: tmp-volume\n          livenessProbe:\n            httpGet:\n              scheme: HTTP\n              path: \/\n              port: 9090\n            initialDelaySeconds: 30\n            timeoutSeconds: 30\n          securityContext:\n            allowPrivilegeEscalation: false\n            readOnlyRootFilesystem: true\n            runAsUser: 1001\n            runAsGroup: 2001\n....<\/code><\/pre>\n\n\n\n<ol start=\"3\" class=\"wp-block-list\">\n<li>Deploy the changes<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   root@lxc-k3s$ kubectl apply -f 
dashboard.yaml\n   namespace\/kubernetes-dashboard created\n   serviceaccount\/kubernetes-dashboard created\n   ...\n   deployment.apps\/dashboard-metrics-scraper created<\/code><\/pre>\n\n\n\n<ol start=\"4\" class=\"wp-block-list\">\n<li>Do a quick test<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   # Get the pod name for dashboard\n   root@lxc-k3s$ kubectl get pods -n kubernetes-dashboard\n   NAME                                         READY   STATUS    RESTARTS       AGE\n   dashboard-metrics-scraper-5657497c4c-np75l   1\/1     Running   3 (50m ago)    3d15h\n   kubernetes-dashboard-657b44677f-zw7zt        1\/1     Running   12 (50m ago)   3d15h\n\n   # Port forward the container port 9090 to the node port 9090\n   root@lxc-k3s$ kubectl port-forward kubernetes-dashboard-657b44677f-zw7zt -n kubernetes-dashboard 9090:9090\n   Forwarding from 127.0.0.1:9090 -&gt; 9090\n   Forwarding from &#91;::1]:9090 -&gt; 9090<\/code><\/pre>\n\n\n\n<p>After executing the commands above, the server node will be listening at port 9090. 
Note that kubectl port-forward binds to 127.0.0.1 by default, so open http:\/\/localhost:9090 on the server node itself, or add --address 0.0.0.0 to the port-forward command if you want to view the dashboard UI at http:\/\/192.168.1.233:9090 from another machine.<\/p>\n\n\n\n<ol start=\"5\" class=\"wp-block-list\">\n<li>Optional &#8211; Set up Ingress for the dashboard. The quick test above only works while port forwarding is active. For easier access, you can set up an ingress with the YAML configuration below.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   ---\n   apiVersion: v1\n   kind: Service\n   metadata:\n     name: kubernetes-dashboard-http\n     namespace: kubernetes-dashboard\n   spec:\n     selector:\n       k8s-app: kubernetes-dashboard\n     ports:\n       - protocol: TCP\n         port: 80\n         targetPort: 9090\n   ---\n   apiVersion: networking.k8s.io\/v1\n   kind: Ingress\n   metadata:\n     name: dashboard-ingress\n     namespace: kubernetes-dashboard\n   spec:\n     ingressClassName: nginx\n     rules:\n     - host: k3dashboard.com\n       http:\n         paths:\n         - path: \/\n           pathType: Prefix\n           backend:\n             service:\n               name: kubernetes-dashboard-http\n               port:\n                 number: 80<\/code><\/pre>\n\n\n\n<p>Let&#8217;s walk through the above code:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>We deploy a service that will allow us to find the Kubernetes dashboard pod (via the selector) listening at port 9090.<\/li>\n\n\n\n<li>We then deploy an ingress using the NGINX controller we previously installed, which forwards traffic to the service when the URL http:\/\/k3dashboard.com is accessed. For this to work, the hostname must resolve to the server IP address via DNS. 
In my case, I updated the Windows host file to include this entry:<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>   # localhost name resolution is handled within DNS itself.\n   #    127.0.0.1       localhost\n   #    ::1             localhost\n   192.168.1.233 k3dashboard.com<\/code><\/pre>\n\n\n\n<p>The dashboard can now be accessed with the url http:\/\/k3dashboard.com, without having to run the port forwarding.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What&#8217;s Next: Unleashing Apache Spark on Your Cluster<\/h2>\n\n\n\n<p>Congratulations! You&#8217;ve just built a mini data center right in your own home. Your Kubernetes cluster is up and running, but this is just the foundation. In the next article of this series, I&#8217;ll show you how to deploy Apache Spark on this cluster and start processing data at a scale that would make your laptop beg for mercy.<\/p>\n\n\n\n<p>Apache Spark will allow you to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Process datasets much larger than your RAM capacity<\/li>\n\n\n\n<li>Run complex data transformations in parallel across all your nodes<\/li>\n\n\n\n<li>Use familiar Python syntax with PySpark<\/li>\n\n\n\n<li>Scale your processing as your computational needs grow<\/li>\n<\/ul>\n\n\n\n<p>The beauty of running Spark on your own hardware is that you&#8217;re not watching a cloud billing meter tick up with every computation. You can experiment, learn, and process large datasets without worrying about costs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why This Matters: Beyond Just a Cool Project<\/h2>\n\n\n\n<p>Building your own data processing cluster isn&#8217;t just a fun weekend project (though it absolutely is that!). 
It&#8217;s also:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>A learning laboratory<\/strong>: You&#8217;ll gain hands-on experience with technologies that power modern data infrastructure<\/li>\n\n\n\n<li><strong>A stepping stone to bigger things<\/strong>: The skills you learn here translate directly to cloud environments<\/li>\n\n\n\n<li><strong>A practical solution<\/strong>: For those datasets that are too big for your laptop but not quite &#8220;rent a data center&#8221; big<\/li>\n\n\n\n<li><strong>A resource optimizer<\/strong>: Making use of computing power you already own but might be sitting idle<\/li>\n<\/ol>\n\n\n\n<p>Have you built your own data processing setup at home? Are you planning to follow along with this guide? I&#8217;d love to hear about your experiences or answer any questions in the comments below. After all, the best part of self-hosting is the community that comes with it!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let me tell you about the time I wanted to analyze the NYC trip dataset but found my laptop gasping&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":32,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","footnotes":""},"categories":[6],"tags":[],"class_list":["post-30","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-owning-the-stack"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.8 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes - Part 1 - Joel Tan Tech Blogs<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/joeltan.me\/?p=30\" \/>\n<meta property=\"og:locale\" 
content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes - Part 1 - Joel Tan Tech Blogs\" \/>\n<meta property=\"og:description\" content=\"Let me tell you about the time I wanted to analyze the NYC trip dataset but found my laptop gasping&#046;&#046;&#046;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/joeltan.me\/?p=30\" \/>\n<meta property=\"og:site_name\" content=\"Joel Tan Tech Blogs\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-10T18:26:47+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-01-10T00:08:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1140\" \/>\n\t<meta property=\"og:image:height\" content=\"760\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Joel Tan\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Joel Tan\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/joeltan.me\/?p=30#article\",\"isPartOf\":{\"@id\":\"https:\/\/joeltan.me\/?p=30\"},\"author\":{\"name\":\"Joel Tan\",\"@id\":\"https:\/\/joeltan.me\/#\/schema\/person\/db13342201787db723bfdeadcd792743\"},\"headline\":\"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes &#8211; Part 1\",\"datePublished\":\"2024-03-10T18:26:47+00:00\",\"dateModified\":\"2026-01-10T00:08:16+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/joeltan.me\/?p=30\"},\"wordCount\":1446,\"commentCount\":0,\"image\":{\"@id\":\"https:\/\/joeltan.me\/?p=30#primaryimage\"},\"thumbnailUrl\":\"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg\",\"articleSection\":[\"Owning the Stack\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/joeltan.me\/?p=30#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/joeltan.me\/?p=30\",\"url\":\"https:\/\/joeltan.me\/?p=30\",\"name\":\"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes - Part 1 - Joel Tan Tech 
Blogs\",\"isPartOf\":{\"@id\":\"https:\/\/joeltan.me\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/joeltan.me\/?p=30#primaryimage\"},\"image\":{\"@id\":\"https:\/\/joeltan.me\/?p=30#primaryimage\"},\"thumbnailUrl\":\"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg\",\"datePublished\":\"2024-03-10T18:26:47+00:00\",\"dateModified\":\"2026-01-10T00:08:16+00:00\",\"author\":{\"@id\":\"https:\/\/joeltan.me\/#\/schema\/person\/db13342201787db723bfdeadcd792743\"},\"breadcrumb\":{\"@id\":\"https:\/\/joeltan.me\/?p=30#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/joeltan.me\/?p=30\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/joeltan.me\/?p=30#primaryimage\",\"url\":\"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg\",\"contentUrl\":\"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg\",\"width\":1140,\"height\":760},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/joeltan.me\/?p=30#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/joeltan.me\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes &#8211; Part 1\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/joeltan.me\/#website\",\"url\":\"https:\/\/joeltan.me\/\",\"name\":\"Joel Tan Tech Blogs\",\"description\":\"Building systems that survive real 
life\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/joeltan.me\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/joeltan.me\/#\/schema\/person\/db13342201787db723bfdeadcd792743\",\"name\":\"Joel Tan\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/joeltan.me\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/d9b5d1ab218cb2478280027d371ea60543f6551132d31a8cbd45a5a5b3fbadc9?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/d9b5d1ab218cb2478280027d371ea60543f6551132d31a8cbd45a5a5b3fbadc9?s=96&d=mm&r=g\",\"caption\":\"Joel Tan\"},\"sameAs\":[\"http:\/\/192.168.1.146\"],\"url\":\"https:\/\/joeltan.me\/?author=1\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes - Part 1 - Joel Tan Tech Blogs","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/joeltan.me\/?p=30","og_locale":"en_US","og_type":"article","og_title":"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes - Part 1 - Joel Tan Tech Blogs","og_description":"Let me tell you about the time I wanted to analyze the NYC trip dataset but found my laptop gasping&#46;&#46;&#46;","og_url":"https:\/\/joeltan.me\/?p=30","og_site_name":"Joel Tan Tech Blogs","article_published_time":"2024-03-10T18:26:47+00:00","article_modified_time":"2026-01-10T00:08:16+00:00","og_image":[{"width":1140,"height":760,"url":"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg","type":"image\/jpeg"}],"author":"Joel 
Tan","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Joel Tan","Est. reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/joeltan.me\/?p=30#article","isPartOf":{"@id":"https:\/\/joeltan.me\/?p=30"},"author":{"name":"Joel Tan","@id":"https:\/\/joeltan.me\/#\/schema\/person\/db13342201787db723bfdeadcd792743"},"headline":"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes &#8211; Part 1","datePublished":"2024-03-10T18:26:47+00:00","dateModified":"2026-01-10T00:08:16+00:00","mainEntityOfPage":{"@id":"https:\/\/joeltan.me\/?p=30"},"wordCount":1446,"commentCount":0,"image":{"@id":"https:\/\/joeltan.me\/?p=30#primaryimage"},"thumbnailUrl":"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg","articleSection":["Owning the Stack"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/joeltan.me\/?p=30#respond"]}]},{"@type":"WebPage","@id":"https:\/\/joeltan.me\/?p=30","url":"https:\/\/joeltan.me\/?p=30","name":"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes - Part 1 - Joel Tan Tech 
Blogs","isPartOf":{"@id":"https:\/\/joeltan.me\/#website"},"primaryImageOfPage":{"@id":"https:\/\/joeltan.me\/?p=30#primaryimage"},"image":{"@id":"https:\/\/joeltan.me\/?p=30#primaryimage"},"thumbnailUrl":"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg","datePublished":"2024-03-10T18:26:47+00:00","dateModified":"2026-01-10T00:08:16+00:00","author":{"@id":"https:\/\/joeltan.me\/#\/schema\/person\/db13342201787db723bfdeadcd792743"},"breadcrumb":{"@id":"https:\/\/joeltan.me\/?p=30#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/joeltan.me\/?p=30"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/joeltan.me\/?p=30#primaryimage","url":"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg","contentUrl":"https:\/\/joeltan.me\/wp-content\/uploads\/2026\/01\/photo-1517976487492-5750f3195933.jpg","width":1140,"height":760},{"@type":"BreadcrumbList","@id":"https:\/\/joeltan.me\/?p=30#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/joeltan.me\/"},{"@type":"ListItem","position":2,"name":"Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes &#8211; Part 1"}]},{"@type":"WebSite","@id":"https:\/\/joeltan.me\/#website","url":"https:\/\/joeltan.me\/","name":"Joel Tan Tech Blogs","description":"Building systems that survive real life","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/joeltan.me\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/joeltan.me\/#\/schema\/person\/db13342201787db723bfdeadcd792743","name":"Joel 
Tan","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/joeltan.me\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/d9b5d1ab218cb2478280027d371ea60543f6551132d31a8cbd45a5a5b3fbadc9?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/d9b5d1ab218cb2478280027d371ea60543f6551132d31a8cbd45a5a5b3fbadc9?s=96&d=mm&r=g","caption":"Joel Tan"},"sameAs":["http:\/\/192.168.1.146"],"url":"https:\/\/joeltan.me\/?author=1"}]}},"_links":{"self":[{"href":"https:\/\/joeltan.me\/index.php?rest_route=\/wp\/v2\/posts\/30","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/joeltan.me\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/joeltan.me\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/joeltan.me\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/joeltan.me\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=30"}],"version-history":[{"count":1,"href":"https:\/\/joeltan.me\/index.php?rest_route=\/wp\/v2\/posts\/30\/revisions"}],"predecessor-version":[{"id":35,"href":"https:\/\/joeltan.me\/index.php?rest_route=\/wp\/v2\/posts\/30\/revisions\/35"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/joeltan.me\/index.php?rest_route=\/wp\/v2\/media\/32"}],"wp:attachment":[{"href":"https:\/\/joeltan.me\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=30"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/joeltan.me\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=30"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/joeltan.me\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=30"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}