8 min read

Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes - Part 1

Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes - Part 1
Photo by SpaceX / Unsplash

Let me tell you about the time I wanted to analyze the NYC trip dataset but found my laptop gasping for air like it just ran a marathon 😅. We've all been there - you download what seems like a reasonable dataset, only to find your machine grinding to a halt when you try to open it in pandas.

The NYC trip data is a perfect example of this challenge. It's a fascinating dataset with billions of taxi and ride-sharing trips across New York City, but at several terabytes, it's enough to make any personal computer wave the white flag of surrender. The TLC website doesn't pull any punches about its size - this is serious big data territory.

But here's the thing - I didn't want to shell out hundreds of dollars for cloud computing when I already had three perfectly good computers sitting at home, doing nothing most of the time. That's when it hit me: what if I could combine their powers, Power Ranger style?

Assembling Your Data Processing Avengers

The concept is simple but powerful: pool the computational resources of all your PCs to create your own private data processing cluster. It's like when the Power Rangers combine their robots to form the Megazord to tackle bigger enemies!

power-rangers-combination-swords.gif

At home, I've got three computers that individually would struggle with heavy data processing. But together? They form a respectable mini-cluster with 16 CPUs and 56GB RAM - enough firepower to handle substantial data analytics tasks. Here's my humble home fleet:

cluster-pc.drawio.png

I deliberately included a Windows host in this setup because, let's be honest, most of us have at least one Windows PC at home. And surprisingly, I couldn't find any comprehensive guide for creating a multi-OS, multi-host Kubernetes setup - so I decided to build one myself!

What We're Building & Why Spark Matters

This is the first in a series where I'll show you how to:

  1. Create a lightweight Kubernetes cluster using K3s (this article)
  2. Deploy Apache Spark to process data at scale
  3. Connect it all to big datasets like the NYC Trip data stored in AWS S3

But why Spark? When you're dealing with datasets that make your computer cry, Apache Spark is the superhero you need. It's designed to distribute processing across multiple machines, turning what would be hours of processing into minutes. And the best part? You can write your analytics in Python using PySpark, so there's no need to learn a completely new language.

Self-hosting gives you:

  • Complete control over your infrastructure
  • No surprise cloud bills at the end of the month
  • A playground to learn valuable skills for your career
  • The satisfaction of squeezing value from hardware you already own

Let's dive in and build something cool!

In this article

Setup Server Node and Worker Node #1 (Linux PC)

Several approaches exist for installing Linux, each suited to different requirements and preferences.

In my setup with Proxmox VE, I've configured two LXCs running Ubuntu 22.04, though detailing the creation process falls beyond the scope of this guide.
It required considerable effort to operationalize, so if you're contemplating this route, be prepared for potential complexities involving Kernel modules,
OS configurations, and the use of unprivileged containers.
On the other hand, installing Ubuntu on a virtual machine or directly onto hardware (BareMetal) presents a more straightforward alternative,
with numerous tutorials available online.

Regardless of the method chosen, ensure the establishment of two Ubuntu 22.04 hosts prior to advancing to the subsequent section of this guide.
For consistency and to prevent compatibility issues, I've standardized on the same Ubuntu 22.04 version across all hosts.

Setup Worker Node #2 (Windows Laptop)

I'm utilizing a pre-existing Windows laptop, and the forthcoming steps will guide you through the process of setting up a new virtual machine on a Windows system.

  1. Prerequisites
  • Ensure your processor supports virtualization technology.
  • Enable Hyper-V or install VirtualBox. I was getting slow performance with Virtualbox and would recommend Hyper-V if using Windows 11 Pro edition.
  1. Spin up an Ubuntu VM using Multipass
  • Visit Multipass install page and follow the steps
  • Launch a command window and issue the following commands. It is recommended to use at least 8G of memory and specify the network option as the LAN interface.
    REM Firstly, find out the available network interfaces
    > multipass networks
    Name                    Type      Description
    Default Switch          switch    Virtual Switch with internal networking
    Ethernet 8              ethernet  Realtek USB 2.5GbE Family Controller
    
    REM, Set the default bridged network to your LAN adapter (eg Ethernet 8)
    > multipass set local.bridged-network="Ethernet 8"
    
    REM Launch an instance with 4 cpus, 8G ram
    > multipass launch --name vm-k3n1 --cpus 4 --disk 50G --memory 8G --bridged
    Launched: vm-k3n1
    
    REM Check the instance's status and note down its IPv4 address
    > multipass info vm-k3n1
    Name:           vm-k3n1
    State:          Running
    (omitted)
    IPv4:           172.29.199.195 (eth0)
                   192.168.1.56 (eth1)
    Release:        Ubuntu 22.04.4 LTS
    (omitted)
    

Create Kubernetes Cluster (Multi-Nodes)

  1. Install K3s in server node

    # Install K3s with traefik disabled
    > curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable traefik" sh -s
    [INFO]  Finding release for channel stable
    [INFO]  Using v1.28.6+k3s2 as release
    ...
    Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
    [INFO]  systemd: Starting k3s
    
  2. Deploy NGINX ingress controller in server node

    > kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.2/deploy/static/provider/cloud/deploy.yaml
    
    namespace/ingress-nginx created
    serviceaccount/ingress-nginx created
    ...
    validatingwebhookconfiguration.admissionregistration.k8s.io/ingress-nginx-admission created
    
  3. Get server token

    > cat /var/lib/rancher/k3s/server/token
    eg. K10c625fa31bb83efca4409b3657c2bb04a5ccd5bfb8b2487e4baf899f2e0c9bc78::server:5e091abbe36d1cbad49e1de2bb774ee0
    
  4. Install K3s in Node #1 and add node to cluster

    Install and add using the one liner command below, replacing server url with your server url and mypassword from the token in the earlier step

    > curl -sfL https://get.k3s.io | K3S_URL=https://192.168.1.233:6443 K3S_TOKEN=mypassword sh -s -
    
  5. Install K3s in Node #2 and add node to cluster

    The installation command mirrors that of Node 1, with the distinction that we are deliberately defining the IP address and network interface. This step is crucial because K3s defaults to selecting the first network interface (eth0), which might not be the desired one for our setup.

    # Do update node-ip, iface, server url and token with your values
    > curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="agent --node-ip=192.168.1.56 --flannel-iface=eth1" K3S_TOKEN=mypassword sh -s - --server https://192.168.1.233:6443
    [INFO]  Finding release for channel stable
    (omitted)
    [INFO]  systemd: Starting k3s-agent
    
    # Reload and restart the service if needed, in case the node ip is not updated correctly
    > sudo systemctl daemon-reload
    > sudo systemctl restart k3s-agent
    
    
  6. Check the installation

    Back in the server, run the commmand below to ensure all the nodes have the status "Ready" and their Internal-Ip are all on the same network (in my case, I have them all on the network 192.168.1.0/24)

    root@lxc-k3s$ kubectl get nodes -o wide
    NAME      STATUS   ROLES                  AGE   VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE    (omitted)......
    vm-k3n1   Ready    <none>                 37m   v1.28.6+k3s2   192.168.1.56    <none>        Ubuntu 22.04.4 LTS 
    lxc-k3s   Ready    control-plane,master   11d   v1.28.6+k3s2   192.168.1.233   <none>        Ubuntu 22.04.4 LTS
    lxc-k3n   Ready    <none>                 11d   v1.28.6+k3s2   192.168.1.234   <none>        Ubuntu 22.04.4 LTS
    

Deploy Kubernetes Dashboard

Kubernetes dashboard is a Web UI that can be installed as an Addon to monitor the various components (pods, services, deployment etc) of your Kubernetes cluster.
We'll modify the standard installation guide slightly, tailoring it for use within a home laboratory setup

  1. Download the YAML deploy file
> curl -sL https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml > dashboard.yaml
  1. Modify the downloaded file with these alterations. Note: The login will be deactivated. If this configuration does not suit your environment, consider adhering to the original guide and opt for a login token.
# In the Deployment section, change the container settings to look like this
....
containers:
        - name: kubernetes-dashboard
          image: kubernetesui/dashboard:v2.7.0
          imagePullPolicy: Always
          ports:
            - containerPort: 9090
              protocol: TCP
          args:
            - '--namespace=kubernetes-dashboard'
            - '--enable-skip-login'
            - '--disable-settings-authorizer'
          volumeMounts:
            - name: kubernetes-dashboard-certs
              mountPath: /certs
              # Create on-disk volume to store exec logs
            - mountPath: /tmp
              name: tmp-volume
          livenessProbe:
            httpGet:
              scheme: HTTP
              path: /
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsUser: 1001
            runAsGroup: 2001
....
  1. Deploy the changes

    root@lxc-k3s$ kubectl apply -f dashboard.yaml
    namespace/kubernetes-dashboard created
    serviceaccount/kubernetes-dashboard created
    ...
    deployment.apps/dashboard-metrics-scraper created
    
  2. Do a quick test

    # Get the pod name for dashboard
    root@lxc-k3s$ kubectl get pods -n kubernetes-dashboard
    NAME                                         READY   STATUS    RESTARTS       AGE
    dashboard-metrics-scraper-5657497c4c-np75l   1/1     Running   3 (50m ago)    3d15h
    kubernetes-dashboard-657b44677f-zw7zt        1/1     Running   12 (50m ago)   3d15h
    
    # Port forward the container port 9090 to the node port 9090
    root@lxc-k3s$ kubectl port-forward kubernetes-dashboard-657b44677f-zw7zt -n kubernetes-dashboard 9090:9090
    Forwarding from 127.0.0.1:9090 -> 9090
    Forwarding from [::1]:9090 -> 9090
    

    After executing the commands above, the server node will be listening at port 9090. Visit http://192.168.1.233:9090 to view the dashboard UI.

  3. Optional - Setup Ingress for dashboard

    To access the dashboard, it's necessary to confirm that port forwarding is active Or for ease of use, an ingress can be set up with the specified yaml configuration.

    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: kubernetes-dashboard-http
      namespace: kubernetes-dashboard
    spec:
      selector:
        k8s-app: kubernetes-dashboard
      ports:
        - protocol: TCP
          port: 80
          targetPort: 9090
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: dashboard-ingress
      namespace: kubernetes-dashboard
    spec:
      ingressClassName: nginx
      rules:
      - host: k3dashboard.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kubernetes-dashboard-http
                port: 
                  number: 80
    

    Let's walkthrough the above code:

    1. We deploy a service that will allow us to find the kubernetes dashboard pod (via the selector) listening at port 9090.
    2. We then deploy an ingress using the NGINX controller we previously installed, which will forward the traffic to the service when the url http://k3dashboard.com is accessed.

    For #2 to work, we need to ensure the address will be mapped to the server ip address via DNS. In my case, I updated the Windows host file to include this entry:

    # localhost name resolution is handled within DNS itself.
    #	127.0.0.1       localhost
    #	::1             localhost
    192.168.1.233 k3dashboard.com
    

    The dashboard can now be accessed with the url http://k3dashboard.com, without having to run the port forwarding.

What's Next: Unleashing Apache Spark on Your Cluster

Congratulations! You've just built a mini data center right in your own home. Your Kubernetes cluster is up and running, but this is just the foundation. In the next article of this series, I'll show you how to deploy Apache Spark on this cluster and start processing data at a scale that would make your laptop beg for mercy.

Apache Spark will allow you to:

  • Process datasets much larger than your RAM capacity
  • Run complex data transformations in parallel across all your nodes
  • Use familiar Python syntax with PySpark
  • Scale your processing as your computational needs grow

The beauty of running Spark on your own hardware is that you're not watching a cloud billing meter tick up with every computation. You can experiment, learn, and process large datasets without worrying about costs.

Why This Matters: Beyond Just a Cool Project

Building your own data processing cluster isn't just a fun weekend project (though it absolutely is that!). It's also:

  1. A learning laboratory: You'll gain hands-on experience with technologies that power modern data infrastructure
  2. A stepping stone to bigger things: The skills you learn here translate directly to cloud environments
  3. A practical solution: For those datasets that are too big for your laptop but not quite "rent a data center" big
  4. A resource optimizer: Making use of computing power you already own but might be sitting idle

Have you built your own data processing setup at home? Are you planning to follow along with this guide? I'd love to hear about your experiences or answer any questions in the comments below. After all, the best part of self-hosting is the community that comes with it!