10 Mar 2024 8 min read Python

Building Your Personal Data Powerhouse: Self-Hosting Apache Spark on Kubernetes - Part 1

Let me tell you about the time I wanted to analyze the NYC trip dataset but found my laptop gasping for air like it just ran a marathon 😅. We've all been there - you download what seems like a reasonable dataset, only to find your machine grinding to a halt when you try to open it in pandas.

The NYC trip data is a perfect example of this challenge. It's a fascinating dataset with billions of taxi and ride-sharing trips across New York City, but at several terabytes, it's enough to make any personal computer wave the white flag of surrender. The TLC website doesn't pull any punches about its size - this is serious big data territory.

But here's the thing - I didn't want to shell out hundreds of dollars for cloud computing when I already had three perfectly good computers sitting at home, doing nothing most of the time. That's when it hit me: what if I could combine their powers, Power Ranger style?

Assembling Your Data Processing Avengers

The concept is simple but powerful: pool the computational resources of all your PCs to create your own private data processing cluster. It's like when the Power Rangers combine their robots to form the Megazord to tackle bigger enemies!

At home, I've got three computers that individually would struggle with heavy data processing. But together? They form a respectable mini-cluster with 16 CPUs and 56GB RAM - enough firepower to handle substantial data analytics tasks. Here's my humble home fleet:

[Figure: cluster-pc.drawio.png, diagram of the three home PCs in the cluster]

I deliberately included a Windows host in this setup because, let's be honest, most of us have at least one Windows PC at home. And surprisingly, I couldn't find any comprehensive guide for creating a multi-OS, multi-host Kubernetes setup - so I decided to build one myself!

What We're Building & Why Spark Matters

This is the first in a series where I'll show you how to:

Create a lightweight Kubernetes cluster using K3s (this article)
Deploy Apache Spark to process data at scale
Connect it all to big datasets like the NYC Trip data stored in AWS S3

But why Spark? When you're dealing with datasets that make your computer cry, Apache Spark is the superhero you need. It's designed to distribute processing across multiple machines, turning what would be hours of processing into minutes. And the best part? You can write your analytics in Python using PySpark, so there's no need to learn a completely new language.

Self-hosting gives you:

Complete control over your infrastructure
No surprise cloud bills at the end of the month
A playground to learn valuable skills for your career
The satisfaction of squeezing value from hardware you already own

Let's dive in and build something cool!

Setup Server Node and Worker Node #1 (Linux PC)

Several approaches exist for installing Linux, each suited to different requirements and preferences.

In my setup with Proxmox VE, I've configured two LXCs running Ubuntu 22.04, though creating them is beyond the scope of this guide.
Getting them working took considerable effort, so if you're considering this route, be prepared for complications involving kernel modules, OS configuration, and unprivileged containers.
Installing Ubuntu in a virtual machine or directly on hardware (bare metal) is the more straightforward alternative, with numerous tutorials available online.

Whichever method you choose, make sure you have two Ubuntu 22.04 hosts ready before moving on to the next section.
For consistency and to prevent compatibility issues, I've standardized on the same Ubuntu 22.04 version across all hosts.

Setup Worker Node #2 (Windows Laptop)

I'm utilizing a pre-existing Windows laptop, and the forthcoming steps will guide you through the process of setting up a new virtual machine on a Windows system.

Prerequisites

Ensure your processor supports virtualization technology.
Enable Hyper-V or install VirtualBox. I was getting slow performance with VirtualBox and would recommend Hyper-V if you are on Windows 11 Pro.

Spin up an Ubuntu VM using Multipass

Visit the Multipass install page and follow the steps.

Open a command window and run the following commands. I recommend at least 8G of memory, and bridging the VM to your LAN interface so that it gets an address on your home network.

REM Firstly, find out the available network interfaces
> multipass networks
Name                    Type      Description
Default Switch          switch    Virtual Switch with internal networking
Ethernet 8              ethernet  Realtek USB 2.5GbE Family Controller

REM Set the default bridged network to your LAN adapter (e.g. Ethernet 8)
> multipass set local.bridged-network="Ethernet 8"

REM Launch an instance with 4 CPUs, 8G RAM and a 50G disk
> multipass launch --name vm-k3n1 --cpus 4 --disk 50G --memory 8G --bridged
Launched: vm-k3n1

REM Check the instance's status and note down its IPv4 address
> multipass info vm-k3n1
Name:           vm-k3n1
State:          Running
(omitted)
IPv4:           172.29.199.195 (eth0)
               192.168.1.56 (eth1)
Release:        Ubuntu 22.04.4 LTS
(omitted)
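
Since the VM ends up with two IPv4 addresses, it's easy to note down the wrong one. Here is a minimal Python sketch that picks the LAN address out of the `multipass info` text. The function name `lan_ipv4` and the `192.168.1.0/24` range are my own assumptions for illustration; swap in your own network.

```python
import ipaddress
import re
from typing import Optional


def lan_ipv4(info_text: str, lan_cidr: str) -> Optional[str]:
    """Return the first IPv4 address in the text that belongs to lan_cidr."""
    network = ipaddress.ip_network(lan_cidr)
    # Look for dotted-quad candidates, then let ipaddress validate them.
    for candidate in re.findall(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", info_text):
        try:
            address = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # e.g. 999.1.1.1 is not a real address
        if address in network:
            return candidate
    return None


# Sample text copied from the `multipass info vm-k3n1` output above.
SAMPLE = """\
Name:           vm-k3n1
State:          Running
IPv4:           172.29.199.195 (eth0)
                192.168.1.56 (eth1)
Release:        Ubuntu 22.04.4 LTS
"""

print(lan_ipv4(SAMPLE, "192.168.1.0/24"))  # prints 192.168.1.56
```

The address it returns is the one to use later as the node IP when joining this VM to the cluster.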

Create Kubernetes Cluster (Multi-Nodes)

Install K3s in server node

# Install K3s with traefik disabled
> curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable traefik" sh -s
[INFO]  Finding release for channel stable
[INFO]  Using v1.28.6+k3s2 as release
...
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s

Deploy NGINX ingress controller in server node

> kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.8.2/deploy/static/provider/cloud/deploy.yaml

namespace/ingress-nginx created
serviceaccount/ingress-nginx created
...
validatingwebhookconfiguration.admissionregistration.k8s.io/ingress-nginx-admission created

Get server token

> cat /var/lib/rancher/k3s/server/token
e.g. K10c625fa31bb83efca4409b3657c2bb04a5ccd5bfb8b2487e4baf899f2e0c9bc78::server:5e091abbe36d1cbad49e1de2bb774ee0

Install K3s in Node #1 and add node to cluster

Install K3s and join the cluster with the one-liner below. Replace the server URL with your own, and replace mypassword with the token from the previous step.
```
> curl -sfL https://get.k3s.io | K3S_URL=https://192.168.1.233:6443 K3S_TOKEN=mypassword sh -s -
```

Install K3s in Node #2 and add node to cluster

The installation command mirrors that of Node 1, with the distinction that we are deliberately defining the IP address and network interface. This step is crucial because K3s defaults to selecting the first network interface (eth0), which might not be the desired one for our setup.

# Do update node-ip, iface, server url and token with your values
> curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="agent --node-ip=192.168.1.56 --flannel-iface=eth1" K3S_TOKEN=mypassword sh -s - --server https://192.168.1.233:6443
[INFO]  Finding release for channel stable
(omitted)
[INFO]  systemd: Starting k3s-agent

# Reload and restart the service if needed, in case the node ip is not updated correctly
> sudo systemctl daemon-reload
> sudo systemctl restart k3s-agent
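
If you have more than a couple of workers to add, typing these one-liners by hand gets error-prone. The sketch below simply assembles the same two command shapes used above from a few values. The helper name `agent_install_command` is hypothetical (it is not part of K3s), and the token is inserted as-is without shell quoting, which I'm assuming is fine for tokens in the format shown earlier.

```python
from typing import Optional


def agent_install_command(server_ip: str, token: str,
                          node_ip: Optional[str] = None,
                          iface: Optional[str] = None) -> str:
    """Build the K3s agent install one-liner in the two forms used in this guide."""
    server_url = f"https://{server_ip}:6443"
    if node_ip is None and iface is None:
        # Simple form (Node #1): let K3s pick the IP and interface.
        return ("curl -sfL https://get.k3s.io | "
                f"K3S_URL={server_url} K3S_TOKEN={token} sh -s -")
    # Explicit form (Node #2): pin the node IP and/or flannel interface.
    flags = ["agent"]
    if node_ip:
        flags.append(f"--node-ip={node_ip}")
    if iface:
        flags.append(f"--flannel-iface={iface}")
    exec_value = " ".join(flags)
    return ("curl -sfL https://get.k3s.io | "
            f'INSTALL_K3S_EXEC="{exec_value}" K3S_TOKEN={token} '
            f"sh -s - --server {server_url}")


print(agent_install_command("192.168.1.233", "mypassword"))
print(agent_install_command("192.168.1.233", "mypassword",
                            node_ip="192.168.1.56", iface="eth1"))
```

The two printed lines match the commands for Node #1 and Node #2 above, so you can paste them straight into each worker's shell.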

Check the installation

Back on the server, run the command below to ensure all the nodes have the status "Ready" and that their INTERNAL-IP addresses are all on the same network (in my case, they are all on 192.168.1.0/24).

root@lxc-k3s$ kubectl get nodes -o wide
NAME      STATUS   ROLES                  AGE   VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE    (omitted)......
vm-k3n1   Ready    <none>                 37m   v1.28.6+k3s2   192.168.1.56    <none>        Ubuntu 22.04.4 LTS 
lxc-k3s   Ready    control-plane,master   11d   v1.28.6+k3s2   192.168.1.233   <none>        Ubuntu 22.04.4 LTS
lxc-k3n   Ready    <none>                 11d   v1.28.6+k3s2   192.168.1.234   <none>        Ubuntu 22.04.4 LTS
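
Eyeballing the table works for three nodes, but the same check can be scripted. This is a rough sketch that inspects the JSON form of the node list (`kubectl get nodes -o json`) and reports nodes that are not Ready or whose InternalIP is outside the LAN. The function names and the sample data are made up for illustration; the sample imitates what would happen if the VM had registered with its eth0 address.

```python
import ipaddress
import json
import subprocess
from typing import List


def node_problems(node_list: dict, lan_cidr: str) -> List[str]:
    """Return human-readable problems; an empty list means everything looks fine."""
    network = ipaddress.ip_network(lan_cidr)
    problems = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        status = node.get("status", {})
        # A node is Ready when its "Ready" condition has status "True".
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            problems.append(f"{name}: not Ready")
        internal_ips = [
            addr["address"]
            for addr in status.get("addresses", [])
            if addr.get("type") == "InternalIP"
        ]
        if not internal_ips:
            problems.append(f"{name}: no InternalIP reported")
        for ip in internal_ips:
            if ipaddress.ip_address(ip) not in network:
                problems.append(f"{name}: InternalIP {ip} is outside {lan_cidr}")
    return problems


def nodes_from_kubectl() -> dict:
    """Fetch the live node list (needs kubectl on the PATH); not called here."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def sample_node(name: str, ip: str, ready: str = "True") -> dict:
    """Build a minimal, hypothetical node object for the demo below."""
    return {
        "metadata": {"name": name},
        "status": {
            "conditions": [{"type": "Ready", "status": ready}],
            "addresses": [{"type": "InternalIP", "address": ip}],
        },
    }


SAMPLE = {"items": [sample_node("vm-k3n1", "172.29.199.195"),
                    sample_node("lxc-k3s", "192.168.1.233")]}
print(node_problems(SAMPLE, "192.168.1.0/24"))
# prints ['vm-k3n1: InternalIP 172.29.199.195 is outside 192.168.1.0/24']
```

On the server you would call `node_problems(nodes_from_kubectl(), "192.168.1.0/24")` instead of using the sample data.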

Deploy Kubernetes Dashboard

Kubernetes Dashboard is a web UI that can be installed as an add-on to monitor the various components (pods, services, deployments, etc.) of your Kubernetes cluster.
We'll modify the standard installation guide slightly, tailoring it for a home lab setup.

Download the YAML deploy file

> curl -sL https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml > dashboard.yaml

Edit the downloaded file as shown below. Note: with these changes the login can be skipped, so the dashboard is effectively open to anyone who can reach it. If that does not suit your environment, follow the original guide and use a login token instead.

# In the Deployment section, change the container settings to look like this
....
containers:
        - name: kubernetes-dashboard
          image: kubernetesui/dashboard:v2.7.0
          imagePullPolicy: Always
          ports:
            - containerPort: 9090
              protocol: TCP
          args:
            - '--namespace=kubernetes-dashboard'
            - '--enable-skip-login'
            - '--disable-settings-authorizer'
          volumeMounts:
            - name: kubernetes-dashboard-certs
              mountPath: /certs
              # Create on-disk volume to store exec logs
            - mountPath: /tmp
              name: tmp-volume
          livenessProbe:
            httpGet:
              scheme: HTTP
              path: /
              port: 9090
            initialDelaySeconds: 30
            timeoutSeconds: 30
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsUser: 1001
            runAsGroup: 2001
....

Deploy the changes

root@lxc-k3s$ kubectl apply -f dashboard.yaml
namespace/kubernetes-dashboard created
serviceaccount/kubernetes-dashboard created
...
deployment.apps/dashboard-metrics-scraper created

Do a quick test

# Get the pod name for dashboard
root@lxc-k3s$ kubectl get pods -n kubernetes-dashboard
NAME                                         READY   STATUS    RESTARTS       AGE
dashboard-metrics-scraper-5657497c4c-np75l   1/1     Running   3 (50m ago)    3d15h
kubernetes-dashboard-657b44677f-zw7zt        1/1     Running   12 (50m ago)   3d15h

# Port forward the container port 9090 to the node port 9090
root@lxc-k3s$ kubectl port-forward kubernetes-dashboard-657b44677f-zw7zt -n kubernetes-dashboard 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090

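With the port forward running, kubectl listens on port 9090 of the server node. As the output above shows, it binds to localhost by default, so http://localhost:9090 works from the server itself. To reach the dashboard from another machine at http://192.168.1.233:9090, add --address 0.0.0.0 to the port-forward command.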

Optional - Setup Ingress for dashboard

Accessing the dashboard this way requires the port forward to be running. For ease of use, you can instead set up an ingress with the following YAML configuration.
```
---
apiVersion: v1
kind: Service
metadata:
  name: kubernetes-dashboard-http
  namespace: kubernetes-dashboard
spec:
  selector:
    k8s-app: kubernetes-dashboard
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9090
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dashboard-ingress
  namespace: kubernetes-dashboard
spec:
  ingressClassName: nginx
  rules:
  - host: k3dashboard.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: kubernetes-dashboard-http
            port: 
              number: 80
```
Let's walk through the code above:
1. We deploy a service that finds the Kubernetes dashboard pod (via the selector) listening on port 9090.
2. We then deploy an ingress using the NGINX controller we installed earlier, which forwards traffic to the service when the URL http://k3dashboard.com is accessed.
For #2 to work, the host name must resolve to the server IP address. In my case, I updated the Windows hosts file to include this entry:
```
# localhost name resolution is handled within DNS itself.
#	127.0.0.1       localhost
#	::1             localhost
192.168.1.233 k3dashboard.com
```
The dashboard can now be accessed with the url http://k3dashboard.com, without having to run the port forwarding.
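
If the page doesn't load, it helps to know whether the ingress or the name resolution is at fault. The ingress routes on the HTTP Host header, so you can test it without touching the hosts file by calling the server IP directly and setting the header yourself. Below is a small sketch using only the Python standard library; `fetch_status` is a hypothetical helper name, and the IP and host name are the ones from my setup.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, host_header: Optional[str] = None,
                 timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the server is unreachable."""
    request = urllib.request.Request(url)
    if host_header:
        # Override the Host header so the ingress matches its host rule.
        request.add_header("Host", host_header)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, just not with a success code (e.g. 404).
        return error.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return None


# Example calls for my cluster (commented out so nothing runs on import):
# fetch_status("http://192.168.1.233/", host_header="k3dashboard.com")
# fetch_status("http://k3dashboard.com/")
```

If the first call succeeds but the second returns None, the ingress is fine and the hosts file entry (or DNS) is the thing to look at.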

What's Next: Unleashing Apache Spark on Your Cluster

Congratulations! You've just built a mini data center right in your own home. Your Kubernetes cluster is up and running, but this is just the foundation. In the next article of this series, I'll show you how to deploy Apache Spark on this cluster and start processing data at a scale that would make your laptop beg for mercy.

Apache Spark will allow you to:

Process datasets much larger than your RAM capacity
Run complex data transformations in parallel across all your nodes
Use familiar Python syntax with PySpark
Scale your processing as your computational needs grow

The beauty of running Spark on your own hardware is that you're not watching a cloud billing meter tick up with every computation. You can experiment, learn, and process large datasets without worrying about costs.

Why This Matters: Beyond Just a Cool Project

Building your own data processing cluster isn't just a fun weekend project (though it absolutely is that!). It's also:

A learning laboratory: You'll gain hands-on experience with technologies that power modern data infrastructure
A stepping stone to bigger things: The skills you learn here translate directly to cloud environments
A practical solution: For those datasets that are too big for your laptop but not quite "rent a data center" big
A resource optimizer: Making use of computing power you already own but might be sitting idle

Have you built your own data processing setup at home? Are you planning to follow along with this guide? I'd love to hear about your experiences or answer any questions in the comments below. After all, the best part of self-hosting is the community that comes with it!
