Why we built a Spark solution for Kubernetes

robgibbon

on 17 October 2023

Tags: apache spark, Big Data, data fabric, spark


We’re super excited to announce that we have shipped the first release of our solution for big data – Charmed Spark. Charmed Spark packages a supported distribution of Apache Spark and optimises it for deployment to Kubernetes, which is where most of the industry is moving these days.

Reimagining how to work with big data

Having the opportunity to rethink how big data is processed meant that we could challenge the status quo of the more traditional Hadoop YARN stack. With our Charmed Kubernetes and MicroK8s systems, we have a greatly simplified yet powerful family of cluster managers that enable full-stack deployment of big data clusters and a consistent user experience across local, on-premises and cloud environments. Of course, you’re not limited to our Kubernetes distributions: you can use Charmed Spark on other conformant Kubernetes distributions, for example AWS EKS.

And reimagining big data storage too

For storage we chose S3 API-compliant Ceph instead of the Hadoop HDFS storage system, although the solution is designed to work with most S3-compatible scale-out storage solutions. HDFS has a number of well-known problems, such as the NameNode holding the entire inode map of the filesystem in Java heap, and the NameNode’s active/passive failover architecture. We opted to sidestep these and adopt more contemporary object storage as the preferred backing tier for our solution. With modern, high-capacity networking (for example, above 100GbE), bits can typically be moved to and from the Spark cluster faster than the cluster can process them, so the HDFS design paradigm of bringing the compute to the data makes less sense nowadays. Of course, users can still connect Charmed Spark to HDFS if they so wish.
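To make the storage choice concrete, here is a minimal sketch of how a Spark job is typically pointed at an S3-compatible endpoint through the Hadoop S3A connector. The property names are the standard upstream `fs.s3a.*` settings; the helper functions, endpoint and credentials are placeholders invented for illustration and are not part of Charmed Spark.

```python
from typing import Dict, List


def s3a_conf(endpoint: str, access_key: str, secret_key: str,
             use_tls: bool = True) -> Dict[str, str]:
    """Return Spark properties that point the Hadoop S3A connector at an
    S3-compatible endpoint, such as a Ceph Object Gateway."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Non-AWS object stores usually expect path-style bucket addressing.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_tls).lower(),
    }


def as_conf_args(conf: Dict[str, str]) -> List[str]:
    """Flatten properties into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Placeholder endpoint and credentials, for illustration only.
    conf = s3a_conf("ceph-rgw.example.internal:7480", "ACCESS", "SECRET",
                    use_tls=False)
    print(" ".join(as_conf_args(conf)))
```

With settings like these in place, jobs read and write `s3a://bucket/path` URIs rather than `hdfs://` ones. In a real deployment you would normally keep credentials out of the command line, for example in a properties file or a Kubernetes secret.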

Simplifying operations

In terms of operations, we wanted to keep the user experience as true to upstream Apache Spark as possible, so that users can drop in our runtime as a replacement for the upstream Spark Kubernetes container image with minimal fuss. CLI commands like `spark-submit`, `pyspark` and `spark-shell` work exactly as you would expect. We provide an Ubuntu snap package with client tools to help you get started quickly and easily; it can be installed on the edge nodes of your big data cluster.
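As a rough illustration of what "drop-in" means here, the sketch below assembles an upstream-style `spark-submit` invocation for a Kubernetes cluster. The flags and `spark.kubernetes.*` properties are standard upstream Spark options. The API server address, image name, namespace, service account and application path are hypothetical placeholders, and the exact command name exposed by the snap may differ from plain `spark-submit`.

```python
import shlex
from typing import List


def spark_submit_argv(api_server: str, image: str, namespace: str,
                      service_account: str, app: str) -> List[str]:
    """Build the argument vector for a cluster-mode submission to Kubernetes."""
    return [
        "spark-submit",
        # Kubernetes masters are addressed as k8s://<API server URL>.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf",
        "spark.kubernetes.authenticate.driver.serviceAccountName="
        + service_account,
        app,
    ]


if __name__ == "__main__":
    # All values below are placeholders, not real endpoints or images.
    argv = spark_submit_argv(
        api_server="https://10.0.0.1:6443",
        image="registry.example.internal/spark:latest",
        namespace="spark",
        service_account="spark",
        app="local:///opt/spark/examples/src/main/python/pi.py",
    )
    print(shlex.join(argv))
```

Swapping the container image is the only change relative to a submission against the upstream image; the rest of the command stays the same.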

The snap package also includes our spark8t Python library and CLI for managing service accounts and profiles for jobs on your Kubernetes cluster. Our aim with this tool is to make the lives of cluster admins, data engineers and data scientists a little bit easier by allowing them to preconfigure Spark job settings for different types of workloads and for the different Kubernetes service accounts that the jobs will run under.
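The idea of preconfigured settings per service account can be pictured as layered configuration: a stored profile supplies defaults, and a job overrides only what it needs. The sketch below is a conceptual illustration only. It does not use or reproduce the spark8t API, and the `PROFILES` table, the `resolve_conf` function and all values in it are hypothetical.

```python
from typing import Dict, Mapping, Optional, Tuple

Conf = Dict[str, str]

# Hypothetical profiles, keyed by (namespace, service account).
PROFILES: Dict[Tuple[str, str], Conf] = {
    ("analytics", "etl"): {
        "spark.executor.instances": "8",
        "spark.executor.memory": "8g",
    },
    ("analytics", "notebook"): {
        "spark.executor.instances": "2",
        "spark.executor.memory": "2g",
    },
}


def resolve_conf(namespace: str, service_account: str,
                 overrides: Optional[Mapping[str, str]] = None) -> Conf:
    """Merge profile defaults with job-level overrides (overrides win)."""
    # Copy so that the stored profile is never mutated.
    conf = dict(PROFILES.get((namespace, service_account), {}))
    conf["spark.kubernetes.namespace"] = namespace
    conf["spark.kubernetes.authenticate.driver.serviceAccountName"] = (
        service_account
    )
    conf.update(overrides or {})
    return conf


if __name__ == "__main__":
    # An ETL job that needs more executor memory than its profile default.
    job_conf = resolve_conf("analytics", "etl",
                            {"spark.executor.memory": "16g"})
    for key in sorted(job_conf):
        print(f"{key}={job_conf[key]}")
```

The benefit of this layering is that a data scientist submitting a job only has to name the service account, while the cluster admin controls the defaults that come with it.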

We also offer a Juju Charm for Spark History Server. Juju is a powerful system for day-2 operational management of complex distributed systems on clouds and on Kubernetes. A Juju Charm is like a copilot for an application: it contains codified knowledge about how to operate it. This one helps you deploy and operate the Spark History Server on Kubernetes in a straightforward way. Read the docs to get started. We’ll be adding more Juju Charms to our Spark solution over time to cover more functionality.

Get working fast

We’ve integrated JupyterLab into the Charmed Spark solution to make it even easier to work with Spark on K8s. You can conveniently spin up a Jupyter environment on an edge node using Docker and have it start a Spark session on your MicroK8s cluster. Learn how to use JupyterLab with Charmed Spark.

The full documentation suite for Charmed Spark is available at canonical.com/data/docs/spark/k8s and we also have a reference architecture guide that you can download.

Good to know: we offer enterprise-grade paid support for the entire solution through our Ubuntu Pro + Support subscription, which covers up to 10 years of break/fix support and security maintenance per major release, in line with our wider commitment to long-term support. If you’re interested in learning more, contact our sales team via the form or call us. We can also help with solution deployment through our fixed-fee deployment services. Community support is available via our chat server and our community forum.


