What is Apache Kafka?

Everyone hates waiting in a queue. On the other hand, when you’re routing an absolute storm of event data around a cloud environment, message queues are your best friend. Enter Apache Kafka.

Apache Kafka is a free, open source event streaming platform that enables you to create queues for temporary buffering of large volumes of data. That's about it: it performs one critical task within modern distributed systems engineering, and it performs it really well. Let's look at some of the significant benefits, challenges and use cases of Apache Kafka, and at the easiest way to get it running in production.

Apache Kafka basics

Kafka enables you to group events with a common theme into a logical structure called a topic. Producers publish events to a given topic. Consumers subscribe to a topic and read the events that have been published to the topic. In this regard, Kafka is like many other publish/subscribe messaging systems.
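
To make this concrete, here is a minimal sketch using Kafka's Java clients; the broker address, topic name, consumer group and event payloads are illustrative assumptions, not part of the article.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PubSubExample {
        public static void main(String[] args) {
            // Producer: publish one event to a hypothetical "orders" topic.
            Properties prodProps = new Properties();
            prodProps.put("bootstrap.servers", "localhost:9092");
            prodProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            prodProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(prodProps)) {
                producer.send(new ProducerRecord<>("orders", "order-42", "created"));
            }

            // Consumer: subscribe to the same topic and read the published events.
            Properties consProps = new Properties();
            consProps.put("bootstrap.servers", "localhost:9092");
            consProps.put("group.id", "billing-service");
            consProps.put("auto.offset.reset", "earliest");
            consProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consProps)) {
                consumer.subscribe(Collections.singletonList("orders"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }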

However, unlike many alternative solutions, Kafka is a clustered system, meaning that you can run Kafka on several servers (Kafka brokers) and the brokers will divide the work between themselves. They do this by dividing topics into partitions and assigning different partitions to different brokers. Each topic partition will have one leader broker assigned at any given point in time.

To ensure high availability and fault tolerance, and to improve performance, Kafka can be instructed to copy the data in each topic partition to one or more topic partition replicas hosted on other Kafka brokers. This process is called replication. Typically, you'll replicate the contents of each topic partition twice, so that there are always three copies of each message stored in the cluster; however, you may choose a lower replication factor for less critical data, or a higher one where required.
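
As a sketch of how partition count and replication factor are declared, a topic can be created with Kafka's Java AdminClient; the broker address, topic name and figures below are illustrative.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // 6 partitions spread across the brokers; a replication factor of 3
                // means each partition has a leader plus two replicas on other brokers.
                NewTopic topic = new NewTopic("payments", 6, (short) 3);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }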

Kafka can be configured for topology awareness, which ensures that topic partition replicas are always placed on brokers that are situated in different server racks, or in different halls of a data centre. This way, if there is an outage of a whole rack or indeed a whole server hall, the data is still available on nodes in other racks where the cluster is running.

Kafka provides scalability

Kafka solves scalability challenges because the partitions of a topic handle reads and writes independently: the cluster can serve many producers and consumers in parallel, and throughput is maintained as their numbers grow. Kafka also preserves ordering within each partition. Events published with the same key always land on the same partition, so per-key sequencing is not lost even when multiple producers write to the same topic at the same time.
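
As an illustration of per-key ordering, the sketch below (assuming a local broker, an "orders" topic and a hypothetical customer key) sends three events with the same key; the default partitioner routes them to the same partition, so they are stored and delivered in publication order.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class KeyedOrderingExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // All three events share the key "customer-7", so they hash to the
                // same partition and keep their relative order there.
                String[] events = {"cart-created", "item-added", "checkout"};
                for (String event : events) {
                    RecordMetadata meta =
                            producer.send(new ProducerRecord<>("orders", "customer-7", event)).get();
                    System.out.printf("%s -> partition %d, offset %d%n",
                            event, meta.partition(), meta.offset());
                }
            }
        }
    }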

The scalability achieved with partitions enables organisations to grow their service platform's capacity by adding more Kafka brokers, and makes it easier to add new subscribers or publishers to existing topics. Apache Kafka scales both vertically, by adding more compute capacity (cores, RAM and disks) to each server, and horizontally, by adding more servers.
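
When more capacity is needed, the partition count of an existing topic can also be increased with the AdminClient; the topic name and figures below are illustrative. Note that adding partitions changes how keys map to partitions, so per-key ordering is only guaranteed for data written after the change.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class ScaleTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // Grow the hypothetical "payments" topic from 6 to 12 partitions so
                // the work can be spread across the brokers in the enlarged cluster.
                admin.createPartitions(
                        Collections.singletonMap("payments", NewPartitions.increaseTo(12))
                ).all().get();
            }
        }
    }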

Through all of these techniques, Kafka has become the de facto open source messaging solution for business- and mission-critical workloads.

How is Apache Kafka used?

We have covered the fundamental concepts of Apache Kafka and why it is a valuable addition to cloud environments. Now, let's focus on practical applications:

  • Stream processing: Kafka enables you to create real-time data streams. Subscribing applications can process and transform the data before re-publishing it for other subscribing applications (see the sketch after this list).
  • Fast message queue: In a literal sense, Kafka can be used to send and receive messages; more generally, it enables message passing in a microservice architecture. Kafka moves messages without inspecting the format of the data, which means it can do so very fast: endpoints encode and decode data themselves, with no overhead added in transit.
  • Data aggregation: Kafka enables you to combine multiple sources of log data by creating a common topic that multiple producers write to. The platform handles the complexity of having numerous producers append to a time-sensitive log, with built-in arbitration of concurrent writes.
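
As a sketch of the stream-processing pattern mentioned in the first bullet, the Kafka Streams snippet below reads from one topic, transforms each value, and republishes to another; the topic names and the upper-casing step are illustrative assumptions.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            // Consume from "raw-events", transform each value, publish to "processed-events".
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> source = builder.stream("raw-events");
            source.mapValues(value -> value.toUpperCase()).to("processed-events");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
            streams.start();
        }
    }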

Challenges of using Apache Kafka

We know Apache Kafka has features for scalability (partitioning) and reliability (replication). However, applying these elements to a real-world use case takes careful planning and architectural design.

Firstly, there are physical constraints on how scalable and reliable Kafka can be. For example, the solution needs to be appropriately dimensioned to make sure there is adequate network capacity and fast disk storage.

Secondly, you need to be mindful when selecting the number of partitions for a topic. Too many partitions adds overhead and can slow the cluster down; too few caps throughput, because a consumer group can use at most one consumer per partition, so publishers and subscribers cannot be parallelised enough to keep up.
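
A common rule of thumb, not something Kafka itself prescribes, is to derive the partition count from your target throughput and the measured per-partition throughput of your producers and consumers; the figures below are purely illustrative.

    public class PartitionCountEstimate {
        public static void main(String[] args) {
            // Illustrative figures: 500 MB/s target through the topic, with each
            // partition sustaining ~50 MB/s on the producer side and ~25 MB/s on
            // the consumer side (measure these for your own workload).
            double targetMBps = 500;
            double producerMBpsPerPartition = 50;
            double consumerMBpsPerPartition = 25;

            long partitions = (long) Math.ceil(Math.max(
                    targetMBps / producerMBpsPerPartition,
                    targetMBps / consumerMBpsPerPartition));

            System.out.println("Suggested partition count: " + partitions); // 20
        }
    }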

Finally, replication will only deliver high availability and improved reliability if replicas are stored on different servers, racks, data centre halls or availability zones. Replication therefore requires a physical environment that can meet the business requirements.

Introducing Charmed Kafka

Canonical Charmed Kafka is an advanced solution for deploying, operating and maintaining Kafka, suitable for enterprise workloads. The solution includes a Canonical-backed distribution that tracks the upstream Kafka software and an automation suite that guides and assists with:

  • Automated deployment of Kafka clusters on cloud servers and in on-premise data centres
  • Configuration of Kafka clusters and the underlying host servers according to best practices
  • Securing the cluster with TLS, mTLS and SCRAM
  • Automation for scaling out Kafka clusters by adding additional cloud servers, as well as scaling the cluster back in by decommissioning servers when capacity is no longer required
  • Integration with the Canonical Observability Stack for logging, monitoring and alerting, with ready-to-go dashboards and alerts for Kafka
  • Automated upgrades

The entire Charmed Kafka solution is free, open source software. Customers can get paid support for the solution, backed by up to 10 years of coverage per stable track. Support includes break/fix, troubleshooting and security patching for critical and selected high-severity CVEs, and comes with a choice of weekday or 24/7 SLAs. Learn more about Charmed Kafka.

Summary

Apache Kafka offers a general-purpose backbone for your solution's data needs. Its partitioning and replication provide pragmatic routes to the enterprise reliability and scalability needed in any cloud environment, and it is flexible enough to be essential across many use cases. To optimise your deployment and improve its quality and economics, speak to our team today.

Further reading

Why we built a Spark solution for Kubernetes

Big data security foundations in five steps

Running MongoDB on Kubernetes
