Today, we all want websites and applications to respond to our requests quickly, if not immediately. As the usage of an application grows, it becomes more challenging to provide customers with short response times. Caching has been the go-to strategy for many of these problems, but in many cases applications need to work with real-time data, either because they have to react quickly (as fraud detection systems do) or because the system is just one piece of a larger integration. Platforms like Kafka come in handy in these types of use cases.
Kafka has become popular at companies like LinkedIn, Netflix, and Spotify. Netflix, for example, uses Kafka for real-time monitoring and as part of its data processing pipeline. In today’s post, I’m going to briefly explain what Kafka is and list a few use cases for building real-time streaming applications and data pipelines.
What’s Kafka?
Kafka is a distributed streaming platform that started at LinkedIn. It went open source in 2011, and it’s been part of the Apache Software Foundation since 2012. There’s also a company called Confluent that offers enterprise solutions for the Kafka ecosystem; Confluent is the company with the most contributors to the Kafka project.
Kafka is a platform where you can publish data or subscribe to read data, much like a message queue. But it’s more than that. All its data is stored in a fault-tolerant way, and you can process it in real time. You can therefore use Kafka to build data pipelines that move data in real time, and it’s also a perfect fit when applications need to react quickly to specific events. For example, you could transform your traditional extract-transform-load (ETL) system into a live streaming data pipeline with Kafka.
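To make the publish side concrete, here’s a minimal sketch using Kafka’s Java client. The broker address and the topic name are placeholders I made up for the example.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        // Broker address and topic name are placeholders for this sketch.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event; any consumer subscribed to the topic will receive it.
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /pricing"));
        }
    }
}
```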
You don’t have to decide ahead of time where the data is going or what to do with it once it’s in Kafka. But let me give you a few examples of where Kafka is a good option.
Track User Behavior
You can track all user activity on your websites: the products a user saw, which products were added to the cart without a purchase, or even how much time a user spends on a page. But the shopping cart example is too traditional.
What about a pay-per-click system like Google Ads? Kafka is a perfect match. Let me explain it in more detail.
When the ads are displayed to the user, you can track how many advertisements the user saw, in which position, and under which search criteria the ads were chosen. You’d send that tracking context asynchronously to Kafka, so users won’t notice it. Then, when a user clicks on an ad, you can redirect the user immediately to the advertiser’s site and send another tracking event to Kafka, again asynchronously. Once the data is in Kafka, you can move it to a Hadoop cluster for further analysis, or consume it in real time to adjust ads based on performance.
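Here’s a rough sketch of what that asynchronous tracking could look like with the Java client. The topic name, the event layout, and the trackImpression method are all made up for illustration; the point is that send() buffers the record and returns right away.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Assumes a topic named "ad-impressions" and an already-configured producer
// (see the earlier sketch); both names are placeholders.
public class AdTracker {
    private final KafkaProducer<String, String> producer;

    public AdTracker(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void trackImpression(String userId, String adId, int position, String query) {
        String event = String.format("{\"ad\":\"%s\",\"position\":%d,\"query\":\"%s\"}",
                adId, position, query);
        // send() is asynchronous: it buffers the record and returns immediately,
        // so the page request is never blocked waiting for Kafka.
        producer.send(new ProducerRecord<>("ad-impressions", userId, event),
            (metadata, exception) -> {
                if (exception != null) {
                    // Log and move on; losing one tracking event beats slowing the user down.
                    System.err.println("Failed to record impression: " + exception.getMessage());
                }
            });
    }
}
```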
Communication Between Services
A streaming platform like Kafka also comes in handy when you want to improve performance through asynchronous communication between applications, as in a microservices architecture.
There are going to be workflows where you don’t need the response from a microservice right away. Let’s say that a user places an order. The system sends the order event to Kafka, and another application processes the payment. If the payment is successful, the payment microservice sends another event to a second topic in Kafka. Yet another microservice consumes events from that topic, sends an email confirmation, and emits an event to start shipping the product. In this scenario, Kafka works as a message queue, much like RabbitMQ or AWS Kinesis.
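Here’s a sketch of what the payment microservice’s side of that workflow might look like. The topic names ("orders" and "payments") and the chargeCard() helper are hypothetical; a real service would also handle failures and commit offsets deliberately.

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PaymentService {
    public static void run(KafkaConsumer<String, String> consumer,
                           KafkaProducer<String, String> producer) {
        consumer.subscribe(Collections.singletonList("orders"));
        while (true) {
            // Read new order events as they arrive.
            for (ConsumerRecord<String, String> order : consumer.poll(Duration.ofMillis(500))) {
                if (chargeCard(order.value())) {
                    // Downstream services (email, shipping) subscribe to "payments".
                    producer.send(new ProducerRecord<>("payments", order.key(), "payment-succeeded"));
                }
            }
        }
    }

    private static boolean chargeCard(String orderJson) {
        return true; // placeholder for a real payment gateway call
    }
}
```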
In a microservices architecture, each microservice uses Kafka’s publish and subscribe mechanisms to interact with the others. With Kafka, you can decouple the architecture, so that if part of the system fails, the user might not even notice; the worst case might be a delayed confirmation email, for example.
Processing Data in Real-Time
A widespread use case for Kafka is working with events in real time.
Banks, or any other system where someone can lose money to fraud, need to react in real time. Let’s say that a credit card is used to purchase products on different sites around the world. Every transaction can be sent to a Kafka topic, and based on the location of each transaction, the system can decide to block the credit card momentarily; purchases made from different parts of the world within a short window are an indication of suspicious activity. For this to work, an application needs to subscribe to the Kafka topic and read the data in real time, so it can make decisions rapidly and limit further illegitimate spending.
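Here’s a deliberately naive sketch of such a subscriber. The "transactions" topic, the record layout (card id as key, country code as value), and the blockCard() helper are all assumptions for illustration; a real system would use a smarter rule, such as comparing locations within a time window.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FraudDetector {
    public static void run(KafkaConsumer<String, String> consumer) {
        // Remember the last country seen per card (the record key).
        Map<String, String> lastCountryByCard = new HashMap<>();
        consumer.subscribe(Collections.singletonList("transactions"));
        while (true) {
            for (ConsumerRecord<String, String> tx : consumer.poll(Duration.ofMillis(100))) {
                String country = tx.value(); // assume the value carries a country code
                String previous = lastCountryByCard.put(tx.key(), country);
                if (previous != null && !previous.equals(country)) {
                    blockCard(tx.key()); // suspicious: two countries back to back
                }
            }
        }
    }

    private static void blockCard(String cardId) {
        System.out.println("Temporarily blocking card " + cardId);
    }
}
```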
Another benefit of using Kafka is that you don’t need to build real-time subscribers from the beginning. Once events are flowing into Kafka, you can defer the decision of what to do with the data and how to process it until later. For example, you can use Kafka to migrate from a batch-processing pipeline to a real-time pipeline.
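For instance, a consumer group you add long after events started flowing can still begin from the oldest data the topic retains (subject to the topic’s retention settings) by setting auto.offset.reset to earliest. A minimal sketch, with placeholder broker address and group id:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LateSubscriber {
    // A consumer group created months after the data started flowing can still
    // read from the oldest retained event: set auto.offset.reset=earliest.
    static KafkaConsumer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "late-arriving-analytics");
        props.put("auto.offset.reset", "earliest"); // start at the beginning of retained data
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return new KafkaConsumer<>(props);
    }
}
```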
IoT Data Analysis
Another scenario is to use Kafka as the central location to send and read data from IoT devices.
You could have several IoT devices sending data to Kafka, much like users visiting your website. If the fleet of devices grows, you can scale Kafka out or in to handle peak loads of traffic. Let’s say that you put an IoT device in every train in a city. Each device sends information about the train, for instance sensor readings about critical parts of the engine. An application can subscribe to a Kafka topic and programmatically raise an alert when a sensor reading exceeds a certain threshold; a train could then be flagged as a candidate for maintenance and temporarily taken out of service for safety reasons.
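Here’s a sketch of such an alerting subscriber, assuming (purely for illustration) that each record’s key is a train id and its value is a numeric engine-temperature reading:

```java
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EngineMonitor {
    private static final double MAX_TEMPERATURE = 90.0; // made-up threshold

    public static void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("train-telemetry"));
        while (true) {
            for (ConsumerRecord<String, String> reading : consumer.poll(Duration.ofSeconds(1))) {
                // Flag the train as soon as a reading crosses the threshold.
                if (Double.parseDouble(reading.value()) > MAX_TEMPERATURE) {
                    System.out.printf("Flag train %s for maintenance%n", reading.key());
                }
            }
        }
    }
}
```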
You wouldn’t want to wait for the end of the day to get all the information from each device. It might be too late. That’s why Kafka’s real-time capability is so valuable.
There’s a minor hitch. IoT devices need a small code footprint, and using a conventional Kafka client (i.e., a Java library) won’t work because those libraries are usually big. For that reason, these types of devices publish messages using the Message Queuing Telemetry Transport (MQTT) protocol instead. You’d configure an MQTT broker to receive data from the devices and then connect Kafka to that broker, for example with a Kafka Connect MQTT connector or a broker like HiveMQ that integrates with Kafka.
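As a rough idea, a source connector configuration submitted to the Kafka Connect REST API might look something like the following. I’m using Confluent’s MQTT source connector class as an example; the exact property names depend on the connector and its version, and all the values here are placeholders.

```json
{
  "name": "mqtt-source",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "1",
    "mqtt.server.uri": "tcp://mqtt-broker:1883",
    "mqtt.topics": "trains/+/engine",
    "kafka.topic": "train-telemetry"
  }
}
```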
Again, once the data is in Kafka, you can analyze it in several ways.
Centralize Raw Logs Data
Kafka is also a good fit as a transport layer for raw log data.
Even though I’d always recommend using a centralized storage location for logs, Kafka comes in handy when you need to distribute the data for different purposes. Let’s say that you’re not happy with your current log aggregation solution and want to change it. Instead of only programming the switch, you can anticipate future changes too: use Kafka as a transport point where applications always send their log data to Kafka topics, and then decide what to do with the data. Write a consumer that aggregates the data in real time and automates alerts, or distribute the log data to several platforms at the same time.
When you’re happy with a solution, you can cut off the other alternatives without changing a line of code in the applications that send log data to Kafka. You’d be decoupling the log-processing layer from the log-producing layer.
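The trick that makes this fan-out cheap is consumer groups: each destination reads the same topic under its own group.id and receives the full stream independently. A sketch, with placeholder names:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LogFanOut {
    // Each destination gets the complete log stream by using its own group.id;
    // the applications producing to "app-logs" never change.
    static KafkaConsumer<String, String> subscriber(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", groupId); // e.g. "search-indexer" or "alerting"
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("app-logs"));
        return consumer;
    }
}
```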
A Distributed Streaming Platform
Kafka is well known for providing excellent performance at any scale, strong data durability, and low latency. Many of the use cases I discussed throughout this post implement similar solutions; the difference is how each application interacts with Kafka and at what point in the data pipeline Kafka comes onto the scene. You can use Kafka as a messaging system, a storage system, or a stream-processing platform. You’ll be able to work with data from the past and the future in the same way, transform the data as it arrives, and, more importantly, query the data in real time. Moreover, you can start treating your logs as events for further analysis, especially when you’re debugging production systems of any type.
It’s hard to say that Kafka fits only the use cases I mentioned in this post; there are bound to be interesting stories of other applications out there.
Do you know any other interesting use case for Kafka?