Installing and using Kafka

What is Kafka?

Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It can be used to:

  1. Publish and Subscribe: read and write streams of records, like a messaging system.
  2. Process: Write scalable stream processing applications that react to events in real-time.
  3. Store: Store streams of data safely in a distributed, replicated, fault-tolerant cluster.

Use Cases

Here’s an excellent article that covers logs, distributed systems, data integration, real-time processing, and system building.
1. Messaging
Kafka works well as a replacement for a more traditional message broker. Compared to most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good fit for large-scale message-processing applications.

2. Website Activity Tracking
The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type.

3. Metrics
Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

4. Stream Processing
Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing.

5. Event Sourcing
Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka’s support for very large stored log data makes it an excellent backend for an application built in this style.

6. Commit Log
Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this usage, Kafka is similar to the Apache BookKeeper project.

Install, run and test Kafka

1. Install JDK

apt-get install default-jdk
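
If the command needs elevated privileges, prefix it with sudo. You can then confirm that a JDK is on the path:

java -version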

2. Get the latest binary release of Kafka from https://kafka.apache.org/downloads and use wget to download it on the server

wget https://www-eu.apache.org/dist/kafka/2.3.0/kafka_2.12-2.3.0.tgz

3. Extract the archive, then remove the tar file

tar -xvf kafka_2.12-2.3.0.tgz
rm kafka_2.12-2.3.0.tgz

That’s all you need to do to install Kafka. Now we are ready to run ZooKeeper and the Kafka server. First, navigate to the Kafka directory.
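
Assuming the archive extracted into a directory named after the release, that is:

cd kafka_2.12-2.3.0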

4. Start the ZooKeeper instance bundled with Kafka

bin/zookeeper-server-start.sh config/zookeeper.properties

You’ll notice the following output, showing that ZooKeeper listens on port 2181 by default:
binding to port 0.0.0.0/0.0.0.0:2181
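
ZooKeeper keeps running in the foreground, so open a second terminal for the next steps, or start it in the background instead (both start scripts accept a -daemon flag):

bin/zookeeper-server-start.sh -daemon config/zookeeper.properties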

5. Start the Kafka server

bin/kafka-server-start.sh config/server.properties
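
The broker is now listening on localhost:9092. The default configuration auto-creates topics on first use, but you can also create the demo topic used below explicitly, which lets you control partitions and replication:

bin/kafka-topics.sh --bootstrap-server localhost:9092 --create --topic MyFirstTopic --partitions 1 --replication-factor 1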

6. In a real application you would use Kafka’s producer and consumer client APIs (or Kafka Connect for integrating external systems), but for the sake of demonstration let’s use the console producer and console consumer to publish and read messages from the topic. Every line entered in the console is treated as a new message.

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic MyFirstTopic
>{"key":"value"}
>{"key1":"value1"}

7. Start the Kafka command-line consumer to print the messages on the console. The --from-beginning flag tells it to read the topic from the start, so the messages published above appear as well.

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic MyFirstTopic --from-beginning
{"key":"value"}
{"key1":"value1"}