Docker is my favorite technology of all time. Well, not just Docker; containers in general are very cool, and something I reach for with almost every new technology I want to learn. Very quick, very easy, very efficient. Of course there is a learning curve, but once you get over it, it's pretty fucking awesome.
So for this blog entry I am going to walk through the steps I took to set up a simple containerized pipeline using Apache Kafka and Apache Spark.
I decided to use Bitnami images for Kafka and Spark, as they come preconfigured with many of the settings. A year ago the images I remember using were wurstmeister's, for both Kafka and ZooKeeper. But recently I learned that there's something called KRaft (Kafka Raft Metadata mode), which was introduced to remove Apache Kafka's dependency on ZooKeeper for metadata management.
First service in the docker-compose file: the kafka service.
services:
  kafka:
    image: "bitnami/kafka"
    container_name: kafka
    ports:
      - "9092:9092"
    environment:
      - KAFKA_CFG_NODE_ID=0
      - KAFKA_CFG_PROCESS_ROLES=controller,broker
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
This runs in KRaft mode, meaning it needs no ZooKeeper. It is also a very basic Kafka setup: a single broker, and since this is for learning purposes, it is not secured. We could run multiple brokers for replication. Port 9092 is published to the host, so we can interact with Kafka from the host machine as well (though since the advertised listener is kafka:9092, a host-side client may need an /etc/hosts entry mapping kafka to 127.0.0.1 to resolve that name). The container name is kafka.
I prefer using docker-compose as it is cleaner and more concise than running plain docker commands in the shell. In a multi-container setup it also deals with the networking aspects by itself; we can create custom networks, of course, but it's just more efficient this way.
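With the compose file in place, bringing the stack up is a single command (this assumes the file is saved as docker-compose.yml in the current directory):

```shell
# Start all services defined in docker-compose.yml, detached
docker compose up -d

# Tail the broker logs to watch it come up
docker logs -f kafka
```

On older installs the command is `docker-compose up -d` instead of the `docker compose` plugin syntax.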
Once the logs show it, the Kafka Raft server has started.
Creating a topic and publishing to it using the shell.
docker exec -it kafka kafka-topics.sh --create \
  --topic TEST \
  --bootstrap-server kafka:9092 \
  --partitions 1 \
  --replication-factor 1
The replication factor can only be 1 here, since we have only one broker.
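To confirm the topic was actually created, the same kafka-topics.sh script can list and describe topics on the broker:

```shell
# List all topics known to the broker
docker exec -it kafka kafka-topics.sh --list \
  --bootstrap-server kafka:9092

# Show partition count, leader, and replicas for TEST
docker exec -it kafka kafka-topics.sh --describe \
  --topic TEST \
  --bootstrap-server kafka:9092
```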
Publishing to the TEST topic:
docker exec -it kafka kafka-console-producer.sh \
  --bootstrap-server kafka:9092 \
  --topic TEST
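The console producer reads lines from stdin, so instead of typing into the interactive prompt you can also pipe a message in non-interactively (note the -t flag is dropped since there is no TTY when piping):

```shell
# Publish a single message to TEST by piping it into the producer
echo "hello from the host" | docker exec -i kafka kafka-console-producer.sh \
  --bootstrap-server kafka:9092 \
  --topic TEST
```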
Subscribing to the TEST topic:
docker exec -it kafka kafka-console-consumer.sh \
  --bootstrap-server kafka:9092 \
  --topic TEST
You can add --from-beginning to the consumer command to replay all the messages published to the topic from the start.
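For example, re-running the consumer with that flag replays everything in TEST and then keeps listening for new messages:

```shell
# Consume every message in TEST from the earliest offset onward
docker exec -it kafka kafka-console-consumer.sh \
  --bootstrap-server kafka:9092 \
  --topic TEST \
  --from-beginning
```

Without the flag, the console consumer only shows messages produced after it connects.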