Building both a batch and a streaming data pipeline using Kafka, Docker, Spark and AWS.
In this project, we created continuous streams of data using Python's Faker library. These streams are pushed to a Kafka topic deployed on a remote Docker container. As the data arrives in the Kafka topic, it is transformed with Apache Spark Structured Streaming, converted to a DataFrame, and written to an Amazon S3 bucket as CSV part files. The CSV files in the S3 bucket are then processed by AWS Glue on a daily basis (batch), and the processed data is moved into another S3 bucket.
Create a Kafka topic from the Docker container. You can start the Docker container by running
docker-compose up -d
To check whether the container is running, run docker ps, which lists all running containers.
From its output, you can get the container ID, container name, exposed ports and so on.
You can create the Kafka topic by opening a shell inside the Docker container:
docker exec -it {container_id} /bin/bash
This command opens an interactive shell inside the container, giving you access to its filesystem. Change directory into /opt/bitnami/kafka/bin and run the command
kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --topic stream_batch_topic \
  --partitions 3 \
  --replication-factor 1 \
  --create
Here, we created a topic named stream_batch_topic with 3 partitions and a replication factor of 1.
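If you prefer to manage topics from Python instead of the shell script, a minimal sketch using the kafka-python admin client could look like the following. The bootstrap address and topic settings mirror the command above and are assumptions about your setup:

```python
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

# Assumes the broker inside the Docker container is reachable on localhost:9092
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

try:
    # Same settings as the kafka-topics.sh command above
    admin.create_topics(
        [NewTopic(name="stream_batch_topic", num_partitions=3, replication_factor=1)]
    )
except TopicAlreadyExistsError:
    pass  # Topic was already created, e.g. via kafka-topics.sh

# Verify that the topic now exists
print(admin.list_topics())
```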
The code for producing messages to this topic is in streaming_pipeline/producer.py
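The actual producer lives in streaming_pipeline/producer.py. As a rough sketch, a Faker-based producer that sends JSON records to stream_batch_topic with kafka-python might look like this; the field names and the one-second interval are illustrative assumptions, not the exact schema used in the project:

```python
import json
import time

from faker import Faker
from kafka import KafkaProducer  # kafka-python

fake = Faker()

# Serialize each record as UTF-8 encoded JSON before sending it to Kafka
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Generate a fake record; real fields are defined in producer.py
    record = {
        "name": fake.name(),
        "address": fake.address(),
        "created_at": fake.iso8601(),
    }
    producer.send("stream_batch_topic", value=record)
    producer.flush()
    time.sleep(1)  # one message per second, purely illustrative
```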
spark_consumer.py was used to push the live streams into the S3 bucket. You can modify the .env file and put your bucket name and details there.
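For reference, a stripped-down version of such a Structured Streaming consumer might look like the sketch below. The schema fields, the S3_BUCKET environment variable and the output paths are placeholders; the real values come from spark_consumer.py and the .env file, and the connector package versions must match your Spark build:

```python
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Placeholder schema; the real one matches the records produced by producer.py
schema = StructType([
    StructField("name", StringType()),
    StructField("address", StringType()),
    StructField("created_at", StringType()),
])

spark = (
    SparkSession.builder
    .appName("stream_batch_consumer")
    # Kafka connector and S3 filesystem support; versions are examples only
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "org.apache.hadoop:hadoop-aws:3.3.4",
    )
    .getOrCreate()
)

# Read the live stream from the Kafka topic created earlier
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "stream_batch_topic")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into columns
parsed = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Write CSV part files into the S3 bucket configured in .env (placeholder variable name)
bucket = os.environ["S3_BUCKET"]
query = (
    parsed.writeStream
    .format("csv")
    .option("path", f"s3a://{bucket}/streamed/")
    .option("checkpointLocation", f"s3a://{bucket}/checkpoints/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```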
The output stream in the S3 bucket is shown in the image below.

An AWS Lambda trigger was then created so that any file dropped into S3, on failure or success, publishes a notification to an AWS SNS topic.
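A minimal sketch of such a Lambda handler, assuming it is wired to the S3 event notification and the SNS topic ARN is supplied through an environment variable (both names here are hypothetical):

```python
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]  # assumed environment variable


def lambda_handler(event, context):
    # Each S3 event can carry several records (one per object)
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        message = {
            "bucket": bucket,
            "key": key,
            "event": record["eventName"],
        }
        # Forward the event details to the SNS topic
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="New object in S3",
            Message=json.dumps(message),
        )
    return {"statusCode": 200}
```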

A glue job was also created to process the streamed s3 data daily in batches and upload it to another s3 bucket.
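The Glue job itself is a PySpark script; a simplified sketch of the daily batch step might look like the following, where the source and destination bucket paths are placeholders for the ones configured in the actual job:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV part files written by the streaming consumer (placeholder path)
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<streamed-bucket>/streamed/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Example transformation only: drop exact duplicate rows before re-writing
deduped = source.toDF().dropDuplicates()

# Write the processed data to the destination bucket (placeholder path)
deduped.write.mode("append").csv("s3://<processed-bucket>/daily/", header=True)

job.commit()
```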

