Building both a batch and a streaming data pipeline using Kafka, Docker, Spark and AWS.
In this project, we created continuous streams of data using Python's Faker library. These streams are pushed to a Kafka topic deployed on a remote Docker container. As the data arrives in the Kafka topic, it is transformed with Apache Spark Structured Streaming, converted to a DataFrame, and written to an Amazon S3 bucket as CSV part files. The CSV files in the S3 bucket are then processed by AWS Glue on a daily basis (batch), and the processed data is moved into another S3 bucket.
Create a Kafka topic from the Docker container. You can start the Docker container by running
docker-compose up -d
To check whether the container is running, run docker ps, which lists all running containers.
From its output, you can get the container ID, container name, exposed ports and so on.
You can create the Kafka topic by opening a shell inside the Docker container:
docker exec -it {container_id} /bin/bash
This command opens an interactive shell inside the container, giving you access to its filesystem. Change directory into /opt/bitnami/kafka/bin and run the command
kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --topic stream_batch_topic \
  --partitions 3 \
  --replication-factor 1 \
  --create
Here, we created a topic named stream_batch_topic with 3 partitions and a replication factor of 1.
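If you prefer to manage topics from Python instead of the shell script, a minimal sketch using the kafka-python admin client could look like the following. The bootstrap address and topic settings mirror the command above and are assumptions about your setup:

```python
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

# Assumes the broker inside the Docker container is reachable on localhost:9092
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

try:
    # Same settings as the kafka-topics.sh command above
    admin.create_topics(
        [NewTopic(name="stream_batch_topic", num_partitions=3, replication_factor=1)]
    )
except TopicAlreadyExistsError:
    pass  # Topic was already created, e.g. via kafka-topics.sh

# Verify that the topic now exists
print(admin.list_topics())
```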
The code for producing messages to this topic is in streaming_pipeline/producer.py
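The actual producer lives in streaming_pipeline/producer.py. As a rough sketch, a Faker-based producer that sends JSON records to stream_batch_topic with kafka-python might look like this; the field names and the one-second interval are illustrative assumptions, not the exact schema used in the project:

```python
import json
import time

from faker import Faker
from kafka import KafkaProducer  # kafka-python

fake = Faker()

# Serialize each record as UTF-8 encoded JSON before sending it to Kafka
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Generate a fake record; real fields are defined in producer.py
    record = {
        "name": fake.name(),
        "address": fake.address(),
        "created_at": fake.iso8601(),
    }
    producer.send("stream_batch_topic", value=record)
    producer.flush()
    time.sleep(1)  # one message per second, purely illustrative
```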
spark_consumer.py was used to push the live streams into the S3 bucket. You can modify the .env file and put your bucket name and details there.
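For reference, a stripped-down version of such a Structured Streaming consumer might look like the sketch below. The schema fields, the S3_BUCKET environment variable and the output paths are placeholders; the real values come from spark_consumer.py and the .env file, and the connector package versions must match your Spark build:

```python
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Placeholder schema; the real one matches the records produced by producer.py
schema = StructType([
    StructField("name", StringType()),
    StructField("address", StringType()),
    StructField("created_at", StringType()),
])

spark = (
    SparkSession.builder
    .appName("stream_batch_consumer")
    # Kafka connector and S3 filesystem support; versions are examples only
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "org.apache.hadoop:hadoop-aws:3.3.4",
    )
    .getOrCreate()
)

# Read the live stream from the Kafka topic created earlier
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "stream_batch_topic")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; parse the JSON value into columns
parsed = (
    raw.select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Write CSV part files into the S3 bucket configured in .env (placeholder variable name)
bucket = os.environ["S3_BUCKET"]
query = (
    parsed.writeStream
    .format("csv")
    .option("path", f"s3a://{bucket}/streamed/")
    .option("checkpointLocation", f"s3a://{bucket}/checkpoints/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```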
The output stream in the S3 bucket is shown in the image below.

An AWS Lambda trigger was then created so that any file dropped into S3, on failure or success, publishes a notification to an AWS SNS topic.
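A minimal sketch of such a Lambda handler, assuming it is wired to the S3 event notification and the SNS topic ARN is supplied through an environment variable (both names here are hypothetical):

```python
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]  # assumed environment variable


def lambda_handler(event, context):
    # Each S3 event can carry several records (one per object)
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        message = {
            "bucket": bucket,
            "key": key,
            "event": record["eventName"],
        }
        # Forward the event details to the SNS topic
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="New object in S3",
            Message=json.dumps(message),
        )
    return {"statusCode": 200}
```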

A glue job was also created to process the streamed s3 data daily in batches and upload it to another s3 bucket.
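The Glue job itself is a PySpark script; a simplified sketch of the daily batch step might look like the following, where the source and destination bucket paths are placeholders for the ones configured in the actual job:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the CSV part files written by the streaming consumer (placeholder path)
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<streamed-bucket>/streamed/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Example transformation only: drop exact duplicate rows before re-writing
deduped = source.toDF().dropDuplicates()

# Write the processed data to the destination bucket (placeholder path)
deduped.write.mode("append").csv("s3://<processed-bucket>/daily/", header=True)

job.commit()
```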

