Hi everyone. As part of learning Scala and big data technologies, I am working out how to integrate Apache Kafka, Spark Streaming, MongoDB and Twitter.

The idea is simple: a producer process pushes tweets from Twitter, as JSON, into a Kafka topic, and a separate Spark Streaming process pulls the data from that topic and stores it in a MongoDB instance.
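To make the producer side concrete, here is a minimal sketch of pushing one JSON string into a topic with the standard Kafka Java client. The broker address, the topic name `tweets`, the object name and the hard-coded JSON are all assumptions for illustration; the real project reads tweets from the Twitter API and may be organised differently.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TweetProducerSketch {
  // Hypothetical values; the real ones belong in the project's configuration.
  val Brokers = "localhost:9092"
  val Topic = "tweets"

  // Builds the minimal settings a String/String Kafka producer needs.
  def producerProps(brokers: String): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props
  }

  def main(args: Array[String]): Unit = {
    val producer = new KafkaProducer[String, String](producerProps(Brokers))
    try {
      // Stand-in for a tweet received from Twitter, already serialized as JSON.
      val tweetJson = """{"id": 1, "text": "hello kafka"}"""
      producer.send(new ProducerRecord[String, String](Topic, tweetJson))
    } finally {
      // close() waits for records that were already sent to complete.
      producer.close()
    }
  }
}
```

Sending the tweet as a plain string keeps the producer simple and leaves JSON parsing to whoever consumes the topic.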

For the streaming process I have adopted the second of the two Kafka integration approaches, the Direct approach. It is more convenient: if the producer dies, the streaming process keeps running and waits for it to come back before it continues processing tweets. With the first approach, the Receiver-based one, the streaming process would die if the producer died or had no more data to push into the topic.
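As a rough illustration of the Direct approach, the sketch below creates a direct stream with the `spark-streaming-kafka-0-10` integration. That choice of integration is an assumption; the project may use an older one, where `createDirectStream` has a different signature. The broker address, topic, group id and object name are hypothetical too, and the MongoDB write is left as a comment.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object DirectStreamSketch {
  // Consumer settings for String keys and values; all values are assumptions.
  def kafkaParams(brokers: String, groupId: String): Map[String, Object] =
    Map[String, Object](
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupId,
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("tweets-direct-stream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // No receiver is started: each batch asks Kafka for the offsets it should read.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](
        Seq("tweets"),
        kafkaParams("localhost:9092", "tweets-to-mongo"))
    )

    // Keep only the JSON payload of each record.
    stream.map(_.value).foreachRDD { rdd =>
      // A real job would write each tweet to MongoDB here; this sketch only prints it.
      // A batch with no new tweets yields an empty RDD and the job keeps running.
      rdd.foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```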

Before you can run the processes, edit the file src/main/resources/reference.conf and fill in your own settings.
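To give an idea of what such a file can hold and how it is read, here is a small sketch using the Typesafe Config library, which is what the `reference.conf` name suggests. Every key and value below is made up for illustration; check the actual file in the project for the real names.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object SettingsSketch {
  // Hypothetical keys and values, written in the same HOCON syntax as reference.conf.
  val exampleConf: String =
    """
      |twitter {
      |  consumerKey = "your-consumer-key"
      |  consumerSecret = "your-consumer-secret"
      |}
      |kafka {
      |  brokers = "localhost:9092"
      |  topic = "tweets"
      |}
      |mongo {
      |  uri = "mongodb://localhost:27017"
      |  database = "twitter"
      |}
    """.stripMargin

  // In a real project, ConfigFactory.load() would read reference.conf from the
  // classpath; parsing a string keeps this sketch self-contained.
  val config: Config = ConfigFactory.parseString(exampleConf)

  val kafkaBrokers: String = config.getString("kafka.brokers")
  val kafkaTopic: String = config.getString("kafka.topic")
  val mongoUri: String = config.getString("mongo.uri")

  def main(args: Array[String]): Unit =
    println(s"Reading topic $kafkaTopic from $kafkaBrokers, writing to $mongoUri")
}
```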

The code of the project is located here, so download it and have fun!


