Recently I have been very busy for health reasons: taking care of myself, going to the hospital to receive treatment, and also looking after my parents, who are getting on in years. Because of that I have not been able to devote as much time to the blog as I would like, but I have managed to find some time to read and to program a little.

I’ve been reading a book about Kafka that isn’t on sale yet, and programming a bit to see how you can use Kafka with several data streaming technologies: Apache Spark, Apache Flink and Kafka Streams.

What are the differences between them? Spark is a general-purpose distributed processing framework that includes streaming tools and machine learning tools, while Apache Flink is roughly the equivalent of just the data streaming part of Spark. The key difference is that Spark performs streaming through in-memory micro-batches, working in near real time, whereas Flink processes the data as soon as it reaches the node, allowing true real-time processing.
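To make the contrast concrete, here is a minimal sketch of a Flink job that consumes a Kafka topic and transforms each record as it arrives. The broker address, topic name and consumer group are assumptions of mine, and the FlinkKafkaConsumer connector class name varies by Flink version, so treat this as illustrative rather than as the book's code:

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object FlinkKafkaExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092") // assumed broker address
    props.setProperty("group.id", "flink-demo")              // assumed consumer group

    // Records are handed to the operators one by one as they arrive at the node,
    // instead of being buffered into micro-batches the way Spark does it.
    val stream: DataStream[String] = env.addSource(
      new FlinkKafkaConsumer[String]("input-topic", new SimpleStringSchema(), props))

    stream
      .map(_.toUpperCase) // a trivial per-record transformation
      .print()

    env.execute("flink-kafka-example")
  }
}
```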

Each one has its advantages and disadvantages. The main advantage of Flink over Spark is performance: when doing data streaming it is much faster than Spark, mainly because of Spark’s micro-batch nature versus Flink’s real-time processing. Disadvantages? It is only used for data streaming, its development API is not as mature as Spark’s, and it does not let you work natively with Parquet or Avro files, although that can be achieved by adding a third-party dependency to pom.xml or build.sbt, as sketched below. Also, since it only serves to stream data, it would be nice if it also included deep learning and machine learning libraries.
Personally I have spent more time with Apache Spark, so I prefer to use it whenever I have to do distributed processing on a cluster.
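As an illustration of how that Parquet and Avro support could be pulled in, here is a hypothetical build.sbt fragment. The artifact names and the Flink version are assumptions; some of these artifacts carry a Scala version suffix in certain Flink releases, so check Maven Central for the coordinates that match yours:

```scala
// build.sbt -- sketch only: adding Avro and Parquet format support to a Flink project.
val flinkVersion = "1.8.0" // assumed version, adjust to your cluster

libraryDependencies ++= Seq(
  "org.apache.flink" % "flink-avro"    % flinkVersion, // Avro (de)serialization support
  "org.apache.flink" % "flink-parquet" % flinkVersion  // Parquet input/output formats
)
```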

Kafka Streams is somewhat different from the other two, because it mainly serves to connect different Kafka topics to each other in a unified way: with the same API, in the same program, you can process and transform data between different Kafka topics without having to resort to third-party technologies such as Spark, Flink or Storm.
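As a rough sketch of that idea, a single Kafka Streams application can consume one topic, transform the records, and produce the result to another topic. The topic names and application id below are made up for illustration, and the code assumes the kafka-streams-scala artifact is on the classpath:

```scala
import java.util.Properties

import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object KafkaStreamsPipe {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-pipe")    // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker

    val builder = new StreamsBuilder()
    builder
      .stream[String, String]("input-topic") // read from one Kafka topic...
      .mapValues(_.toUpperCase)              // ...transform each value...
      .to("output-topic")                    // ...and write to another topic

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```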

The code is in Scala. The master branch is dedicated to how Kafka interacts with Spark and with Flink, and there is a branch called chapter3 with code that shows how to configure a Kafka Streams example. The reason Kafka Streams is not included in the master branch is that Kafka Streams is not compatible with Apache Spark at runtime, i.e. the program breaks at runtime, because Apache Spark 2.4.1 uses a slightly outdated version of jackson-core, 2.6.7.1, while Kafka Streams uses a more recent one, 2.9.8.
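If you wanted to try making them coexist anyway, one mitigation worth knowing about is sbt's dependencyOverrides, which forces a single Jackson version across the whole build. I am sketching it here as a hypothetical; the fact that the repository keeps the two on separate branches suggests it is not a clean fix for this particular clash:

```scala
// build.sbt -- sketch only: force one Jackson version across the build.
// Not guaranteed to keep both Spark 2.4.1 and Kafka Streams happy at runtime.
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core"   %  "jackson-core"         % "2.9.8",
  "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.9.8",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.9.8"
)
```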

In this file I have included a compilation of tips on how to programmatically configure a spark-streaming process with Kafka, along with things to take into account when adapting your Spark job to your specific cluster, since each cluster has its own memory and core characteristics. Spark, like Kafka, can be configured programmatically, but Kafka has certain settings that can only be changed by editing its configuration file, server.properties.
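To give a flavour of what that programmatic configuration looks like, here is a minimal Spark Structured Streaming sketch that reads from Kafka. It assumes the spark-sql-kafka-0-10 dependency is on the classpath; the master URL, memory and cores values, broker address and topic name are all placeholders to adapt to your own cluster:

```scala
import org.apache.spark.sql.SparkSession

object SparkKafkaStream {
  def main(args: Array[String]): Unit = {
    // Cluster-specific settings set programmatically; placeholder values.
    val spark = SparkSession.builder()
      .appName("spark-kafka-streaming")
      .master("local[*]")                    // replace with your cluster master
      .config("spark.executor.memory", "2g") // adjust to your nodes' memory
      .config("spark.executor.cores", "2")   // adjust to your nodes' cores
      .getOrCreate()

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "input-topic")                  // assumed topic
      .option("startingOffsets", "latest")
      .load()

    // Kafka delivers key and value as binary; cast them before processing.
    val lines = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    val query = lines.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```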

Over time, as I read more chapters, I will include more code in the repository.

Have fun, stay healthy and keep learning; we’ll read each other soon.
