I have been thinking about my time at my last job, at a company that provides its clients with software built on daily ELT and ETL processing of antenna data: Avro files are produced in the landing stage, then consolidated and aggregated with Spark into Parquet files, and finally exposed as tables in Apache Impala, which then serve as the source of truth or data source for a series of classic microservices written with the usual Spring Boot stack on PostgreSQL.
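
A minimal sketch of that daily batch step, assuming a date-partitioned Avro landing zone; the paths, column names and aggregation are my own illustrative assumptions, not the client's actual code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Requires the spark-avro package on the classpath
spark = SparkSession.builder.appName("daily-antenna-batch").getOrCreate()

# Hypothetical landing zone layout: one folder of Avro files per day
raw = spark.read.format("avro").load("/landing/antennas/2024-01-15/")

# Consolidate and aggregate per antenna and day (assumed column names)
daily = (raw
         .groupBy("antenna_id", F.to_date("event_time").alias("event_date"))
         .agg(F.count("*").alias("events"),
              F.avg("signal_strength").alias("avg_signal")))

# Partitioned Parquet output that Impala later exposes as external tables
(daily.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("/warehouse/antenna_daily/"))
```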

All of this goes through a CI/CD process with the classic stack: Git, Maven, Jenkins, tests of every kind (including regression), SonarQube analysis, Docker image creation, deployment to a pre-production environment, monitoring, and finally email notifications to everyone involved.

We could talk about that CI/CD process (in my opinion it is incomplete, since it lacks steps to verify the integrity and security of the containers it produces), but today I prefer to focus on this classic big data architecture for data collection.

I know this setup for daily processing of antenna data is a classic, because I had already seen it at another client. Now that I know a bit more, I realize the proposal is not quite right: the dashboards need to be able to show real-time data, and at the same time that data must be consistent, tolerant to network partitions and, if possible, available even when some node goes down. All of this in a distributed environment, because we are talking about very large files, potentially terabytes of data per day, which means we need a cluster where we can distribute both the processing load and the partitioned data.

The classic initial proposal is essentially batch processing: we only ever saw antenna data from the previous day, or at best from a few hours earlier, never in real time. That was mostly because the purpose of the solution is to know where there is a problem, and it was not considered imperative to know in real time where that problem is happening.

A proposal:

Instead of using Spark to process and aggregate the antenna data and then storing it in Apache Impala, a better solution could be to use Spark Streaming/Structured Streaming to aggregate the data as it arrives from the different antennas. Today those files are usually spread across several file systems and consolidated beforehand in some external system, so ideally I would cut that step and connect the Spark consumer directly to the antenna data producer, then run the data through the processing and aggregation phases and finally write the Parquet files, so that the sources of truth live in Delta Lake. Something like this already works today, but with Apache Impala, which in my opinion is a mistake if you want consistent distributed data: Apache Impala is not designed for that purpose. Delta Lake is, providing distributed ACID read transactions, while Apache Hudi runs in the Hadoop cluster to allow inserts, updates and deletes on those tables. That way you get the best of all worlds: real-time processing with Spark Streaming, distributed ACID reads with Delta Lake, and distributed ACID updates with Hudi.
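
A minimal sketch of that streaming leg, assuming the antenna producer is reachable through Kafka; the topic name, schema and paths are invented for illustration, since the actual source system is not named here:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Requires the spark-sql-kafka and delta-spark packages on the classpath
spark = SparkSession.builder.appName("antenna-streaming").getOrCreate()

schema = StructType([
    StructField("antenna_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("signal_strength", DoubleType()),
])

# Consume directly from the (hypothetical) antenna event topic
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "antenna-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Five-minute aggregates per antenna; the watermark lets finished windows be emitted
agg = (events
       .withWatermark("event_time", "10 minutes")
       .groupBy(F.window("event_time", "5 minutes"), "antenna_id")
       .agg(F.avg("signal_strength").alias("avg_signal"),
            F.count("*").alias("events")))

# The Delta table becomes the distributed, ACID source of truth
(agg.writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", "/checkpoints/antenna_agg")
 .start("/lake/antenna_agg"))
```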

If inconsistencies never crept in during the initial creation of the Parquet files, something like Apache Hudi would not be necessary. But the truth is that they do happen, in my opinion because of the distributed nature of the business and perhaps also because of a certain laxity in following SOLID principles when we build this software.
Hudi would be there to find and fix those inconsistencies once the data is already in the lake, in a preparation phase.
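
A minimal sketch of that preparation phase, assuming the mutable copy lives in Hudi's own table format and that a hypothetical corrections dataset (with a record_id key and updated_at column) has already been recomputed upstream:

```python
from pyspark.sql import SparkSession

# Requires the hudi-spark bundle on the classpath
spark = SparkSession.builder.appName("antenna-corrections").getOrCreate()

# Hypothetical recomputed rows that should replace inconsistent ones
corrections = spark.read.parquet("/staging/antenna_corrections/")

hudi_options = {
    "hoodie.table.name": "antenna_agg_hudi",
    "hoodie.datasource.write.recordkey.field": "record_id",    # assumed business key
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest version wins
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert: existing keys are updated in place, new keys are inserted
(corrections.write
 .format("hudi")
 .options(**hudi_options)
 .mode("append")
 .save("/lake/antenna_agg_hudi"))
```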

In short: move from a batch architecture to a batch/real-time lambda architecture, so that if a specific day needs to be reprocessed it still can be, while normal operation is based on streaming data. A kappa-style architecture, with real-time processing only, gives up the ability to reprocess data from the past, and that is something companies usually do not want to lose.
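
A minimal sketch of the batch (reprocessing) leg, reusing the same hypothetical landing zone and an assumed event_date partition column, and relying on Delta's replaceWhere to swap out just the affected day atomically:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Requires the spark-avro and delta-spark packages on the classpath
spark = SparkSession.builder.appName("antenna-reprocess").getOrCreate()

day = "2024-01-15"  # the day that needs to be rebuilt

recomputed = (spark.read.format("avro").load(f"/landing/antennas/{day}/")
              .groupBy("antenna_id", F.to_date("event_time").alias("event_date"))
              .agg(F.count("*").alias("events"),
                   F.avg("signal_strength").alias("avg_signal")))

# Overwrite only that day's partition; readers never see a half-written state
(recomputed.write
 .format("delta")
 .mode("overwrite")
 .option("replaceWhere", f"event_date = '{day}'")
 .save("/lake/antenna_daily"))
```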

Possible advantages of Delta Lake and Hudi over Spark/Impala:

1) Real-time processing with Spark Streaming/Structured Streaming: Spark Streaming lets us process and aggregate data in real time as it arrives from the different antennas. This enables a rapid response to changes in the data and the ability to analyze and act on it almost instantly.

2) Efficient storage with Parquet and Delta Lake: Storing the processed data as Parquet files makes both storage and querying efficient. On top of that, managing those Parquet files with Delta Lake adds features such as ACID transactions and data versioning, which guarantee integrity and consistency (see the sketch after this list). Apache Impala does not provide ACID transactions and does not guarantee that the Parquet partitions it reads are mutually consistent, so some queries can return inconsistent data. That caused surprise and discomfort among the operators, and we had to patch up the Parquet creation step, which is itself heavily affected by the dynamic nature of the traffic between antennas. A lot of time was wasted hunting for possible bugs in the code, when the real problem was having chosen a tool that is not designed for ACID transactions. Delta Lake is.

3) Managing changing data with Apache Hudi: Apache Hudi lets you perform insert, update and delete operations efficiently and consistently in a distributed environment, which is especially useful when stored data has to be updated in near real time. It naturally complements Delta Lake for handling those possibly inconsistent records: we let Delta Lake take care of distributed transactionality for consistent reads, and when the data has to be updated, we delegate that to Hudi.

4) Flexibility and scalability: This architecture provides great flexibility and scalability to adapt to different needs and workloads. You can easily adjust the system to handle larger volumes of data or add new data sources without significantly modifying the overall architecture.
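
To illustrate point 2, this is roughly what consistent, versioned reads look like against the hypothetical Delta table from the earlier sketches; the version number is made up:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable  # delta-spark package

spark = SparkSession.builder.appName("antenna-reads").getOrCreate()

# Every read sees exactly one committed table version, never a half-written partition
current = spark.read.format("delta").load("/lake/antenna_agg")

# Time travel: re-read the table as it was at an earlier version (or timestamp)
previous = (spark.read.format("delta")
            .option("versionAsOf", 42)  # hypothetical version number
            .load("/lake/antenna_agg"))

# The transaction log records what changed between versions
DeltaTable.forPath(spark, "/lake/antenna_agg").history().show(truncate=False)
```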

Possible improvements:

If I need to be able to change schemas quickly and access historical data, I could also use Apache Iceberg.

Apache Iceberg is an open source table format designed to manage very large datasets, providing a file layout optimized for analytical queries and support for schema evolution. The ACID guarantees are already covered by Delta Lake, so what interests me most is its ability to evolve schemas, which is more advanced than what Delta Lake allows me on its own.
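
A minimal sketch of what that schema evolution looks like, assuming a Spark session configured with an Iceberg catalog called lake and a hypothetical telecom.antenna_agg table (the Iceberg Spark runtime and its SQL extensions would need to be on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("antenna-schema-evolution").getOrCreate()

# Columns can be added or renamed without rewriting the existing data files;
# old snapshots keep reading with the schema they were written under.
spark.sql("ALTER TABLE lake.telecom.antenna_agg ADD COLUMNS (firmware_version string)")
spark.sql("ALTER TABLE lake.telecom.antenna_agg RENAME COLUMN avg_signal TO avg_signal_dbm")
```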

Here are some advantages of incorporating Apache Iceberg into the streaming architecture:

1) Schema evolution: Iceberg allows fast, controlled schema changes, meaning you can update or modify the schema of your data without disrupting existing read and write operations. This is crucial in environments where data schemas evolve over time.

2) Historical data management: Iceberg includes built-in support for data versioning, allowing you to access historical versions of your data (see the sketch after this list). This is useful for retrospective analysis, debugging, and compliance with regulations that require keeping historical data.

3) Query optimization: Iceberg is designed to deliver good performance on analytical queries, using techniques such as partition pruning, column pruning and metadata-based file skipping to speed them up. This lets you run complex analyses efficiently over large datasets.

4) Integration with other Hadoop ecosystem projects: Iceberg integrates well with other big data projects such as Apache Spark, Apache Hudi and Apache Parquet, so you can keep using those tools while benefiting from Iceberg's additional features.
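
To illustrate the historical-data side, a sketch of reading an Iceberg table as of an earlier snapshot; the path and snapshot id are invented, and real ids would come from the table's snapshot metadata:

```python
from pyspark.sql import SparkSession

# Requires the iceberg-spark runtime on the classpath
spark = SparkSession.builder.appName("antenna-time-travel").getOrCreate()

# Re-read the table exactly as it was at an earlier snapshot
old_state = (spark.read
             .format("iceberg")
             .option("snapshot-id", 6023348752014741797)  # hypothetical snapshot id
             .load("/lake/iceberg/antenna_agg"))          # or .option("as-of-timestamp", <millis>)

old_state.show()
```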

What are the disadvantages of using this combination of technologies instead of a more traditional batch architecture that does not guarantee distributed ACID reads?

1) Architectural complexity: The combination of Spark Streaming, Delta Lake, Apache Hudi, and Apache Iceberg can result in a more complex architecture compared to a more traditional batch processing approach. This can increase the learning curve and operational complexity of the system.

2) More resources required: Using real-time processing technologies like Spark Streaming can require more resources in terms of CPU, memory, and storage compared to a batch processing approach. This can increase operational costs and the complexity of infrastructure management.

3) Higher risk of errors: With a more complex and diverse architecture, there is a higher risk of errors in system development, implementation and maintenance. This may require greater attention to code quality, extensive testing, and careful management of changes and updates.

4) Lower performance in some operations: Although the technologies I have proposed are optimized for different kinds of work (Spark Streaming for real-time processing, Delta Lake and Iceberg for efficient storage and querying, Apache Hudi for distributed ACID writes), there may be scenarios where performance is not optimal compared to a more traditional batch approach, whether because of the extra complexity of the architecture or because of the specific characteristics of the technologies involved. In essence, though, we are talking about leaving behind a technology like Apache Impala, which is an entire cluster in itself, and delegating its role to Spark plus a modern storage layer such as Delta Lake, together with a handful of additional libraries that provide characteristics which are, in my opinion, very beneficial for consistent, distributed real-time streaming.

As with everything, in the end it comes down to testing and measuring against what came before. In principle, I think this approach has more advantages than disadvantages. The main disadvantage I see is that the technology may not yet be mature enough for the three of them to work well together, so there is no choice but to build prototypes. I wonder who out there has the necessary resources; I would love to set up a cluster of this kind to do real-time processing.
