Actually, these three projects cannot work together in the same session because of dependency conflicts, so I will present my findings for each framework separately.
https://docs.delta.io/latest/quick-start.html#prerequisite-set-up-java
Let's start with Delta Lake alone; first, spark-sql:
┌<▸> ~/spark-3.5.0-bin-hadoop3
└➤
bin/spark-sql --packages io.delta:delta-spark_2.12:3.1.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
:: loading settings :: url = jar:file:/Users/aironman/spark-3.5.0-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/aironman/.ivy2/cache
The jars for the packages stored in: /Users/aironman/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-************************************;1.0
confs: [default]
found io.delta#delta-spark_2.12;3.1.0 in central
found io.delta#delta-storage;3.1.0 in central
found org.antlr#antlr4-runtime;4.9.3 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.1.0/delta-spark_2.12-3.1.0.jar ...
[SUCCESSFUL ] io.delta#delta-spark_2.12;3.1.0!delta-spark_2.12.jar (7140ms)
downloading https://repo1.maven.org/maven2/io/delta/delta-storage/3.1.0/delta-storage-3.1.0.jar ...
[SUCCESSFUL ] io.delta#delta-storage;3.1.0!delta-storage.jar (70ms)
:: resolution report :: resolve 894ms :: artifacts dl 7217ms
:: modules in use:
io.delta#delta-spark_2.12;3.1.0 from central in [default]
io.delta#delta-storage;3.1.0 from central in [default]
org.antlr#antlr4-runtime;4.9.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 2 | 2 | 0 || 3 | 2 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-************************************
confs: [default]
2 artifacts copied, 1 already retrieved (5398kB/12ms)
24/02/09 09:33:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/09 09:33:54 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
24/02/09 09:33:54 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
24/02/09 09:33:55 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
24/02/09 09:33:55 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore aironman@************
Spark Web UI available at http://************:4040
Spark master: local[*], Application Id: local-1707467632729
spark-sql (default)> 24/02/09 09:34:08 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
Now, spark-shell 3.5.0 with Delta Lake alone:
┌<▪> ~/spark-3.5.0-bin-hadoop3
└➤
bin/spark-shell --packages io.delta:delta-spark_2.12:3.1.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
24/02/09 10:00:38 WARN Utils: Your hostname, MacBook-Pro-de-Alonso.local resolves to a loopback address: *********; using ************ instead (on interface en0)
24/02/09 10:00:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/Users/aironman/spark-3.5.0-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/aironman/.ivy2/cache
The jars for the packages stored in: /Users/aironman/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-************************************;1.0
confs: [default]
found io.delta#delta-spark_2.12;3.1.0 in central
found io.delta#delta-storage;3.1.0 in central
found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 177ms :: artifacts dl 5ms
:: modules in use:
io.delta#delta-spark_2.12;3.1.0 from central in [default]
io.delta#delta-storage;3.1.0 from central in [default]
org.antlr#antlr4-runtime;4.9.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 0 | 0 | 0 || 3 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-************************************
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/4ms)
24/02/09 10:00:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://************:4040
Spark context available as 'sc' (master = local[*], app id = local-1707469243781).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.5.0
/_/
Using Scala version 2.12.18 (Java HotSpot(TM) 64-Bit Server VM, Java 21.0.2)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.textFile("README.md").flatMap(line=>line.split(" ")).map(word=>(word,1)).reduceByKey(_+_).foreach(println)
(package,1)
(this,1)
(integration,1)
...
(command,,2)
(Hadoop,3)
scala> 24/02/09 10:01:00 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
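The word count pipeline above can be sketched without Spark; here is a minimal plain-Python model of the flatMap / map / reduceByKey steps (the sample lines are made up, not the real README.md):

```python
from collections import Counter

lines = ["Apache Spark", "Spark is fast", "fast and general"]
# flatMap: split each line into words; map + reduceByKey: count per word
# (Counter collapses the (word, 1) pairs and the per-key sum into one step)
counts = Counter(word for line in lines for word in line.split(" "))
print(counts["Spark"])  # -> 2
```

The real Spark job does the same thing, just distributed across partitions, with the shuffle happening at the reduceByKey boundary.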
scala> val data = spark.range(0, 5)
data: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> data.write.format("delta").save("/tmp/delta-table")
24/02/09 10:06:12 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
scala> val df = spark.read.format("delta").load("/tmp/delta-table")
df: org.apache.spark.sql.DataFrame = [id: bigint]
scala> df.show()
+---+
| id|
+---+
| 3|
| 4|
| 0|
| 1|
| 2|
+---+
scala> val data = spark.range(5, 10)
data: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
scala> df.show()
+---+
| id|
+---+
| 9|
| 5|
| 7|
| 6|
| 8|
+---+
scala> import io.delta.tables._
import io.delta.tables._
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala>
scala> val deltaTable = DeltaTable.forPath("/tmp/delta-table")
deltaTable: io.delta.tables.DeltaTable = io.delta.tables.DeltaTable@20f8ff49
scala>
scala> // Update every even value by adding 100 to it
scala> deltaTable.update(
| condition = expr("id % 2 == 0"),
| set = Map("id" -> expr("id + 100")))
scala>
scala> // Delete every even value
scala> deltaTable.delete(condition = expr("id % 2 == 0"))
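The update and delete above can be modeled outside Spark; a plain-Python sketch of the same two transformations (the list literal mirrors the table state after the overwrite):

```python
rows = [5, 6, 7, 8, 9]  # table state after the overwrite shown earlier
# update: add 100 to every even value
rows = [r + 100 if r % 2 == 0 else r for r in rows]
# delete: remove every even value (including the freshly updated 106 and 108)
rows = [r for r in rows if r % 2 != 0]
print(sorted(rows))  # -> [5, 7, 9]
```

So after these two operations only the odd values 5, 7 and 9 remain in the table, which is the state the merge below starts from.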
scala>
scala> // Upsert (merge) new data
scala> val newData = spark.range(0, 20).toDF
newData: org.apache.spark.sql.DataFrame = [id: bigint]
scala>
scala> deltaTable.as("oldData")
res7: io.delta.tables.DeltaTable = io.delta.tables.DeltaTable@2ba9b191
scala> .merge(
| newData.as("newData"),
| "oldData.id = newData.id")
res8: io.delta.tables.DeltaMergeBuilder = io.delta.tables.DeltaMergeBuilder@79c9090a
scala> .whenMatched
res9: io.delta.tables.DeltaMergeMatchedActionBuilder = io.delta.tables.DeltaMergeMatchedActionBuilder@53fb8fe0
scala> .update(Map("id" -> col("newData.id")))
res10: io.delta.tables.DeltaMergeBuilder = io.delta.tables.DeltaMergeBuilder@76f82103
scala> .whenNotMatched
res11: io.delta.tables.DeltaMergeNotMatchedActionBuilder = io.delta.tables.DeltaMergeNotMatchedActionBuilder@7377acf4
scala> .insert(Map("id" -> col("newData.id")))
res12: io.delta.tables.DeltaMergeBuilder = io.delta.tables.DeltaMergeBuilder@6f1d8552
scala> .execute()
scala>
scala> deltaTable.toDF.show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
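The merge above is an upsert. A plain-Python sketch of the whenMatched/whenNotMatched semantics, modeling the table as a dict keyed by id (the `merge` helper and row dicts are illustrative, not the Delta API):

```python
def merge(old_rows, new_rows):
    """Upsert: rows whose id matches are updated, the rest are inserted."""
    table = {row["id"]: row for row in old_rows}
    for row in new_rows:
        table[row["id"]] = row  # whenMatched -> update, whenNotMatched -> insert
    return sorted(table.values(), key=lambda r: r["id"])

old = [{"id": 5}, {"id": 7}, {"id": 9}]  # table state before the merge
new = [{"id": i} for i in range(0, 20)]  # newData = spark.range(0, 20)
merged = merge(old, new)
print(len(merged))  # -> 20, matching the toDF.show() output above
```

The matched rows (5, 7, 9) are updated and the other seventeen ids are inserted, which is why show() prints all twenty values.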
scala> import io.delta.tables._
import io.delta.tables._
scala>
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import spark.implicits._
import spark.implicits._
scala> deltaTable.toDF.show()
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
scala> import io.delta.tables._
import io.delta.tables._
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import spark.implicits._
import spark.implicits._
scala> deltaTable.delete(col("id").equalTo(0))
scala> deltaTable.toDF.show()
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
scala> val df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
df: org.apache.spark.sql.DataFrame = [id: bigint]
scala> df.show()
+---+
| id|
+---+
| 3|
| 4|
| 0|
| 1|
| 2|
+---+
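Time travel works because every write appends a new table version instead of mutating data files in place; reading with versionAsOf 0 recovers the original 0..4 despite all the later updates, deletes and merges. A toy plain-Python model of the idea (the `VersionedTable` class is illustrative, not the Delta implementation):

```python
class VersionedTable:
    """Toy model of Delta time travel: every write appends a new snapshot."""

    def __init__(self):
        self.versions = []

    def write(self, rows):
        self.versions.append(list(rows))

    def read(self, version_as_of=None):
        # None means "latest snapshot"; an int selects an older version
        idx = -1 if version_as_of is None else version_as_of
        return self.versions[idx]

t = VersionedTable()
t.write(range(0, 5))   # version 0
t.write(range(5, 10))  # version 1 (the overwrite)
print(t.read(version_as_of=0))  # the original data is still readable
```

Delta keeps this version history in the `_delta_log` directory next to the data files, which is what `option("versionAsOf", 0)` reads from.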
There is a lot more you can do with Delta Lake, but that will have to wait for another day.
As of this writing, I got an exception using Spark 3.4.2 with Hudi 0.14.1, so I will try a Docker image instead. I can run the official demo, and I can build my own images.
Let's start with the demo. Note that its Spark version is 2.4.4, a bit outdated, so there is no Delta Lake in it at the time of this writing.
I will follow the instructions from this URL:
https://hudi.apache.org/docs/docker_demo
You have to download the sources from GitHub, switch to a stable branch, and compile with Maven and JDK 8; after that you can run the demo:
┌<▸> ~/g/hudi
└➤
sdk use java 8.0.332-sem
Using java version 8.0.332-sem in this shell.
┌<▸> ~/g/hudi
└➤
mvn clean package -Pintegration-tests -DskipTests -Dscala-2.11 -Dspark2.4
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19:45 min
[INFO] Finished at: 2024-02-09T11:54:50+01:00
[INFO] ------------------------------------------------------------------------
┌<▸> ~/g/h/docker
└➤
./setup_demo.sh
Pulling docker demo images ...
[+] Pulling 140/17
✔ hiveserver Skipped - Image is already being pulled by hivemetastore 0.0s
✔ presto-coordinator-1 Skipped - Image is already being pulled by presto-worker-1 0.0s
✔ adhoc-2 Skipped - Image is already being pulled by adhoc-1 0.0s
✔ presto-worker-1 14 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 535.1s
✔ adhoc-1 5 layers [⣿⣿⣿⣿⣿] 0B/0B Pulled 376.2s
✔ sparkmaster 1 layers [⣿] 0B/0B Pulled 362.0s
✔ namenode 3 layers [⣿⣿⣿] 0B/0B Pulled 419.9s
✔ trino-worker-1 23 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 325.7s
✔ datanode1 11 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 331.1s
✔ kafka 11 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 460.8s
✔ graphite 5 layers [⣿⣿⣿⣿⣿] 0B/0B Pulled 417.0s
✔ hive-metastore-postgresql 14 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 325.9s
✔ hivemetastore Pulled 342.4s
✔ historyserver 3 layers [⣿⣿⣿] 0B/0B Pulled 339.0s
✔ zookeeper 11 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 260.5s
✔ spark-worker-1 21 layers [⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿] 0B/0B Pulled 362.0s
✔ trino-coordinator-1 1 layers [⣿] 0B/0B Pulled 335.7s
[+] Running 17/20
⠋ Network hudi Created 8.4s
⠏ Volume "compose_hive-metastore-postgresql" Created 8.3s
⠏ Volume "compose_historyserver" Created 8.3s
✔ Container kafkabroker Started 2.2s
✔ Container hive-metastore-postgresql Started 1.9s
✔ Container graphite Started 2.1s
✔ Container namenode Started 2.5s
✔ Container zookeeper Started 2.0s
✔ Container hivemetastore Started 2.4s
✔ Container historyserver Started 2.3s
✔ Container datanode1 Started 3.0s
✔ Container presto-coordinator-1 Started 3.3s
✔ Container trino-coordinator-1 Started 3.4s
✔ Container hiveserver Started 3.3s
✔ Container presto-worker-1 Started 6.0s
✔ Container sparkmaster Started 6.0s
✔ Container trino-worker-1 Started 6.0s
✔ Container adhoc-2 Started 6.9s
✔ Container adhoc-1 Started 7.2s
✔ Container spark-worker-1 Started 7.2s
Copying spark default config and setting up configs
Copying spark default config and setting up configs
┌<▸> ~/g/h/docker
└➤
docker container ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d3d699205150 apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_2.4.4:latest "entrypoint.sh /bin/…" About a minute ago Up About a minute 0-1024/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:4040->4040/tcp, 0.0.0.0:5006->5005/tcp adhoc-1
c6de29455f1b apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkadhoc_2.4.4:latest "entrypoint.sh /bin/…" About a minute ago Up About a minute 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:5005->5005/tcp adhoc-2
377bd66ffbea apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkworker_2.4.4:latest "entrypoint.sh /bin/…" About a minute ago Up About a minute 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-8080/tcp, 8082-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8081->8081/tcp, 0.0.0.0:51041->5005/tcp spark-worker-1
9ae864e40096 apachehudi/hudi-hadoop_2.8.4-trinoworker_368:latest "./scripts/trino.sh …" About a minute ago Up About a minute 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-8091/tcp, 8093-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8092->8092/tcp, 0.0.0.0:51038->5005/tcp trino-worker-1
8fe6afce701b apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkmaster_2.4.4:latest "entrypoint.sh /bin/…" About a minute ago Up About a minute 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 6066/tcp, 7000-7076/tcp, 0.0.0.0:7077->7077/tcp, 7078-8079/tcp, 8081-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8080->8080/tcp, 0.0.0.0:51039->5005/tcp sparkmaster
9d2e1613f62a apachehudi/hudi-hadoop_2.8.4-prestobase_0.271:latest "entrypoint.sh worker" About a minute ago Up About a minute 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp presto-worker-1
84d494981598 apachehudi/hudi-hadoop_2.8.4-trinocoordinator_368:latest "./scripts/trino.sh …" About a minute ago Up About a minute 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-8090/tcp, 8092-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8091->8091/tcp, 0.0.0.0:51037->5005/tcp trino-coordinator-1
60af5db9add6 apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:latest "entrypoint.sh /bin/…" About a minute ago Up About a minute 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-9999/tcp, 10001-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:10000->10000/tcp, 0.0.0.0:51036->5005/tcp hiveserver
3d5ddebf93e0 apachehudi/hudi-hadoop_2.8.4-datanode:latest "/bin/bash /entrypoi…" About a minute ago Up About a minute (healthy) 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-10100/tcp, 50000-50009/tcp, 0.0.0.0:50010->50010/tcp, 50011-50074/tcp, 50076-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:50075->50075/tcp, 0.0.0.0:51034->5005/tcp datanode1
c2e83b6d9113 apachehudi/hudi-hadoop_2.8.4-prestobase_0.271:latest "entrypoint.sh coord…" About a minute ago Up About a minute 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-8089/tcp, 8091-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8090->8090/tcp, 0.0.0.0:51035->5005/tcp presto-coordinator-1
216b42d0f0d2 apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:latest "entrypoint.sh /opt/…" About a minute ago Up About a minute (healthy) 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-9082/tcp, 9084-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:9083->9083/tcp, 0.0.0.0:51033->5005/tcp hivemetastore
98b60a509b55 apachehudi/hudi-hadoop_2.8.4-history:latest "/bin/bash /entrypoi…" About a minute ago Up About a minute (healthy) 0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8187/tcp, 8189-10100/tcp, 50000-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:58188->8188/tcp historyserver
37222f268fee apachehudi/hudi-hadoop_2.8.4-namenode:latest "/bin/bash /entrypoi…" About a minute ago Up About a minute (healthy) 0-1024/tcp, 4040/tcp, 5000-5004/tcp, 5006-5100/tcp, 7000-8019/tcp, 8021-10100/tcp, 0.0.0.0:8020->8020/tcp, 50000-50069/tcp, 50071-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:50070->50070/tcp, 0.0.0.0:51032->5005/tcp namenode
5f258d440637 bde2020/hive-metastore-postgresql:2.3.0 "/docker-entrypoint.…" About a minute ago Up About a minute 5432/tcp hive-metastore-postgresql
4076a44cac06 bitnami/kafka:2.0.0 "/app-entrypoint.sh …" About a minute ago Up About a minute 0.0.0.0:9092->9092/tcp kafkabroker
fd3907643b41 graphiteapp/graphite-statsd "/entrypoint" About a minute ago Up About a minute *******:80->80/tcp, 2013-2014/tcp, 2023-2024/tcp, 8080/tcp, *******:2003-2004->2003-2004/tcp, *******:8126->8126/tcp, 8125/tcp, 8125/udp graphite
0f7ad6b9e84d bitnami/zookeeper:3.4.12-r68
The Maven build takes a while (nearly 20 minutes on my machine, as the build log above shows), so be patient!
You can stop the demo by running the following script:
┌<▸> ~/g/h/docker
└➤
./stop_demo.sh
[+] Running 18/18
✔ Container kafkabroker Removed 0.5s
✔ Container presto-worker-1 Removed 3.0s
✔ Container datanode1 Removed 10.5s
✔ Container zookeeper Removed 0.4s
✔ Container trino-worker-1 Removed 10.5s
✔ Container spark-worker-1 Removed 10.6s
✔ Container adhoc-1 Removed 10.3s
✔ Container graphite Removed 6.7s
✔ Container adhoc-2 Removed 10.3s
✔ Container presto-coordinator-1 Removed 3.1s
✔ Container historyserver Removed 10.3s
✔ Container trino-coordinator-1 Removed 10.4s
✔ Container sparkmaster Removed 10.3s
✔ Container hiveserver Removed 10.3s
✔ Container hivemetastore Removed 0.7s
✔ Container namenode Removed 10.2s
✔ Container hive-metastore-postgresql Removed 0.2s
✔ Network hudi Removed