Apache Spark with Frank Kane.

Another session with Apache Spark, with Frank Kane.

Install Apache Spark:

brew install apache-spark

Go into the Spark config directory and enable the log4j properties file, changing the log level from INFO to ERROR so the shell output stays quiet:

cd /usr/local/Cellar/apache-spark/2.4.4/libexec/conf/
mv log4j.properties.template log4j.properties
(in log4j.properties, change the log level from INFO to ERROR)

Start the shell and check how many lines a file has:

spark-shell
scala> sc.textFile("testSpark.txt").count
res0: Long = 1

Download this set of files: http://files.grouplens.org/datasets/movielens/ml-100k.zip

New scala project, new…
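The line count run in spark-shell above can also be packaged as a standalone Scala app. A minimal sketch, assuming the ml-100k dataset has been unzipped next to the project (the `LineCount` object name and the `ml-100k/u.data` path are illustrative assumptions, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

object LineCount {
  def main(args: Array[String]): Unit = {
    // Build a local session; in spark-shell both `spark` and `sc`
    // are created for you, here we do it explicitly
    val spark = SparkSession.builder()
      .appName("LineCount")
      .master("local[*]") // run locally on all cores
      .getOrCreate()

    // Same call as in the shell session: count the lines of a text file.
    // u.data is the ratings file inside the ml-100k download.
    val count = spark.sparkContext.textFile("ml-100k/u.data").count()
    println(s"lines: $count")

    spark.stop()
  }
}
```

Run it with `spark-submit` (or directly from the IDE with the Spark dependencies on the classpath).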


Spark, performance, shuffle

It is better to use repartition() on a DataFrame or partitionBy() on an RDD before running a long, costly operation. Operations such as join(), cogroup(), groupWith(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey() and lookup() can gain a lot if we get the partitioning right.

val moviePairs = ratings.as("ratings1")
  .join(ratings.as("ratings2"),
    $"ratings1.userId" === $"ratings2.userId"
      && $"ratings1.movieId" < $"ratings2.movieId")
  .select($"ratings1.movieId".alias("movie1"),
    $"ratings2.movieId".alias("movie2"),
    $"ratings1.rating".alias("rating1"),
    $"ratings2.rating".alias("rating2")…
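A minimal sketch of applying that advice to the self-join above: repartition by the join key before the join so both sides are already co-located by userId. The sample data and the partition count of 8 are illustrative assumptions; only the column names come from the original snippet.

```scala
import org.apache.spark.sql.SparkSession

object MoviePairs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MoviePairs")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical ratings data with the columns used in the join
    val ratings = Seq(
      (1, 10, 4.0), (1, 20, 5.0),
      (2, 10, 3.0), (2, 20, 4.0)
    ).toDF("userId", "movieId", "rating")

    // Repartition by the join key BEFORE the expensive self-join;
    // 8 is an arbitrary example, tune it to your data and cluster
    val partitioned = ratings.repartition(8, $"userId")

    val moviePairs = partitioned.as("ratings1")
      .join(partitioned.as("ratings2"),
        $"ratings1.userId" === $"ratings2.userId"
          && $"ratings1.movieId" < $"ratings2.movieId")
      .select($"ratings1.movieId".alias("movie1"),
        $"ratings2.movieId".alias("movie2"),
        $"ratings1.rating".alias("rating1"),
        $"ratings2.rating".alias("rating2"))

    moviePairs.show()
    spark.stop()
  }
}
```

Because both sides of the self-join come from the same repartitioned DataFrame, Spark can avoid an extra shuffle during the join itself.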