Clarity code challenge

Some time ago, in a galaxy far far away, a company asked me to do a challenge that consisted of building a real-time log file processing architecture: not describing what one would look like, but creating it from scratch. Obviously that felt like an attempt to rip me off, to steal work,…


Adjusting the number of partitions in a Spark job

It is better to use repartition() on a DataFrame or partitionBy() on an RDD before running a long and expensive operation. Operations like join(), cogroup(), groupWith(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup() can gain a lot if we get the partitioning right.

    val moviePairs = ratings.as("ratings1")
      .join(ratings.as("ratings2"),
        $"ratings1.userId" === $"ratings2.userId" && $"ratings1.movieId" < $"ratings2.movieId")
      .select(
        $"ratings1.movieId".alias("movie1"),
        $"ratings2.movieId".alias("movie2"),
        $"ratings1.rating".alias("rating1"),
        $"ratings2.rating".alias("rating2"))
      .repartition(100)
      .as[MoviePairs]

You have to play with that…
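To illustrate the RDD side mentioned above, here is a minimal sketch of partitionBy() ahead of a reduceByKey(); the ratings.csv path, the column layout, and the ratingsByUser name are assumptions for the example, not from the original post:

    import org.apache.spark.HashPartitioner

    // Hypothetical pair RDD of (userId, rating); partition it once up front
    val ratingsByUser = sc.textFile("ratings.csv")
      .map(_.split(","))
      .map(cols => (cols(0).toInt, cols(2).toDouble))
      .partitionBy(new HashPartitioner(100))
      .persist() // keep the partitioned layout so later stages reuse it

    // reduceByKey can now aggregate within partitions without re-shuffling
    val sumByUser = ratingsByUser.reduceByKey(_ + _)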

Working with Spark, again.

It has been a while since I last published something, so I am going to refresh how to work with Spark. These are some notes that I think I will need some day. I will start with basic operations on DataFrames. I am going to run a spark-shell; the current version is 2.4.4. // It…
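As a minimal sketch of the kind of basics the post covers (the people.json path and the column names are assumptions for illustration), the first DataFrame operations in a 2.4.4 spark-shell would look like this:

    // spark (a SparkSession) and sc are already available in spark-shell
    val df = spark.read.json("examples/src/main/resources/people.json")

    df.printSchema()                 // inspect the inferred schema
    df.select("name").show()        // project a column
    df.filter($"age" > 21).show()   // filter rows
    df.groupBy("age").count().show() // aggregate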