Working with Spark, again.

It has been a while since I last published something, so I am going to refresh how to work with Spark. These are some notes that I think I will need some day. I start with the basic operations on DataFrames. I am going to run a spark-shell; the current version is 2.4.4. // It…

How to parallelize multiple machine learning algorithms using a pipeline with Spark.

You basically need to create a Pipeline and build a ParamGrid that lists the different algorithms as alternative values of the pipeline's stages parameter. Here is a simple example:

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    // The stages are left unset here; the grid below supplies them.
    val pipeline = new Pipeline()

    // Each grid entry is a complete list of stages, one per algorithm.
    val paramGrid = new ParamGridBuilder()
      .addGrid(pipeline.stages, Array(Array[PipelineStage](dt), Array[PipelineStage](lr)))
      .build() // setEstimatorParamMaps expects an Array[ParamMap]

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)…
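The excerpt above stops before the cross-validator is fitted, so here is a minimal sketch of how the whole thing could run end to end in a spark-shell on Spark 2.4. Everything beyond the names in the excerpt is an assumption made for illustration: the toy `training` DataFrame, the `BinaryClassificationEvaluator`, three folds, and `setParallelism(2)` (available since Spark 2.3) to let the candidates be evaluated concurrently.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; master is set for standalone runs.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("model-selection-sketch")
  .getOrCreate()

// Hypothetical toy data: a binary label and a two-dimensional feature vector.
val rows = (0 until 40).map { i =>
  val label = (i % 2).toDouble
  (label, Vectors.dense(label + 0.1 * (i % 5), (i % 3).toDouble))
}
val training = spark.createDataFrame(rows).toDF("label", "features")

val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

// The pipeline has no fixed stages; each grid entry supplies a full stage list.
val pipeline = new Pipeline()

val paramGrid = new ParamGridBuilder()
  .addGrid(pipeline.stages, Array(Array[PipelineStage](dt), Array[PipelineStage](lr)))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator()) // assumed metric: area under ROC
  .setNumFolds(3)                                    // assumed fold count
  .setParallelism(2)                                 // evaluate up to two candidates at once

val cvModel = cv.fit(training)

// One average metric per grid entry, in the same order as paramGrid.
cvModel.avgMetrics.zip(paramGrid).foreach { case (metric, params) =>
  println(s"$metric <- $params")
}

// The best model is the pipeline refitted on all data with the winning stage list.
val best = cvModel.bestModel.asInstanceOf[PipelineModel]
println(best.stages.map(_.getClass.getSimpleName).mkString(", "))
```

Note that `setParallelism` only controls how many candidate models are fitted at the same time on the driver side; each individual fit is still distributed by Spark as usual. The grid could also combine the stage lists with hyperparameters of the individual estimators, although every combination is then evaluated for every stage list, including those where the parameter has no effect.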