About how to parallelize multiple machine learning algorithms using a pipeline with Spark

You basically need to make a Pipeline and build a ParamGrid with the different algorithms as stages. Here is a simple example:

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val pipeline = new Pipeline()

    // Each grid entry swaps in a different algorithm as the pipeline's only stage.
    val paramGrid = new ParamGridBuilder()
      .addGrid(pipeline.stages, Array(Array[PipelineStage](dt), Array[PipelineStage](lr)))
      .build() // setEstimatorParamMaps expects the built Array[ParamMap]

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
    […]
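For reference, here is a minimal end-to-end sketch of the same idea. It is not the original post's code. It assumes Spark 2.3 or later (for `setParallelism`), a local master, and a tiny made-up dataset. The object name `ModelSelectionSketch`, the evaluator choice, the fold count and the parallelism level are all assumptions added for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, only here to make the sketch self-contained.
object ModelSelectionSketch {

  // Returns the average cross-validation metric per candidate algorithm,
  // in the same order as the entries of the parameter grid.
  def run(): Array[Double] = {
    val spark = SparkSession.builder()
      .appName("model-selection-sketch")
      .master("local[*]") // assumption: local run, not a cluster
      .getOrCreate()
    import spark.implicits._

    // Made-up toy data: 40 rows, binary label, two numeric features.
    val training = (0 until 40).map { i =>
      val label = (i % 2).toDouble
      (label, Vectors.dense(label + (i % 5) * 0.1, (i % 7).toDouble))
    }.toDF("label", "features")

    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    // The pipeline has no fixed stages; the grid supplies them.
    val pipeline = new Pipeline()

    val paramGrid = new ParamGridBuilder()
      .addGrid(pipeline.stages, Array(Array[PipelineStage](dt), Array[PipelineStage](lr)))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEstimatorParamMaps(paramGrid)
      .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("label")) // assumption
      .setNumFolds(3)     // assumption
      .setParallelism(2)  // assumption: Spark 2.3+, evaluates candidates concurrently

    val cvModel = cv.fit(training)
    val metrics = cvModel.avgMetrics

    // Print one line per candidate: metric and the stages it was computed for.
    metrics.zip(paramGrid).foreach { case (metric, params) =>
      println(s"$metric <- $params")
    }

    spark.stop()
    metrics
  }

  def main(args: Array[String]): Unit = run()
}
```

On Spark versions older than 2.3, dropping the `setParallelism` call should leave the rest of the sketch valid, with the candidates evaluated one after another.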

About how to create a table in Hive using a data frame

Hi everyone. As usual, here is a recipe for quickly creating a table in Hive from the data in a data frame. I am using a spark-shell connected to a development cluster running Cloudera CDH 5.5.2, so, according to this official Cloudera site, the Hive version is 1.1.0. It should be important when I have […]

About how to work with RDDs using Scala

I am taking a big data course with formacionhadoop.com, specifically the 150-hour online master big data expert, and I am writing this post so that in the future I can remember how to work with RDDs using Scala code, a JVM language with strong support for functional programming. […]