About how to work with RDDs using Scala
I am taking a big data course with formacionhadoop.com, specifically the 150-hour online Master Big Data Expert, and I am writing this post so that in the future I can remember how to work with RDDs in Scala, a JVM language that combines functional and object-oriented programming.…
About an example using Kafka, Spark Streaming, MongoDB and Twitter
Hi everyone, as part of my effort to master Scala and big data technologies, I am learning how to integrate Apache Kafka, Spark Streaming, MongoDB and Twitter. The idea is simple: a producer process pushes JSON tweets from Twitter into a Kafka topic, while a separate Spark Streaming process pulls data from that topic, storing…
About how to interact with MongoDB and Spark Streaming using Scala
Hi, over the last few days I have been developing a solution involving MongoDB, Twitter4J, Spark Streaming and machine learning (k-means) using Scala. The project is built with sbt, and it continues the previous project on Cassandra, Spark Streaming and machine learning with Scala, so if you want sbt test to work,…
About how to interact with Cassandra using Scala
Hi again, I have finally learned how to interact with a Cassandra server using Scala. I am learning this language in order to write efficient code that works with Apache Spark and other big data tools, such as Cassandra, Kafka, Flume, machine learning algorithms and many others. The most impatient can download the project…
1. Introduction Hi folks, in a tutorial I wrote not long ago, I described how to configure a Hadoop cluster in pseudo-distributed mode, that is, how to install on a single test machine all the components that make up a Hadoop cluster, namely NameNodes, DataNodes, TaskTrackers and JobTrackers. In the real world, these components are…
My notes on Apache Spark
Spark Fundamentals 1, my notes. You are working with version 1.3.0, which is the version that Cloudera provides and maintains. The latest version is currently 1.4.1. Apache Spark is a very fast, general-purpose distributed computing system designed to be used on large clusters of machines. It provides high-level APIs in several…