About big data, Hadoop MapReduce and the datos.gob.es initiative

Finally I can get on with this post: a sample of a big data technology, namely a Java MapReduce job running on Apache Hadoop.

First of all, you need to install Hadoop, and I have to say it is not trivial. Depending on your OS you may install it with apt, yum, brew, etc., or, as I did, download a VMware image with all the necessary stuff already set up. There are several providers, such as Cloudera or IBM BigInsights. I chose the latter because I learned my big data concepts at bigdatauniversity.com, an initiative from IBM.

Once the BigInsights VMware image is downloaded, boot it, log in with biadmin/biadmin and click the Start BigInsights button. After a few minutes Hadoop will be up and running; you can check it by opening http://bivm:8080/data/html/index.html#redirect-welcome in the VM's Firefox.

Once you have a Hadoop cluster to play with, it is time to write some code, but first you need to analyze the input. I use a small sample here, but real datasets run to terabytes, exabytes or more: billions of lines with this format:

id ; Agente Registrador   ; Total dominios;
1  ; 1&1 Internet         ; 382.972;
36 ; WEIS CONSULTING      ; 4.154;
71 ; MESH DIGITAL LIMITED ; 910;
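Before wiring this format into Hadoop, it helps to check the field handling in plain Java. This is a minimal sketch of the split-and-trim step the mapper below performs (the class and method names are mine, not part of the project):

```java
public class LineParseSketch {

    private static final String SEPARATOR = ";";

    // Splits one line of the registrar report and trims every field.
    static String[] parseLine(String line) {
        String[] fields = line.split(SEPARATOR);
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] fields = parseLine("1  ; 1&1 Internet    ; 382.972;");
        // fields[1] is the registrar, fields[2] the domain count as text.
        // Note: the dot in "382.972" is a Spanish thousands separator; the job
        // parses it as a decimal point, which round-trips unchanged through
        // the reducer's DecimalFormat("#.###").
        System.out.println(fields[1] + " -> " + fields[2]); // prints "1&1 Internet -> 382.972"
    }
}
```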

This is the mapper. Its purpose is to turn each input line into a key/value pair: the registrar name and its domain count.

 

import java.io.IOException;

import org.apache.commons.lang.math.NumberUtils;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DominiosRegistradorMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private static final String SEPARATOR = ";";

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        /*
         * id ; Agente Registrador   ; Total dominios;
         * 1  ; 1&1 Internet         ; 382.972;
         * 36 ; WEIS CONSULTING      ; 4.154;
         * 71 ; MESH DIGITAL LIMITED ; 910;
         */
        final String[] values = value.toString().split(SEPARATOR);
        if (values.length < 3) {
            return; // skip blank or malformed lines
        }

        final String agente = format(values[1]);
        final String totalDominios = format(values[2]);

        // The header line is discarded here: "Total dominios" is not a number.
        if (NumberUtils.isNumber(totalDominios)) {
            context.write(new Text(agente), new DoubleWritable(NumberUtils.toDouble(totalDominios)));
        }
    }

    private String format(String value) {
        return value.trim();
    }
}

 

This is the reducer:

import java.io.IOException;
import java.text.DecimalFormat;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DominiosRegistradorReducer extends Reducer<Text, DoubleWritable, Text, Text> {

    private final DecimalFormat decimalFormat = new DecimalFormat("#.###");

    @Override
    public void reduce(Text key, Iterable<DoubleWritable> totalDominiosValues, Context context)
            throws IOException, InterruptedException {
        double maxTotalDominios = 0.0;

        // Keep the largest number of domains seen for this registrar.
        for (DoubleWritable totalDominiosValue : totalDominiosValues) {
            maxTotalDominios = Math.max(maxTotalDominios, totalDominiosValue.get());
        }

        context.write(key, new Text(decimalFormat.format(maxTotalDominios)));
    }
}
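Stripped of the Hadoop types, the reducer body is just a running maximum. Here is a quick sketch to convince yourself of the logic, using the values from the sample above (the class name is mine, not part of the project):

```java
public class MaxSketch {

    // The same accumulation the reducer performs for one key.
    static double maxOf(double[] totals) {
        double max = 0.0;
        for (double total : totals) {
            max = Math.max(max, total);
        }
        return max;
    }

    public static void main(String[] args) {
        double[] totals = {382.972, 4.154, 910};
        System.out.println(maxOf(totals)); // prints 910.0
    }
}
```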

This is the main class:

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class App extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("DominiosRegistradorManager required params: {input file} {output dir}");
            System.exit(-1);
        }

        deleteOutputFileIfExists(args);

        final Job job = new Job(getConf(), "DominiosRegistradorManager");
        job.setJarByClass(App.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(DominiosRegistradorMapper.class);
        job.setReducerClass(DominiosRegistradorReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    private void deleteOutputFileIfExists(String[] args) throws IOException {
        final Path output = new Path(args[1]);
        FileSystem.get(output.toUri(), getConf()).delete(output, true);
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new App(), args);
    }
}

Now that you have had a glimpse of the code, you can download it and import it into Eclipse. Once imported, build a jar. With that jar and the cluster online you are almost ready to launch the job, but first you need the data file from http://datos.gob.es: download it and upload it to your cluster. I recommend using the browser for that. Click Start BigInsights if you have not already, open the BigInsights web console and click Files. On the left you will see the HDFS tree, Hadoop's distributed file system. Expand it down to /user/biadmin/ and create a directory there, for example inputMR, so that /user/biadmin/inputMR shows up in the tree, then upload the sample file to that directory. You do not need to create outputMR by hand: the driver deletes any previous output directory and the job creates a fresh one.
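If you prefer the terminal to the web console, the same upload can be done with the standard hadoop fs commands (dominios.csv here is just a placeholder for whatever file you downloaded from datos.gob.es):

```shell
# Create the input directory in HDFS and upload the data file
hadoop fs -mkdir /user/biadmin/inputMR
hadoop fs -put dominios.csv /user/biadmin/inputMR/
# Check that the file arrived
hadoop fs -ls /user/biadmin/inputMR
```

After the job finishes, hadoop fs -cat /user/biadmin/outputMR/part-r-00000 prints the result (the part file name may vary on your cluster).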

[biadmin@bivm ~]$ hadoop jar nameOfYourJar.jar /user/biadmin/inputMR /user/biadmin/outputMR
14/05/12 12:09:24 INFO input.FileInputFormat: Total input paths to process : 2
14/05/12 12:09:24 WARN snappy.LoadSnappy: Snappy native library is available
14/05/12 12:09:24 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/05/12 12:09:24 INFO snappy.LoadSnappy: Snappy native library loaded
14/05/12 12:09:24 INFO mapred.JobClient: Running job: job_201405121126_0059
14/05/12 12:09:25 INFO mapred.JobClient: map 0% reduce 0%
14/05/12 12:09:31 INFO mapred.JobClient: map 50% reduce 0%
14/05/12 12:09:34 INFO mapred.JobClient: map 100% reduce 0%
14/05/12 12:09:43 INFO mapred.JobClient: map 100% reduce 100%
14/05/12 12:09:44 INFO mapred.JobClient: Job complete: job_201405121126_0059
14/05/12 12:09:44 INFO mapred.JobClient: Counters: 29
14/05/12 12:09:44 INFO mapred.JobClient: Job Counters
14/05/12 12:09:44 INFO mapred.JobClient: Data-local map tasks=2
14/05/12 12:09:44 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=8827
14/05/12 12:09:44 INFO mapred.JobClient: Launched map tasks=2
14/05/12 12:09:44 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 12:09:44 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 12:09:44 INFO mapred.JobClient: Launched reduce tasks=1
14/05/12 12:09:44 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10952
14/05/12 12:09:44 INFO mapred.JobClient: File Input Format Counters
14/05/12 12:09:44 INFO mapred.JobClient: Bytes Read=197
14/05/12 12:09:44 INFO mapred.JobClient: File Output Format Counters
14/05/12 12:09:44 INFO mapred.JobClient: Bytes Written=19
14/05/12 12:09:44 INFO mapred.JobClient: FileSystemCounters
14/05/12 12:09:44 INFO mapred.JobClient: HDFS_BYTES_READ=413
14/05/12 12:09:44 INFO mapred.JobClient: FILE_BYTES_WRITTEN=76101
14/05/12 12:09:44 INFO mapred.JobClient: FILE_BYTES_READ=50
14/05/12 12:09:44 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=19
14/05/12 12:09:44 INFO mapred.JobClient: Map-Reduce Framework
14/05/12 12:09:44 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3867070464
14/05/12 12:09:44 INFO mapred.JobClient: Reduce input groups=2
14/05/12 12:09:44 INFO mapred.JobClient: Combine output records=4
14/05/12 12:09:44 INFO mapred.JobClient: Map output records=4
14/05/12 12:09:44 INFO mapred.JobClient: CPU time spent (ms)=1960
14/05/12 12:09:44 INFO mapred.JobClient: Map input records=2
14/05/12 12:09:44 INFO mapred.JobClient: Reduce shuffle bytes=56
14/05/12 12:09:44 INFO mapred.JobClient: Combine input records=4
14/05/12 12:09:44 INFO mapred.JobClient: Spilled Records=8
14/05/12 12:09:44 INFO mapred.JobClient: SPLIT_RAW_BYTES=216
14/05/12 12:09:44 INFO mapred.JobClient: Map output bytes=36
14/05/12 12:09:44 INFO mapred.JobClient: Reduce input records=4
14/05/12 12:09:44 INFO mapred.JobClient: Physical memory (bytes) snapshot=697741312
14/05/12 12:09:44 INFO mapred.JobClient: Total committed heap usage (bytes)=746494976
14/05/12 12:09:44 INFO mapred.JobClient: Reduce output records=2
14/05/12 12:09:44 INFO mapred.JobClient: Map output materialized bytes=56
[biadmin@bivm ~]$

If you see something like this, congrats! Your MapReduce job is done and the results are in /user/biadmin/outputMR.

The source is located at https://github.com/alonsoir/mrDominioRegistrador

The data is taken from http://datos.gob.es

http://en.wikipedia.org/wiki/MapReduce

Enjoy!

 
