Finally I can continue with this post: a sample with a big data technology, in this case a Java MapReduce task running on Apache Hadoop.
First of all, you need to install Hadoop, and I have to say that it is not trivial. Depending on your OS, you may install it with apt, yum, brew, etc., or, like I did, download a VMware image with all the necessary stuff. There are several providers, like Cloudera or IBM BigInsights. I chose the latter because I learned big data concepts on bigdatauniversity.com, an initiative from IBM.
Once you have downloaded the BigInsights VMware image, boot it, log in with biadmin/biadmin and click the Start BigInsights button; after a few minutes, Hadoop will be up and running. Go to http://bivm:8080/data/html/index.html#redirect-welcome in the VM's Firefox and you can check it.
Once you have a Hadoop cluster to play with, it is time to code something. But first, you need to analyze the input. I use a small sample here, but real data sets can be terabytes, exabytes or more: billions of lines with this format:
id ; Agente Registrador ; Total dominios;
1 ; 1&1 Internet ; 382.972;
36 ; WEIS CONSULTING ; 4.154;
71 ; MESH DIGITAL LIMITED ; 910;
This is the mapper. Its purpose is to emit a list of key/value pairs, one per input line:
import java.io.IOException;

import org.apache.commons.lang.math.NumberUtils;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DominiosRegistradorMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private static final String SEPARATOR = ";";

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException,
            InterruptedException {
        /*
         * Each input line looks like:
         * id ; Agente Registrador ; Total dominios;
         * 1 ; 1&1 Internet ; 382.972;
         * 36 ; WEIS CONSULTING ; 4.154;
         * 71 ; MESH DIGITAL LIMITED ; 910;
         */
        final String[] values = value.toString().split(SEPARATOR);
        if (values.length < 3) {
            return; // skip malformed lines
        }
        final String agente = format(values[1]);
        final String totalDominios = format(values[2]);
        if (NumberUtils.isNumber(totalDominios)) {
            context.write(new Text(agente), new DoubleWritable(NumberUtils.toDouble(totalDominios)));
        }
    }

    private String format(String value) {
        return value.trim();
    }
}
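If you want to check the mapper without launching the cluster, you can write a little unit test. This is a minimal sketch using Apache MRUnit (not part of the original project: you would need to add the mrunit dependency, and the test class name is mine):

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class DominiosRegistradorMapperTest {

    @Test
    public void emitsRegistrarAndTotalDomains() throws Exception {
        // one well-formed line must produce exactly one (registrar, total) pair
        MapDriver.newMapDriver(new DominiosRegistradorMapper())
                .withInput(new LongWritable(0), new Text("71 ; MESH DIGITAL LIMITED ; 910;"))
                .withOutput(new Text("MESH DIGITAL LIMITED"), new DoubleWritable(910))
                .runTest();
    }
}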
This is the reducer:
import java.io.IOException;
import java.text.DecimalFormat;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DominiosRegistradorReducer extends Reducer<Text, DoubleWritable, Text, Text> {

    private final DecimalFormat decimalFormat = new DecimalFormat("#.###");

    @Override
    public void reduce(Text key, Iterable<DoubleWritable> totalDominiosValues, Context context)
            throws IOException, InterruptedException {
        double maxTotalDominios = 0.0;
        for (DoubleWritable totalDominiosValue : totalDominiosValues) {
            final double total = totalDominiosValue.get();
            maxTotalDominios = Math.max(maxTotalDominios, total);
        }
        // keep only the registrar's largest number of domains
        context.write(key, new Text(decimalFormat.format(maxTotalDominios)));
    }
}
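And the same idea for the reducer, again a sketch assuming the mrunit dependency: given several totals for the same registrar, it must keep only the maximum.

import java.util.Arrays;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class DominiosRegistradorReducerTest {

    @Test
    public void keepsTheMaximumTotal() throws Exception {
        // out of 800.5 and 910, only 910 must be written
        ReduceDriver.newReduceDriver(new DominiosRegistradorReducer())
                .withInput(new Text("MESH DIGITAL LIMITED"),
                        Arrays.asList(new DoubleWritable(800.5), new DoubleWritable(910)))
                .withOutput(new Text("MESH DIGITAL LIMITED"), new Text("910"))
                .runTest();
    }
}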
This is the main class:
import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class App extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("DominiosRegistradorManager required params: {input file} {output dir}");
            System.exit(-1);
        }
        deleteOutputFileIfExists(args);

        final Job job = new Job(getConf(), "DominiosRegistradorManager");
        job.setJarByClass(App.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(DominiosRegistradorMapper.class);
        job.setReducerClass(DominiosRegistradorReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
        return 0;
    }

    private void deleteOutputFileIfExists(String[] args) throws IOException {
        final Path output = new Path(args[1]);
        FileSystem.get(output.toUri(), getConf()).delete(output, true);
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new App(), args);
    }
}
Now that you have a glimpse of the code, you can download it and import it into your Eclipse. Once imported, you need to build a jar. With that jar and the cluster online, you are almost ready to launch the code, but first you need the huge text file with the data from http://datos.gob.es: download it and export it to your cluster. I recommend using the browser for that: click on Start BigInsights if you haven't done it yet, open the BigInsights web console and click Files. On the left you can see an HDFS tree, that is the Hadoop file system. Expand it until /user/biadmin/ and create a directory, for example inputMR, so you can see /user/biadmin/inputMR in your tree. You must upload the example file to that directory. You need to create an outputMR directory as well.
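If you prefer the terminal over the web console, you can do the same upload with the HDFS shell. A minimal sketch, assuming the downloaded file is called dominios.csv (the real file name will differ):

[biadmin@bivm ~]$ hadoop fs -mkdir /user/biadmin/inputMR
[biadmin@bivm ~]$ hadoop fs -put dominios.csv /user/biadmin/inputMR/
[biadmin@bivm ~]$ hadoop fs -ls /user/biadmin/inputMR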
[biadmin@bivm ~]$ hadoop jar nameOfYourJar.jar /user/biadmin/inputMR /user/biadmin/outputMR
14/05/12 12:09:24 INFO input.FileInputFormat: Total input paths to process : 2
14/05/12 12:09:24 WARN snappy.LoadSnappy: Snappy native library is available
14/05/12 12:09:24 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/05/12 12:09:24 INFO snappy.LoadSnappy: Snappy native library loaded
14/05/12 12:09:24 INFO mapred.JobClient: Running job: job_201405121126_0059
14/05/12 12:09:25 INFO mapred.JobClient: map 0% reduce 0%
14/05/12 12:09:31 INFO mapred.JobClient: map 50% reduce 0%
14/05/12 12:09:34 INFO mapred.JobClient: map 100% reduce 0%
14/05/12 12:09:43 INFO mapred.JobClient: map 100% reduce 100%
14/05/12 12:09:44 INFO mapred.JobClient: Job complete: job_201405121126_0059
14/05/12 12:09:44 INFO mapred.JobClient: Counters: 29
14/05/12 12:09:44 INFO mapred.JobClient: Job Counters
14/05/12 12:09:44 INFO mapred.JobClient: Data-local map tasks=2
14/05/12 12:09:44 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=8827
14/05/12 12:09:44 INFO mapred.JobClient: Launched map tasks=2
14/05/12 12:09:44 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/05/12 12:09:44 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/05/12 12:09:44 INFO mapred.JobClient: Launched reduce tasks=1
14/05/12 12:09:44 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10952
14/05/12 12:09:44 INFO mapred.JobClient: File Input Format Counters
14/05/12 12:09:44 INFO mapred.JobClient: Bytes Read=197
14/05/12 12:09:44 INFO mapred.JobClient: File Output Format Counters
14/05/12 12:09:44 INFO mapred.JobClient: Bytes Written=19
14/05/12 12:09:44 INFO mapred.JobClient: FileSystemCounters
14/05/12 12:09:44 INFO mapred.JobClient: HDFS_BYTES_READ=413
14/05/12 12:09:44 INFO mapred.JobClient: FILE_BYTES_WRITTEN=76101
14/05/12 12:09:44 INFO mapred.JobClient: FILE_BYTES_READ=50
14/05/12 12:09:44 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=19
14/05/12 12:09:44 INFO mapred.JobClient: Map-Reduce Framework
14/05/12 12:09:44 INFO mapred.JobClient: Virtual memory (bytes) snapshot=3867070464
14/05/12 12:09:44 INFO mapred.JobClient: Reduce input groups=2
14/05/12 12:09:44 INFO mapred.JobClient: Combine output records=4
14/05/12 12:09:44 INFO mapred.JobClient: Map output records=4
14/05/12 12:09:44 INFO mapred.JobClient: CPU time spent (ms)=1960
14/05/12 12:09:44 INFO mapred.JobClient: Map input records=2
14/05/12 12:09:44 INFO mapred.JobClient: Reduce shuffle bytes=56
14/05/12 12:09:44 INFO mapred.JobClient: Combine input records=4
14/05/12 12:09:44 INFO mapred.JobClient: Spilled Records=8
14/05/12 12:09:44 INFO mapred.JobClient: SPLIT_RAW_BYTES=216
14/05/12 12:09:44 INFO mapred.JobClient: Map output bytes=36
14/05/12 12:09:44 INFO mapred.JobClient: Reduce input records=4
14/05/12 12:09:44 INFO mapred.JobClient: Physical memory (bytes) snapshot=697741312
14/05/12 12:09:44 INFO mapred.JobClient: Total committed heap usage (bytes)=746494976
14/05/12 12:09:44 INFO mapred.JobClient: Reduce output records=2
14/05/12 12:09:44 INFO mapred.JobClient: Map output materialized bytes=56
[biadmin@bivm ~]$
If you see something like this, congrats! Your MapReduce task is done, and the results are in /user/biadmin/outputMR.
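You can print the results from the console with the HDFS shell too; the part file name may vary, but something like this should do it:

[biadmin@bivm ~]$ hadoop fs -cat /user/biadmin/outputMR/part-r-00000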
The source code is available at https://github.com/alonsoir/mrDominioRegistrador.
The data comes from http://datos.gob.es.
More about MapReduce: http://en.wikipedia.org/wiki/MapReduce
Enjoy!