About how to use spring-data-hadoop

Hi, after some holidays I am back. This post is about how to use the spring-data technology with Apache Hadoop, or how to write a map reduce task using Spring. The idea is to show how to focus on the most important part of Apache Hadoop: writing the map reduce task.

So, let's begin with the project. The most impatient (like myself) can find the project sources here.

The example is a Maven project, so we can see the dependencies in the pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>
    <groupId>es.aironman.samples</groupId>
    <artifactId>my-spring-data-mapreduce</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <description>my sample about how to code a map reduce task for hadoop using spring-data</description>

    <properties>
        <apache.hadoop.version>1.0.3</apache.hadoop.version>
        <slf4j.version>1.6.1</slf4j.version>
        <spring.version>3.1.2.RELEASE</spring.version>
        <spring.data.hadoop.version>1.0.0.RELEASE</spring.data.hadoop.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-beans</artifactId>
            <version>${spring.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>${spring.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context-support</artifactId>
            <version>${spring.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-context</artifactId>
            <version>${spring.version}</version>
        </dependency>
        <dependency>
            <groupId>cglib</groupId>
            <artifactId>cglib</artifactId>
            <version>2.2.2</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.data</groupId>
            <artifactId>spring-data-hadoop</artifactId>
            <version>${spring.data.hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>${apache.hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.16</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.9</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.mockito</groupId>
            <artifactId>mockito-core</artifactId>
            <version>1.8.5</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <finalName>my-spring-data-mapreduce</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.2.2</version>
                <configuration>
                    <descriptors>
                        <descriptor>src/main/assembly/assembly.xml</descriptor>
                    </descriptors>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.3.1</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                            <mainClass>net.petrikainulainen.spring.data.apachehadoop.Main</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-site-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <reportPlugins>
                        <plugin>
                            <groupId>org.codehaus.mojo</groupId>
                            <artifactId>cobertura-maven-plugin</artifactId>
                            <version>2.5.1</version>
                        </plugin>
                    </reportPlugins>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

If you look at this file, you may tell me that I am not using the latest versions of the dependencies! I promise to update them over time, or you can send a pull request to the GitHub project 😉.
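The jar manifest points at net.petrikainulainen.spring.data.apachehadoop.Main, which is not shown in this post. A minimal sketch of such a launcher, assuming it only has to bootstrap the Spring context so that the job runner declared there fires, could look like this:

package net.petrikainulainen.spring.data.apachehadoop;

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

// Hypothetical sketch of the launcher referenced in the jar manifest.
// It just loads applicationContext.xml from the classpath; the job runner
// declared there is what actually submits the map reduce job.
public class Main {

    public static void main(String[] args) {
        ApplicationContext context = new ClassPathXmlApplicationContext("applicationContext.xml");
    }
}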

In the applicationContext.xml file we can see how the spring-data project declares which map reduce job is going to be executed. These are the configuration properties it passes to Hadoop (a fuller sketch of the file follows below):

fs.default.name=${fs.default.name}
mapred.job.tracker=${mapred.job.tracker}
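
A minimal sketch of the whole applicationContext.xml, assuming the standard Spring for Apache Hadoop namespace and hypothetical bean ids (only the two configuration properties above are taken from the post), could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
                           http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- Loads the connection and path settings from application.properties. -->
    <context:property-placeholder location="classpath:application.properties"/>

    <!-- The two configuration properties shown above. -->
    <hdp:configuration>
        fs.default.name=${fs.default.name}
        mapred.job.tracker=${mapred.job.tracker}
    </hdp:configuration>

    <!-- Hypothetical job id; mapper and reducer are the classes shown later in the post. -->
    <hdp:job id="dominiosRegistradorJob"
             input-path="${input.path}"
             output-path="${output.path}"
             mapper="es.aironman.samples.spring.data.hadoop.DominiosRegistradorMapper"
             reducer="es.aironman.samples.spring.data.hadoop.DominiosRegistradorReducer"/>

    <!-- Submits the job as soon as the Spring context starts. -->
    <hdp:job-runner id="jobRunner" job-ref="dominiosRegistradorJob" run-at-startup="true"/>
</beans>

The hdp:job-runner with run-at-startup="true" is what actually submits the job when the Spring context is loaded.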

There is an application.properties file with the necessary config data: where the HDFS (Hadoop Distributed File System) is, where the Hadoop job tracker is listening, the input data path with the data to be filtered, and the output data path for the result. Please do not forget to erase the output data directory if you launch the map reduce task more than once (a command for that is shown after the properties file).

application.properties

fs.default.name=hdfs://localhost:9000
mapred.job.tracker=localhost:9001

input.path=/input/
output.path=/output/

You may need to replace localhost with the IP address of your Hadoop machine, so check it!
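
To erase the output directory before re-running the job, a one-liner along these lines should work on Hadoop 1.x (the path is the one defined in application.properties above):

hadoop fs -rmr /output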

Now the map reduce classes. They are the same as in another project I have talked about in this blog, so I will not delve deeper into them here.

I think the code is already well documented. The mapper class is:


package es.aironman.samples.spring.data.hadoop;

import java.io.IOException;

import org.apache.commons.lang.math.NumberUtils;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DominiosRegistradorMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private static final String SEPARATOR = ";";

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException,
            InterruptedException {

        // Each input line looks like "id ; Agente Registrador ; Total dominios;"
        final String[] values = value.toString().split(SEPARATOR);

        if (values.length >= 3) {
            String agent = format(values[1]);
            String totalDomains = format(values[2]);

            // Skip the header line and anything that is not a number.
            if (NumberUtils.isNumber(totalDomains)) {
                context.write(new Text(agent), new DoubleWritable(NumberUtils.toDouble(totalDomains)));
            }
        }
    }

    private String format(String value) {
        return value.trim();
    }

}

You can see that the data file has this format:

id ; Agente Registrador ; Total dominios;
1 ; 1&1 Internet ; 382.972;
36 ; WEIS CONSULTING ; 4.154;
71 ; MESH DIGITAL LIMITED ; 910;

The idea of the mapper is to split every line by ";", take each Agente Registrador (registrar agent) and its Total dominios (total domains), and write the pair to the Hadoop context. This is a very simple Hadoop task; in this phase you could choose which registrar agents you want to write to the context (see the sketch below), but for simplicity I write every agent with its total domains to the context.
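As an illustration, a hypothetical filter inside map() (not in the original code) that only emits a single chosen registrar could look like this:

// Hypothetical filtering step inside map(): emit only one chosen registrar.
// "1&1 Internet" is just an example value taken from the sample data above.
if ("1&1 Internet".equals(agent) && NumberUtils.isNumber(totalDomains)) {
    context.write(new Text(agent), new DoubleWritable(NumberUtils.toDouble(totalDomains)));
}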

Now the reducer class:


package es.aironman.samples.spring.data.hadoop;

import java.io.IOException;
import java.text.DecimalFormat;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/***
 *
 * This Reducer keeps the largest value of the total of registered domains for each agent.
 * @author aironman
 *
 */
public class DominiosRegistradorReducer extends Reducer<Text, DoubleWritable, Text, Text> {

    private final DecimalFormat decimalFormat = new DecimalFormat("#.###");

    @Override
    public void reduce(Text key, Iterable<DoubleWritable> totalDominiosValues, Context context)
            throws IOException, InterruptedException {
        double maxTotalDomains = 0.0;
        for (DoubleWritable totalDominiosValue : totalDominiosValues) {
            double total = totalDominiosValue.get();

            // Keep the largest total seen so far for this agent.
            maxTotalDomains = Math.max(maxTotalDomains, total);
        }
        context.write(key, new Text(decimalFormat.format(maxTotalDomains)));
    }

}

As you can guess, in this phase I am keeping only the maximum total domains for each agent. Maybe you want to calculate the minimum or the average; to emit several statistics at once you would write a custom Writable, but that is beyond the scope of this post (a rough sketch follows below). Keep an eye on this post for future updates.
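For what it's worth, here is a minimal sketch of such a variant, assuming a hypothetical DominiosRegistradorStatsReducer class and that emitting both values as plain text is good enough (a custom Writable would keep them typed):

package es.aironman.samples.spring.data.hadoop;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/***
 * Hypothetical variant (not part of the original project): keeps the minimum and the
 * average instead of the maximum.
 */
public class DominiosRegistradorStatsReducer extends Reducer<Text, DoubleWritable, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<DoubleWritable> totalDominiosValues, Context context)
            throws IOException, InterruptedException {
        double min = Double.MAX_VALUE;
        double sum = 0.0;
        long count = 0;
        for (DoubleWritable value : totalDominiosValues) {
            double total = value.get();
            min = Math.min(min, total);
            sum += total;
            count++;
        }
        double average = count > 0 ? sum / count : 0.0;
        // Emit both statistics as plain text; a custom Writable would let us keep them typed.
        context.write(key, new Text("min=" + min + ", avg=" + average));
    }

}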

Now that's it! You can assemble the jar with this command:

mvn clean assembly:assembly

If everything is OK, you will see output like this:

SLF4J: Failed to load class “org.slf4j.impl.StaticLoggerBinder”.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[INFO] Scanning for projects…
[INFO]
[INFO] ————————————————————————
[INFO] Building my-spring-data-mapreduce 0.0.1-SNAPSHOT
[INFO] ————————————————————————
[INFO]
[INFO] — maven-clean-plugin:2.4.1:clean (default-clean) @ my-spring-data-mapreduce —
[INFO] Deleting /Users/aironman/Documents/ws-spring-data-hadoop/my-spring-data-mapreduce/target
[INFO]
[INFO] ————————————————————————
[INFO] Building my-spring-data-mapreduce 0.0.1-SNAPSHOT
[INFO] ————————————————————————
[INFO]
[INFO] >>> maven-assembly-plugin:2.2.2:assembly (default-cli) @ my-spring-data-mapreduce >>>
[INFO]
[INFO] — maven-resources-plugin:2.5:resources (default-resources) @ my-spring-data-mapreduce —
[debug] execute contextualize
[INFO] Using ‘UTF-8’ encoding to copy filtered resources.
[INFO] Copying 3 resources
[INFO]
[INFO] — maven-compiler-plugin:2.3.2:compile (default-compile) @ my-spring-data-mapreduce —
[INFO] Compiling 3 source files to /Users/aironman/Documents/ws-spring-data-hadoop/my-spring-data-mapreduce/target/classes
[INFO]
[INFO] — maven-resources-plugin:2.5:testResources (default-testResources) @ my-spring-data-mapreduce —
[debug] execute contextualize
[INFO] Using ‘UTF-8’ encoding to copy filtered resources.
[INFO] Copying 0 resource
[INFO]
[INFO] — maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ my-spring-data-mapreduce —
[INFO] Nothing to compile – all classes are up to date
[INFO]
[INFO] — maven-surefire-plugin:2.10:test (default-test) @ my-spring-data-mapreduce —
[INFO] Surefire report directory: /Users/aironman/Documents/ws-spring-data-hadoop/my-spring-data-mapreduce/target/surefire-reports

——————————————————-
T E S T S
——————————————————-

Results :

Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] — maven-jar-plugin:2.3.1:jar (default-jar) @ my-spring-data-mapreduce —
[INFO] Building jar: /Users/aironman/Documents/ws-spring-data-hadoop/my-spring-data-mapreduce/target/my-spring-data-mapreduce.jar
[INFO]
[INFO] <<< maven-assembly-plugin:2.2.2:assembly (default-cli) @ my-spring-data-mapreduce <<<
[INFO]
[INFO] — maven-assembly-plugin:2.2.2:assembly (default-cli) @ my-spring-data-mapreduce —
[INFO] Reading assembly descriptor: src/main/assembly/assembly.xml
[INFO] Building zip: /Users/aironman/Documents/ws-spring-data-hadoop/my-spring-data-mapreduce/target/my-spring-data-mapreduce-bin.zip
[INFO] ————————————————————————
[INFO] BUILD SUCCESS
[INFO] ————————————————————————
[INFO] Total time: 4.245s
[INFO] Finished at: Wed Aug 13 11:59:42 CEST 2014
[INFO] Final Memory: 16M/315M
[INFO] ————————————————————————

The assembly phase will produce a zip file; unzip it on your Hadoop cluster and launch the startup.sh script.
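
For example, something along these lines, assuming the zip is already copied to a cluster node and that the assembly unpacks startup.sh at the top of a directory named after the project (the exact layout depends on assembly.xml):

unzip my-spring-data-mapreduce-bin.zip
cd my-spring-data-mapreduce
sh startup.sh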

Enjoy!

Update

This is the link to the spring-data project.
