MapReduce - What, How & Why
2024, December 24
MapReduce is a programming model and a scalable processing technique designed for handling large datasets. It is used primarily in distributed systems, such as Hadoop, to process and generate massive amounts of data in parallel across clusters of computers.
The execution of a MapReduce program involves the following steps:
1. Map: the Mapper function processes each input split and emits intermediate key-value pairs.
2. Shuffle and sort: the framework groups all intermediate values by key before they reach the reducers.
3. Reduce: the Reducer function processes each group of values to produce the final output.
Here is an example of a simple MapReduce implementation (using Hadoop's classic org.apache.hadoop.mapred API) for counting word occurrences:
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Mapper Class
public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Tokenize the line and emit (word, 1) for each token
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
// Reducer Class
public static class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum all counts emitted for this word
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
// Driver Code
public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // Output types for the job
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(ReduceClass.class);

    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));

    JobClient.runJob(conf);
}
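To see the three phases without a Hadoop cluster, the same word count can be sketched as a single-process Java program. This is only an illustration of the map → shuffle → reduce data flow; the LocalWordCount class and its method names are made up for this example and are not part of any Hadoop API:

```java
import java.util.*;
import java.util.AbstractMap.SimpleEntry;

public class LocalWordCount {
    // Map phase: emit a (word, 1) pair for every token in every line
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    pairs.add(new SimpleEntry<>(token, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle phase: group the intermediate values by key (sorted, like Hadoop)
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
        return groups;
    }

    // Reduce phase: sum the grouped counts for each word
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) {
                sum += v;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the quick brown fox", "the lazy dog", "the fox");
        Map<String, Integer> counts = reduce(shuffle(map(input)));
        System.out.println(counts);
        // prints {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
    }
}
```

In a real cluster the shuffle step also partitions keys across machines, so each reducer sees only its share of the key space; here the TreeMap stands in for that sorted grouping.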