MapReduce - What, How & Why - DMJCCLT - dmj.one

1. What is MapReduce?

MapReduce is a programming model and a scalable processing technique designed for handling large datasets. It is used primarily in distributed systems, such as Hadoop, to process and generate massive amounts of data in parallel across clusters of computers.

1.1 Key Concepts

  1. Map: a user-defined function that transforms each input record into intermediate key-value pairs.
  2. Reduce: a user-defined function that merges all intermediate values associated with the same key.
  3. Key-Value Pairs: the uniform data model passed between the map and reduce phases.
  4. Parallelism: map and reduce tasks run independently across the nodes of a cluster.
  5. Fault Tolerance: failed tasks are detected and re-executed automatically by the framework.

2. How is MapReduce Used?

2.1 Workflow

The execution of a MapReduce program involves the following steps:

  1. Input Splitting: The input dataset is divided into independent chunks (input splits), each of which is processed by a separate map task.
  2. Mapping: The Mapper function processes each record of its input split and emits intermediate key-value pairs.
  3. Shuffling and Sorting: The intermediate pairs are grouped and sorted by key, so that all values for a given key arrive at the same Reducer. The framework performs this step automatically.
  4. Reducing: The Reducer function merges the grouped values for each key to produce the final output.
  5. Output: The results are written in a specified format, typically to a distributed storage system such as HDFS.
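The five steps above can be sketched at toy scale in plain Java. This is a single-process illustration only (the class name MiniMapReduce and its helper methods are hypothetical, and no Hadoop is involved): each "split" is mapped to (word, 1) pairs, the pairs are grouped by key, and each group is reduced to a sum.

```java
import java.util.*;

// A minimal, single-process sketch of the MapReduce workflow:
// map -> shuffle (group by key) -> reduce. Hypothetical names; not Hadoop.
public class MiniMapReduce {

  // Map phase: emit one (token, 1) pair per word in the line.
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String token : line.split("\\s+")) {
      if (!token.isEmpty()) {
        pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
      }
    }
    return pairs;
  }

  // Shuffle phase: group intermediate values by key (sorted via TreeMap).
  static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> p : pairs) {
      grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
    }
    return grouped;
  }

  // Reduce phase: sum the counts collected for one key.
  static int reduce(List<Integer> values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
  }

  public static void main(String[] args) {
    // Two "input splits"; in a real cluster each would be mapped independently.
    String[] splits = { "the quick brown fox", "the lazy dog" };
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    for (String split : splits) {
      intermediate.addAll(map(split));
    }
    Map<String, List<Integer>> grouped = shuffle(intermediate);
    for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
      System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
    }
  }
}
```

In a real cluster the shuffle step also partitions keys across many reducer machines; here a single TreeMap stands in for that machinery.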

2.2 Implementation

Here is an example of a simple MapReduce implementation for counting word occurrences, written against Hadoop's classic org.apache.hadoop.mapred API:

// Required imports (classic org.apache.hadoop.mapred API)
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Mapper Class: tokenizes each line and emits a (word, 1) pair per token
public static class MapClass extends MapReduceBase 
  implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, 
      OutputCollector<Text, IntWritable> output, Reporter reporter) 
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

// Reducer Class: sums the counts collected for each word
public static class ReduceClass extends MapReduceBase 
  implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, 
      OutputCollector<Text, IntWritable> output, Reporter reporter) 
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

// Driver Code
public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(ReduceClass.class);

  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));
  JobClient.runJob(conf);
}

3. Why is MapReduce Used?

3.1 Advantages

  1. Scalability: jobs scale horizontally by adding nodes to the cluster.
  2. Fault Tolerance: the framework automatically re-runs map or reduce tasks that fail.
  3. Simplicity: developers write only the map and reduce functions; partitioning, scheduling, and data movement are handled by the framework.
  4. Data Locality: computation is moved to the nodes where the data resides, reducing network traffic.

3.2 Use Cases

  1. Log Analysis: aggregating and summarizing large volumes of server logs.
  2. Search Indexing: building inverted indexes over large document collections.
  3. ETL Pipelines: large-scale data cleaning and transformation.
  4. Analytics: counting, joining, and aggregating records at web scale.