MapReduce - What, How & Why - DMJCCLT - dmj.one

1. What is MapReduce?

MapReduce is a programming model and a scalable processing technique designed for handling large datasets. It is used primarily in distributed systems, such as Hadoop, to process and generate massive amounts of data in parallel across clusters of computers.

1.1 Key Concepts

  1. Map: a user-defined function that transforms each input record into intermediate key-value pairs.
  2. Reduce: a user-defined function that merges all intermediate values associated with the same key.
  3. Key-Value Pairs: the uniform data model passed between the map and reduce phases.
  4. Parallelism: map and reduce tasks run independently across the nodes of a cluster.
  5. Fault Tolerance: failed tasks are detected and re-executed automatically by the framework.

2. How is MapReduce Used?

2.1 Workflow

The execution of a MapReduce program involves the following steps:

  1. Input Splitting: The input dataset is divided into independent chunks (input splits), each of which is processed by a separate map task.
  2. Mapping: The Mapper function processes each record of its input split and emits intermediate key-value pairs.
  3. Shuffling and Sorting: The intermediate pairs are grouped and sorted by key, so that all values for a given key arrive at the same Reducer. The framework performs this step automatically.
  4. Reducing: The Reducer function merges the grouped values for each key to produce the final output.
  5. Output: The results are written in a specified format, typically to a distributed storage system such as HDFS.
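The five steps above can be sketched at toy scale in plain Java. This is a single-process illustration only (the class name MiniMapReduce and its helper methods are hypothetical, and no Hadoop is involved): each "split" is mapped to (word, 1) pairs, the pairs are grouped by key, and each group is reduced to a sum.

```java
import java.util.*;

// A minimal, single-process sketch of the MapReduce workflow:
// map -> shuffle (group by key) -> reduce. Hypothetical names; not Hadoop.
public class MiniMapReduce {

  // Map phase: emit one (token, 1) pair per word in the line.
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String token : line.split("\\s+")) {
      if (!token.isEmpty()) {
        pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
      }
    }
    return pairs;
  }

  // Shuffle phase: group intermediate values by key (sorted via TreeMap).
  static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> p : pairs) {
      grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
    }
    return grouped;
  }

  // Reduce phase: sum the counts collected for one key.
  static int reduce(List<Integer> values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
  }

  public static void main(String[] args) {
    // Two "input splits"; in a real cluster each would be mapped independently.
    String[] splits = { "the quick brown fox", "the lazy dog" };
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    for (String split : splits) {
      intermediate.addAll(map(split));
    }
    Map<String, List<Integer>> grouped = shuffle(intermediate);
    for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
      System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
    }
  }
}
```

In a real cluster the shuffle step also partitions keys across many reducer machines; here a single TreeMap stands in for that machinery.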

2.2 Implementation

Here is an example of a simple MapReduce implementation for counting word occurrences, written against Hadoop's classic org.apache.hadoop.mapred API:

// Required imports (classic org.apache.hadoop.mapred API)
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Mapper Class: tokenizes each line and emits a (word, 1) pair per token
public static class MapClass extends MapReduceBase 
  implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, 
      OutputCollector<Text, IntWritable> output, Reporter reporter) 
      throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

// Reducer Class: sums the counts collected for each word
public static class ReduceClass extends MapReduceBase 
  implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, 
      OutputCollector<Text, IntWritable> output, Reporter reporter) 
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

// Driver Code
public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(ReduceClass.class);

  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));
  JobClient.runJob(conf);
}

3. Why is MapReduce Used?

3.1 Advantages

  1. Scalability: jobs scale horizontally by adding nodes to the cluster.
  2. Fault Tolerance: the framework automatically re-runs map or reduce tasks that fail.
  3. Simplicity: developers write only the map and reduce functions; partitioning, scheduling, and data movement are handled by the framework.
  4. Data Locality: computation is moved to the nodes where the data resides, reducing network traffic.

3.2 Use Cases

  1. Log Analysis: aggregating and summarizing large volumes of server logs.
  2. Search Indexing: building inverted indexes over large document collections.
  3. ETL Pipelines: large-scale data cleaning and transformation.
  4. Analytics: counting, joining, and aggregating records at web scale.