1. Introduction to MapReduce
MapReduce is a programming model used to process and generate large datasets across a distributed system. It simplifies parallel computation by abstracting the processing into two primary functions: Map and Reduce.
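Conceptually, the two functions have the following signatures (in the notation of the original MapReduce formulation):

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)

The Map function turns each input record into zero or more intermediate key-value pairs, and the Reduce function collapses all values that share an intermediate key into the final output.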
2. Core Components of MapReduce
2.1 Map Function
The Map function processes input data record by record (for text input, line by line) and emits intermediate key-value pairs. In the word count example below, each line is tokenized into words and the pair (word, 1) is emitted for every occurrence.
public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // The key is the byte offset of the line; the value is the line itself.
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // Emit (word, 1) for every token in the line.
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
2.2 Reduce Function
The Reduce function aggregates values associated with a common key. It takes the intermediate key-value pairs from the Map phase and combines the values for each key to produce the final result.
public static class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum the counts emitted for this word across all mappers.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
3. MapReduce Workflow
3.1 Input Data
Input data is presented to the Map phase as key-value pairs; for plain text input, each line of the file becomes one record.
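For example, with Hadoop's default TextInputFormat, a small file containing

the cat sat
the mat

is presented to the Map function as the records (0, "the cat sat") and (12, "the mat"), where each key is the byte offset at which the line begins.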
3.2 Execution Phases
- Map Phase: Transforms input data into intermediate key-value pairs.
- Shuffle and Sort: Groups intermediate pairs by key.
- Reduce Phase: Aggregates values of each grouped key to produce the output.
3.3 Example
For the word count problem, the Map function emits (word, 1) for every occurrence of a word, the shuffle groups these pairs by word, and the Reduce function sums the counts for each unique word. For instance, the line "the cat sat on the mat" yields (the, 1) twice, so the reducer for the key the receives the values [1, 1] and emits (the, 2).
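Putting the pieces together, a minimal driver configures and submits the job. The sketch below assumes the classic org.apache.hadoop.mapred API used in the snippets above and assumes MapClass and ReduceClass are nested inside the driver class; the class name WordCount and the command-line handling of input and output paths are illustrative.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // Types of the keys and values emitted by the job.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // The Map and Reduce classes defined in Section 2.
        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(ReduceClass.class);
        // Input and output locations, taken from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Submit the job and block until it completes.
        JobClient.runJob(conf);
    }
}

Such a job would then typically be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>.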
4. Practical Applications of MapReduce
- Web Indexing: Search engines use MapReduce for indexing web pages.
- Log Analysis: Processing large-scale server logs to identify patterns.
- Data Mining: Analyzing massive datasets for trends and insights.
- Social Network Analysis: Identifying mutual connections in networks like Twitter or Facebook.
5. Sample Use Case: Mutual Followers
5.1 Problem Statement
Given a dataset of user relationships (a, b), where a follows b, find all pairs of users who mutually follow each other.
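For example, if the input contains the edges (alice, bob), (bob, alice), and (alice, carol), then alice and bob follow each other and the pair should be reported, while (alice, carol) should not, since carol does not follow alice back.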
5.2 Map Function
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // Each input line is an edge "a,b" meaning user a follows user b.
    String line = value.toString();
    String[] users = line.split(",");
    // Emit the pair under a canonical (alphabetically ordered) key
    // so that (a,b) and (b,a) map to the same key.
    if (users[0].compareTo(users[1]) < 0) {
        output.collect(new Text("(" + users[0] + "," + users[1] + ")"), new IntWritable(1));
    } else {
        output.collect(new Text("(" + users[1] + "," + users[0] + ")"), new IntWritable(1));
    }
}
5.3 Reduce Function
public void reduce(Text key, Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    int count = 0;
    while (values.hasNext()) {
        count += values.next().get();
    }
    // A canonical pair seen from both directions has a count of 2,
    // i.e. both (a,b) and (b,a) were present in the input
    // (this assumes each edge appears at most once).
    if (count == 2) {
        output.collect(key, new IntWritable(count));
    }
}
6. Key Features of MapReduce
- Scalability: Processes large datasets using distributed computing.
- Fault Tolerance: Failed map or reduce tasks are automatically re-executed, so the computation survives machine failures.
- Data Locality: Processes data close to its storage location, reducing I/O.
- Simplicity: Abstracts parallel programming complexities for developers.