1. Introduction to MapReduce
MapReduce is a programming model used to process and generate large datasets across a distributed system. It simplifies parallel computation by abstracting the processing into two primary functions: Map and Reduce.
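Conceptually, the two functions have the following signatures (in the notation of the original MapReduce formulation):

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)

The Map function turns each input record into zero or more intermediate key-value pairs, and the Reduce function collapses all values that share an intermediate key into the final output.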
2. Core Components of MapReduce
2.1 Map Function
The Map function processes input data record by record (for text input, line by line) and emits intermediate key-value pairs. In the word count example below, each line is tokenized into words and the pair (word, 1) is emitted for every occurrence.
public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // The key is the byte offset of the line; the value is the line itself.
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // Emit (word, 1) for every token in the line.
            word.set(itr.nextToken());
            output.collect(word, one);
        }
    }
}
2.2 Reduce Function
The Reduce function aggregates values associated with a common key. It takes the intermediate key-value pairs from the Map phase and combines the values for each key to produce the final result.
public static class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum the counts emitted for this word across all mappers.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
3. MapReduce Workflow
3.1 Input Data
Input data is presented to the Map phase as key-value pairs; for plain text input, each line of the file becomes one record.
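For example, with Hadoop's default TextInputFormat, a small file containing

the cat sat
the mat

is presented to the Map function as the records (0, "the cat sat") and (12, "the mat"), where each key is the byte offset at which the line begins.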
3.2 Execution Phases
- Map Phase: Transforms input data into intermediate key-value pairs.
- Shuffle and Sort: Groups intermediate pairs by key.
- Reduce Phase: Aggregates values of each grouped key to produce the output.
3.3 Example
For the word count problem, the Map function emits (word, 1) for every occurrence of a word, the shuffle groups these pairs by word, and the Reduce function sums the counts for each unique word. For instance, the line "the cat sat on the mat" yields (the, 1) twice, so the reducer for the key the receives the values [1, 1] and emits (the, 2).
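Putting the pieces together, a minimal driver configures and submits the job. The sketch below assumes the classic org.apache.hadoop.mapred API used in the snippets above and assumes MapClass and ReduceClass are nested inside the driver class; the class name WordCount and the command-line handling of input and output paths are illustrative.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // Types of the keys and values emitted by the job.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // The Map and Reduce classes defined in Section 2.
        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(ReduceClass.class);
        // Input and output locations, taken from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // Submit the job and block until it completes.
        JobClient.runJob(conf);
    }
}

Such a job would then typically be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>.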
4. Practical Applications of MapReduce
- Web Indexing: Search engines use MapReduce for indexing web pages.
- Log Analysis: Processing large-scale server logs to identify patterns.
- Data Mining: Analyzing massive datasets for trends and insights.
- Social Network Analysis: Identifying mutual connections in networks like Twitter or Facebook.
5. Sample Use Case: Mutual Followers
5.1 Problem Statement
Given a dataset of user relationships (a, b), where a follows b, find all pairs of users who mutually follow each other.
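For example, if the input contains the edges (alice, bob), (bob, alice), and (alice, carol), then alice and bob follow each other and the pair should be reported, while (alice, carol) should not, since carol does not follow alice back.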
5.2 Map Function
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    // Each input line is an edge "a,b" meaning user a follows user b.
    String line = value.toString();
    String[] users = line.split(",");
    // Emit the pair under a canonical (alphabetically ordered) key
    // so that (a,b) and (b,a) map to the same key.
    if (users[0].compareTo(users[1]) < 0) {
        output.collect(new Text("(" + users[0] + "," + users[1] + ")"), new IntWritable(1));
    } else {
        output.collect(new Text("(" + users[1] + "," + users[0] + ")"), new IntWritable(1));
    }
}
5.3 Reduce Function
public void reduce(Text key, Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    int count = 0;
    while (values.hasNext()) {
        count += values.next().get();
    }
    // A canonical pair seen from both directions has a count of 2,
    // i.e. both (a,b) and (b,a) were present in the input
    // (this assumes each edge appears at most once).
    if (count == 2) {
        output.collect(key, new IntWritable(count));
    }
}
6. Key Features of MapReduce
- Scalability: Processes large datasets using distributed computing.
- Fault Tolerance: Failed map or reduce tasks are automatically re-executed, so the computation survives machine failures.
- Data Locality: Processes data close to its storage location, reducing I/O.
- Simplicity: Abstracts parallel programming complexities for developers.