This assignment combines practical application, critical thinking, and problem solving. You will build a small simulation of a cloud computing system to explore distributed systems, MapReduce, and cloud computing paradigms. Follow the instructions carefully and apply your knowledge creatively.
Scenario: Building a Cloud-Based File Word Counter
Imagine you work for a company that handles large-scale text processing for analytics. Your task is to design a distributed system using MapReduce to calculate the word count of uploaded files. The system must simulate a real-world cloud environment.
Objectives
- Understand the concept of distributed systems and their applications.
- Implement a simple MapReduce program to count words in a file.
- Explore and simulate the characteristics of cloud computing such as scalability and fault tolerance.
- Demonstrate teamwork and report insights effectively (for team submissions).
Instructions
Follow these steps to complete your homework:
1. Set Up a Simulated Distributed Environment
- Use Python (or Java if comfortable) to simulate a cloud environment.
- Create multiple "nodes" (simulated as separate scripts or threads).
- Each node will act as a worker for MapReduce tasks.
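One possible way to set this up, shown only as a minimal sketch: a thread pool in which each pool thread stands in for one node. The names `process_chunk` and `run_on_nodes` are hypothetical and are not part of the assignment; separate scripts or explicit `threading.Thread` subclasses would work just as well.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Hypothetical worker task: count the words in one chunk of lines."""
    return sum(len(line.split()) for line in chunk)

def run_on_nodes(chunks, num_nodes=4):
    # Each pool thread stands in for one simulated node.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        # pool.map returns results in the same order as the input chunks.
        return list(pool.map(process_chunk, chunks))

chunks = [["a b c", "d e"], ["f g"], ["h"]]
totals = run_on_nodes(chunks, num_nodes=2)
print(totals)  # [5, 2, 1]
```

Changing `num_nodes` while keeping the input fixed is a simple way to start exploring scalability in this simulation.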
2. Implement the Map Function
The Map function should process each line of a given text file and emit one key-value pair per word: the word as the key and 1 as its count.
# Example: Python Map Function
def map_function(line):
    words = line.split()
    return [(word, 1) for word in words]
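Because `split()` separates on whitespace only, "Cat," and "cat" would be counted as different words. The variant below is a sketch of one way to normalize tokens first; the name `normalized_map_function` and the choice of which characters count as part of a word are assumptions, not requirements of the assignment.

```python
import re

def normalized_map_function(line):
    # Lowercase the line, then keep runs of letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", line.lower())
    return [(word, 1) for word in words]

pairs = normalized_map_function("The cat, the CAT!")
print(pairs)  # [('the', 1), ('cat', 1), ('the', 1), ('cat', 1)]
```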
3. Simulate Data Distribution
- Split a large text file into smaller chunks to simulate distribution across nodes.
- Each chunk will be processed by a different node using the Map function.
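The splitting step can be sketched on an in-memory list of lines as follows. `split_lines` is a hypothetical helper; it assumes the whole file fits in memory and spreads any remainder over the first chunks so that chunk sizes differ by at most one line.

```python
def split_lines(lines, num_chunks):
    """Split a list of lines into at most num_chunks nearly equal chunks."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(len(lines), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one additional line.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer lines than chunks: stop instead of adding empty chunks
        chunks.append(lines[start:start + size])
        start += size
    return chunks

lines = [f"line {n}" for n in range(10)]
print([len(c) for c in split_lines(lines, 4)])  # [3, 3, 2, 2]
```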
4. Implement the Reduce Function
The Reduce function should aggregate the counts of words from all nodes and produce the final word count.
# Example: Python Reduce Function
from collections import defaultdict

def reduce_function(mapped_data):
    word_count = defaultdict(int)
    for word, count in mapped_data:
        word_count[word] += count
    return word_count
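To see how the two example functions fit together, here is a small worked run on two lines of text. Both functions are repeated so the snippet runs on its own; the input lines are made up for illustration.

```python
from collections import defaultdict

def map_function(line):
    return [(word, 1) for word in line.split()]

def reduce_function(mapped_data):
    word_count = defaultdict(int)
    for word, count in mapped_data:
        word_count[word] += count
    return word_count

lines = ["to be or not", "to be"]
mapped = []
for line in lines:
    # Collect the (word, 1) pairs from every line into one flat list.
    mapped.extend(map_function(line))

result = reduce_function(mapped)
print(dict(result))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```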
5. Simulate Fault Tolerance
- Randomly simulate a failure in one or more nodes during execution.
- Implement a mechanism to retry failed tasks on another node.
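A retry mechanism could look roughly like the sketch below, which runs sequentially to keep the idea easy to follow. `NodeFailure`, `run_on_node`, `run_with_retry`, and the `failure_rate` parameter are all hypothetical names chosen for illustration; a threaded version would need additional coordination.

```python
import random

class NodeFailure(Exception):
    """Raised when a simulated node crashes."""

def run_on_node(node_id, chunk, failure_rate, rng):
    # Fail at random with probability `failure_rate`.
    if rng.random() < failure_rate:
        raise NodeFailure(f"node {node_id} failed")
    return [(word, 1) for line in chunk for word in line.split()]

def run_with_retry(chunk, node_ids, failure_rate=0.3, rng=None):
    if rng is None:
        rng = random.Random()
    for node_id in node_ids:  # try each node at most once, in order
        try:
            return node_id, run_on_node(node_id, chunk, failure_rate, rng)
        except NodeFailure as err:
            print(f"{err}; retrying on another node")
    raise RuntimeError("all nodes failed for this chunk")

node_id, pairs = run_with_retry(["a b", "a"], node_ids=[0, 1, 2],
                                failure_rate=0.0)
print(node_id, pairs)  # 0 [('a', 1), ('b', 1), ('a', 1)]
```

Passing a seeded `random.Random` instance as `rng` makes a failure pattern reproducible, which helps when describing observations in the report.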
6. Visualize the Results
- Display the final word count using a bar chart or any suitable visualization.
- You may use Python libraries such as Matplotlib or Seaborn for this.
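A real text file usually contains far more distinct words than fit legibly on one chart, so it may help to plot only the most frequent ones. The sketch below selects them with `collections.Counter`; `top_words` is a hypothetical helper, and the cut-off `n` is an arbitrary choice.

```python
from collections import Counter

def top_words(word_count, n=10):
    # most_common returns (word, count) pairs sorted by descending count.
    return Counter(word_count).most_common(n)

word_count = {"the": 5, "cat": 2, "sat": 1, "mat": 3}
print(top_words(word_count, n=2))  # [('the', 5), ('mat', 3)]
```

The resulting pairs can be unzipped into labels and heights and passed to a bar chart function of your chosen plotting library.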
7. Report and Submit
Create a report that includes:
- A brief introduction to the task and its objectives.
- The code for your solution with comments explaining each part.
- Observations about scalability, fault tolerance, and performance.
- Suggestions for improving the system.
Evaluation Criteria
Your assignment will be evaluated based on:
- Correctness of the MapReduce implementation (30%).
- Simulation of distributed and fault-tolerant environments (30%).
- Clarity and structure of the report (20%).
- Visualization and creativity (20%).
Additional Notes
- You may work individually or in pairs.
- Submission deadline: [Insert Deadline Here].
- Ensure your code is well-documented and easy to understand.
- If you face issues, document your challenges and how you attempted to overcome them in your report.
Solution: Cloud-Based File Word Counter Using MapReduce
Important: Please attempt the assignment on your own before referring to this solution. The process of trying, failing, and learning will strengthen your understanding and skills.
Step-by-Step Solution
1. Setting Up the Simulated Distributed Environment
We simulate nodes using threads to handle tasks in parallel.
import queue
import threading

# Simulate nodes as worker threads
class Node(threading.Thread):
    def __init__(self, id, task_queue, results):
        super().__init__()
        self.id = id
        self.task_queue = task_queue
        self.results = results

    def run(self):
        # Pull chunks until the shared queue is empty
        while True:
            try:
                chunk = self.task_queue.get_nowait()
            except queue.Empty:
                break
            self.results.append(map_function(chunk))
2. Implementing the Map Function
The Map function processes the lines in a chunk of the text file and returns a (word, 1) pair for each word.
def map_function(chunk):
    word_counts = []
    for line in chunk:
        words = line.split()
        word_counts.extend([(word, 1) for word in words])
    return word_counts
3. Distributing Data
Divide the text file into chunks for processing by nodes.
from queue import Queue

# Split file into at most num_chunks chunks of lines
def split_file(file_path, num_chunks):
    with open(file_path, 'r') as f:
        lines = f.readlines()
    # Round up so no extra chunk is created, and never use a step of 0
    chunk_size = max(1, -(-len(lines) // num_chunks))
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

# Initialize task queue
task_queue = Queue()
chunks = split_file('input.txt', 4)
for chunk in chunks:
    task_queue.put(chunk)
4. Implementing the Reduce Function
The Reduce function aggregates word counts from all nodes.
from collections import defaultdict

def reduce_function(mapped_results):
    final_word_count = defaultdict(int)
    for result in mapped_results:
        for word, count in result:
            final_word_count[word] += count
    return final_word_count
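As an optional refinement, each node can pre-aggregate its own counts before the reduce step, which shrinks the data passed between stages. MapReduce systems often call this a combiner. The sketch below is an assumption about how that could look here, not part of the solution above; `map_with_combiner` and `reduce_counters` are hypothetical names.

```python
from collections import Counter

def map_with_combiner(chunk):
    # Count words locally, so each node returns one entry per distinct word.
    local_counts = Counter()
    for line in chunk:
        local_counts.update(line.split())
    return local_counts

def reduce_counters(partial_counts):
    total = Counter()
    for counts in partial_counts:
        # Updating a Counter from a mapping adds the counts together.
        total.update(counts)
    return total

partials = [map_with_combiner(["a b a"]), map_with_combiner(["b c"])]
print(dict(reduce_counters(partials)))  # {'a': 2, 'b': 2, 'c': 1}
```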
5. Simulating Fault Tolerance
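Randomly pick one node and mark it as failed. Because all nodes pull chunks from the same task queue, chunks the failed node never took are picked up by the remaining nodes. Note that this simplified version only removes the node from the list of active nodes; it does not stop the underlying thread.
import random

def simulate_failure(node_list):
    # Randomly deactivate one node
    failed_node = random.choice(node_list)
    print(f"Node {failed_node.id} failed!")
    node_list.remove(failed_node)
    return node_list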
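For a failure that actually interrupts work, one option is to let a designated worker "crash" on the first chunk it takes and hand that chunk back to the queue. The sketch below is an illustration under stated assumptions: `worker`, `run_job`, and `failing_id` are hypothetical names, exactly one node fails, and at least one other node survives to finish the job.

```python
import threading
from queue import Queue

def map_function(chunk):
    return [(word, 1) for line in chunk for word in line.split()]

def worker(node_id, task_queue, results, failing_id):
    while True:
        chunk = task_queue.get()
        try:
            if chunk is None:          # sentinel: no more work
                return
            if node_id == failing_id:  # simulated crash on the first task
                print(f"Node {node_id} failed; requeueing its chunk")
                task_queue.put(chunk)  # hand the chunk back before exiting
                return
            results.append(map_function(chunk))
        finally:
            task_queue.task_done()     # runs on every path, including returns

def run_job(chunks, num_nodes=3, failing_id=0):
    if num_nodes < 2:
        raise ValueError("need at least one node besides the failing one")
    task_queue, results = Queue(), []
    for chunk in chunks:
        task_queue.put(chunk)
    threads = [
        threading.Thread(target=worker,
                         args=(i, task_queue, results, failing_id))
        for i in range(num_nodes)
    ]
    for t in threads:
        t.start()
    task_queue.join()          # returns once every chunk has been mapped
    for _ in threads:
        task_queue.put(None)   # tell the surviving workers to stop
    for t in threads:
        t.join()
    return results

mapped = run_job([["a b"], ["a"], ["c a"]], num_nodes=3, failing_id=0)
print(len(mapped))  # 3
```

The requeued chunk is added before `task_done()` is called, so `task_queue.join()` cannot return while that chunk is still unprocessed. The order of the entries in `mapped` depends on thread scheduling.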
6. Running the Simulation
Execute the MapReduce job with fault tolerance and aggregation.
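if __name__ == "__main__":
    num_nodes = 4
    results = []
    nodes = [Node(i, task_queue, results) for i in range(num_nodes)]
    for node in nodes:
        node.start()
    # Simulate failure (pass a copy so the original list keeps every thread)
    active_nodes = simulate_failure(list(nodes))
    print(f"{len(active_nodes)} nodes remain active")
    # Wait for every thread, including the "failed" one, so that no result
    # is appended after aggregation starts
    for node in nodes:
        node.join()
    # Aggregate results
    word_count = reduce_function(results)
    print(word_count)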
7. Visualizing Results
Generate a bar chart of the word counts.
import matplotlib.pyplot as plt

def visualize_results(word_count):
    words = list(word_count.keys())
    counts = list(word_count.values())
    plt.bar(words, counts)
    plt.xlabel("Words")
    plt.ylabel("Frequency")
    plt.title("Word Count Visualization")
    plt.show()

visualize_results(word_count)
Encouragement
Learning comes from doing! If you attempted the assignment and then reviewed the solution, you’re on the right path. Reflect on the differences between your approach and the solution to deepen your understanding. Keep practicing and building your problem-solving skills!