Homework Assignment 1 - DMJCCLT

Homework Assignment 1: Real-World Cloud Computing and Distributed Systems Challenge

This assignment combines practical application, critical thinking, and problem-solving skills. You will create a mini-cloud computing system simulation to understand the concepts of distributed systems, MapReduce, and cloud computing paradigms. Follow the instructions carefully and apply your knowledge creatively.

Scenario: Building a Cloud-Based File Word Counter

Imagine you work for a company that handles large-scale text processing for analytics. Your task is to design a distributed system using MapReduce to calculate the word count of uploaded files. The system must simulate a real-world cloud environment.

Objectives

Understand the concept of distributed systems and their applications.
Implement a simple MapReduce program to count words in a file.
Explore and simulate the characteristics of cloud computing such as scalability and fault tolerance.
Demonstrate teamwork and report insights effectively (for team submissions).

Instructions

Follow these steps to complete your homework:

1. Set Up a Simulated Distributed Environment

Use Python (or Java, if you are more comfortable with it) to simulate a cloud environment.
Create multiple "nodes" (simulated as separate scripts or threads).
Each node will act as a worker for MapReduce tasks.
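
As one possible starting point for this step, worker nodes can be simulated with a thread pool from Python's standard library. The sketch below is only an illustration, not a required design; `run_on_nodes` and `count_words_in_chunk` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words_in_chunk(chunk):
    # Hypothetical per-node task: count whitespace-separated words in a chunk
    return sum(len(line.split()) for line in chunk)

def run_on_nodes(chunks, num_nodes=4):
    # Each pool thread stands in for one worker node; map() returns
    # results in the same order as the input chunks
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        return list(pool.map(count_words_in_chunk, chunks))

chunks = [["a b c", "d e"], ["f g"], ["h"]]
totals = run_on_nodes(chunks, num_nodes=2)
print(totals)  # [5, 2, 1]
```

Separate scripts or hand-written thread classes work equally well; the pool simply keeps the bookkeeping short.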

2. Implement the Map Function

The Map function should process each line of a given text file and emit one (word, 1) key-value pair per word occurrence.


# Example: Python Map Function
def map_function(line):
    words = line.split()
    return [(word, 1) for word in words]
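
Because `line.split()` only splits on whitespace, "Cloud," and "cloud" would be counted as different words. If you want them merged, one option is to normalize the text inside the mapper. The following is a minimal sketch under that assumption; `normalized_map_function` is a hypothetical name, and whether to normalize at all is a design choice you should justify in your report.

```python
import re

def normalized_map_function(line):
    # Lowercase the line, then keep runs of letters, digits and apostrophes
    words = re.findall(r"[a-z0-9']+", line.lower())
    return [(word, 1) for word in words]

pairs = normalized_map_function("The cloud, the Cloud!")
print(pairs)  # [('the', 1), ('cloud', 1), ('the', 1), ('cloud', 1)]
```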

3. Simulate Data Distribution

Split a large text file into smaller chunks to simulate distribution across nodes.
Each chunk will be processed by a different node using the Map function.
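
To see how a file's lines might be divided among nodes, the sketch below splits an in-memory list into a fixed number of chunks whose sizes differ by at most one line. `split_lines` is a hypothetical helper for illustration; other splitting strategies (by bytes, by fixed line count) are just as valid.

```python
def split_lines(lines, num_chunks):
    # Spread the remainder over the first chunks so sizes differ by at most one
    base, extra = divmod(len(lines), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks

lines = [f"line {i}" for i in range(10)]
sizes = [len(chunk) for chunk in split_lines(lines, 4)]
print(sizes)  # [3, 3, 2, 2]
```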

4. Implement the Reduce Function

The Reduce function should aggregate the counts of words from all nodes and produce the final word count.


# Example: Python Reduce Function
from collections import defaultdict

def reduce_function(mapped_data):
    word_count = defaultdict(int)
    for word, count in mapped_data:
        word_count[word] += count
    return word_count
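
To check that the two example functions fit together, you can chain them on a couple of lines. The sketch below repeats both definitions so it runs on its own; the sample lines are made up for illustration.

```python
from collections import defaultdict

def map_function(line):
    return [(word, 1) for word in line.split()]

def reduce_function(mapped_data):
    word_count = defaultdict(int)
    for word, count in mapped_data:
        word_count[word] += count
    return word_count

lines = ["to be or not", "to be"]
# Flatten the per-line pairs into one list, as reduce_function expects
mapped = [pair for line in lines for pair in map_function(line)]
result = dict(reduce_function(mapped))
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```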

5. Simulate Fault Tolerance

Randomly simulate a failure in one or more nodes during execution.
Implement a mechanism to retry failed tasks on another node.
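
One simple way to model a retry is to offer a chunk to each node in turn and skip any node that has been marked as crashed. The sketch below assumes a failure model in which crashed nodes are known up front; `run_with_retry`, `count_words`, and `failed_nodes` are hypothetical names, and a threaded system would need to detect failures at run time instead.

```python
import random

def run_with_retry(task, chunk, node_ids, failed_nodes):
    # Offer the chunk to each node in turn; nodes listed in failed_nodes
    # are treated as crashed, so the chunk moves on to the next node
    for node_id in node_ids:
        if node_id in failed_nodes:
            print(f"Node {node_id} failed, retrying on another node")
            continue
        return node_id, task(chunk)
    raise RuntimeError("every node failed for this chunk")

def count_words(chunk):
    return sum(len(line.split()) for line in chunk)

node_ids = [0, 1, 2, 3]
failed_nodes = {random.choice(node_ids)}  # one randomly chosen node crashes
node_id, total = run_with_retry(count_words, ["a b", "c"], node_ids, failed_nodes)
print(f"Chunk processed by node {node_id}: {total} words")
```

Whichever node is chosen to fail, the chunk is still processed exactly once by a healthy node.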

6. Visualize the Results

Display the final word count using a bar chart or any suitable visualization.
You may use Python libraries such as Matplotlib or Seaborn for this.

7. Report and Submit

Create a report that includes:

A brief introduction to the task and its objectives.
The code for your solution with comments explaining each part.
Observations about scalability, fault tolerance, and performance.
Suggestions for improving the system.

Evaluation Criteria

Your assignment will be evaluated based on:

Correctness of the MapReduce implementation (30%).
Simulation of distributed and fault-tolerant environments (30%).
Clarity and structure of the report (20%).
Visualization and creativity (20%).

Additional Notes

You may work individually or in pairs.
Submission deadline: [Insert Deadline Here].
Ensure your code is well-documented and easy to understand.
If you face issues, document your challenges and how you attempted to overcome them in your report.

Solution: Cloud-Based File Word Counter Using MapReduce

Important: Please attempt the assignment on your own before referring to this solution. The process of trying, failing, and learning will strengthen your understanding and skills.

Step-by-Step Solution

1. Setting Up the Simulated Distributed Environment

Each node is simulated as a thread that pulls chunks from a shared task queue, so several chunks can be in progress at once.


import threading
from queue import Empty

# Simulate nodes as worker threads
class Node(threading.Thread):
    def __init__(self, id, task_queue, results):
        super().__init__()
        self.id = id  # read by simulate_failure() when reporting the failed node
        self.task_queue = task_queue
        self.results = results

    def run(self):
        # Pull chunks until the shared queue is drained
        while True:
            try:
                chunk = self.task_queue.get_nowait()
            except Empty:
                # Another node took the last chunk; nothing left to do
                break
            self.results.append(map_function(chunk))

2. Implementing the Map Function

The Map function processes every line in a chunk and returns a list of (word, 1) pairs.


def map_function(chunk):
    word_counts = []
    for line in chunk:
        words = line.split()
        word_counts.extend([(word, 1) for word in words])
    return word_counts

3. Distributing Data

Divide the text file into chunks for processing by nodes.


from queue import Queue

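# Split file into at most num_chunks chunks of roughly equal size
def split_file(file_path, num_chunks):
    with open(file_path, 'r') as f:
        lines = f.readlines()
    # Round up so leftover lines do not spill into an extra chunk;
    # max(1, ...) avoids a zero step when the file has fewer lines than chunks
    chunk_size = max(1, -(-len(lines) // num_chunks))
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]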

# Initialize task queue
task_queue = Queue()
chunks = split_file('input.txt', 4)
for chunk in chunks:
    task_queue.put(chunk)

4. Implementing the Reduce Function

The Reduce function aggregates word counts from all nodes.


from collections import defaultdict

def reduce_function(mapped_results):
    final_word_count = defaultdict(int)
    for result in mapped_results:
        for word, count in result:
            final_word_count[word] += count
    return final_word_count

5. Simulating Fault Tolerance

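Pick a node at random and mark it as failed by removing it from the list of active nodes. This only updates the bookkeeping: the underlying thread is not interrupted, and no task is reassigned by this function itself.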


import random

def simulate_failure(node_list):
    # Randomly deactivate one node
    failed_node = random.choice(node_list)
    print(f"Node {failed_node.id} failed!")
    node_list.remove(failed_node)
    return node_list
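
To have a failed task actually run again, a worker can hand its chunk back to the shared queue before it stops, so that another node picks it up. The sketch below illustrates that idea under stated assumptions: `FlakyNode` and `failure_rate` are hypothetical names, the per-chunk work is a plain word count rather than the full Map function, and a reliable backup node is started at the end in case every flaky node crashed.

```python
import random
import threading
from queue import Queue, Empty

class FlakyNode(threading.Thread):
    def __init__(self, node_id, task_queue, results, failure_rate=0.3):
        super().__init__()
        self.node_id = node_id
        self.task_queue = task_queue
        self.results = results
        self.failure_rate = failure_rate

    def run(self):
        while True:
            try:
                chunk = self.task_queue.get_nowait()
            except Empty:
                return
            if random.random() < self.failure_rate:
                # Simulated crash: requeue the chunk and stop this node
                print(f"Node {self.node_id} failed, requeueing its chunk")
                self.task_queue.put(chunk)
                return
            self.results.append(sum(len(line.split()) for line in chunk))

task_queue = Queue()
for chunk in [["a b"], ["c d e"], ["f"]]:
    task_queue.put(chunk)

results = []
nodes = [FlakyNode(i, task_queue, results) for i in range(3)]
for node in nodes:
    node.start()
for node in nodes:
    node.join()

# Chunks left behind by crashed nodes go to a backup node that never fails
if not task_queue.empty():
    backup = FlakyNode(99, task_queue, results, failure_rate=0.0)
    backup.start()
    backup.join()

print(sum(results))  # 6
```

Each chunk is either processed or returned to the queue, so the total is the same no matter which nodes crash.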

6. Running the Simulation

Execute the MapReduce job with fault tolerance and aggregation.


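if __name__ == "__main__":
    num_nodes = 4
    results = []
    nodes = [Node(i, task_queue, results) for i in range(num_nodes)]

    for node in nodes:
        node.start()

    # Keep a reference to every started thread: the "failed" node is
    # dropped from the active list, but its thread keeps running
    started_nodes = list(nodes)

    # Simulate failure
    nodes = simulate_failure(nodes)

    # Wait for every started thread so no mapped output is missed
    for node in started_nodes:
        node.join()

    # Aggregate results
    word_count = reduce_function(results)
    print(word_count)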

7. Visualizing Results

Generate a bar chart of the word counts.


import matplotlib.pyplot as plt

def visualize_results(word_count):
    words = list(word_count.keys())
    counts = list(word_count.values())

    plt.bar(words, counts)
    plt.xlabel("Words")
    plt.ylabel("Frequency")
    plt.title("Word Count Visualization")
    plt.show()

visualize_results(word_count)
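
For a large input, a bar for every distinct word quickly becomes unreadable. One option is to plot only the most frequent words. The sketch below shows the selection step on its own; `top_words` is a hypothetical helper and the sample counts are made up.

```python
from collections import Counter

def top_words(word_count, n=10):
    # most_common() sorts by count, highest first
    top = Counter(word_count).most_common(n)
    words = [word for word, _ in top]
    counts = [count for _, count in top]
    return words, counts

words, counts = top_words({"the": 5, "cloud": 3, "a": 4, "node": 1}, n=3)
print(words, counts)  # ['the', 'a', 'cloud'] [5, 4, 3]
```

The two returned lists can be passed directly to `plt.bar(words, counts)`.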

Encouragement

Learning comes from doing! If you attempted the assignment and then reviewed the solution, you’re on the right path. Reflect on the differences between your approach and the solution to deepen your understanding. Keep practicing and building your problem-solving skills!