Homework Assignment 1 - DMJCCLT - dmj.one

Homework Assignment 1: Real-World Cloud Computing and Distributed Systems Challenge

This assignment combines practical application, critical thinking, and problem-solving skills. You will create a mini-cloud computing system simulation to understand the concepts of distributed systems, MapReduce, and cloud computing paradigms. Follow the instructions carefully and apply your knowledge creatively.

Scenario: Building a Cloud-Based File Word Counter

Imagine you work for a company that handles large-scale text processing for analytics. Your task is to design a distributed system using MapReduce to calculate the word count of uploaded files. The system must simulate a real-world cloud environment.

Objectives

Instructions

Follow these steps to complete your homework:

1. Set Up a Simulated Distributed Environment
2. Implement the Map Function

The Map function should process each line of a given text file and emit one key-value pair per word, where the key is the word and the value is 1. Summing these values into totals is left to the Reduce step.


# Example: Python Map Function
def map_function(line):
    words = line.split()
    return [(word, 1) for word in words]
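
As a quick check of the example above, here is a minimal sketch that repeats the function and applies it to one line. The sample sentence is made up for illustration and is not part of the assignment data.

```python
def map_function(line):
    # Split on whitespace and emit one (word, 1) pair per token
    words = line.split()
    return [(word, 1) for word in words]

sample_line = "the cat saw the dog"  # hypothetical input line
pairs = map_function(sample_line)
print(pairs)
# [('the', 1), ('cat', 1), ('saw', 1), ('the', 1), ('dog', 1)]
```

Note that "the" appears twice as separate pairs; Map does not aggregate anything itself.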
3. Simulate Data Distribution
4. Implement the Reduce Function

The Reduce function should aggregate the counts of words from all nodes and produce the final word count.


# Example: Python Reduce Function
from collections import defaultdict

def reduce_function(mapped_data):
    word_count = defaultdict(int)
    for word, count in mapped_data:
        word_count[word] += count
    return word_count
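
To see how the two example functions fit together, the following sketch maps two lines and reduces the combined pairs. The input lines are hypothetical, and both functions are repeated so the snippet runs on its own.

```python
from collections import defaultdict

def map_function(line):
    return [(word, 1) for word in line.split()]

def reduce_function(mapped_data):
    word_count = defaultdict(int)
    for word, count in mapped_data:
        word_count[word] += count
    return word_count

lines = ["the cat saw the dog", "the dog ran"]  # hypothetical input
# Flatten the per-line pair lists into one stream of (word, 1) pairs
mapped = [pair for line in lines for pair in map_function(line)]
totals = reduce_function(mapped)
print(dict(totals))
# {'the': 3, 'cat': 1, 'saw': 1, 'dog': 2, 'ran': 1}
```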
5. Simulate Fault Tolerance
6. Visualize the Results
7. Report and Submit

Create a report that includes:

Evaluation Criteria

Your assignment will be evaluated based on:

Additional Notes

Solution: Cloud-Based File Word Counter Using MapReduce

Important: Please attempt the assignment on your own before referring to this solution. The process of trying, failing, and learning will strengthen your understanding and skills.

Step-by-Step Solution

1. Setting Up the Simulated Distributed Environment

We simulate nodes using threads to handle tasks in parallel.


import queue
import threading

# Simulate nodes as worker threads
class Node(threading.Thread):
    def __init__(self, node_id, task_queue, results):
        super().__init__()
        self.id = node_id
        self.task_queue = task_queue
        self.results = results  # shared list; list.append is thread-safe in CPython

    def run(self):
        # Pull chunks until the shared queue is drained
        while True:
            try:
                chunk = self.task_queue.get_nowait()
            except queue.Empty:
                break
            self.results.append(map_function(chunk))
2. Implementing the Map Function

The Map function takes a chunk (a list of lines from the text file) and returns a list of (word, 1) pairs, one per word.


def map_function(chunk):
    word_counts = []
    for line in chunk:
        words = line.split()
        word_counts.extend([(word, 1) for word in words])
    return word_counts
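
The Map function above treats "The", "the" and "the," as three different words. If that is not wanted, one possible variation is to normalize each token first. This is only a sketch: `normalize` and `map_function_normalized` are hypothetical names, and lowercasing plus stripping ASCII punctuation is an assumed policy, not a requirement of the assignment.

```python
import string

def normalize(word):
    # Lowercase and strip leading/trailing ASCII punctuation
    return word.lower().strip(string.punctuation)

def map_function_normalized(chunk):
    word_counts = []
    for line in chunk:
        for raw in line.split():
            word = normalize(raw)
            if word:  # skip tokens that were only punctuation
                word_counts.append((word, 1))
    return word_counts

print(map_function_normalized(["The cat, the dog.", "-- THE end!"]))
# [('the', 1), ('cat', 1), ('the', 1), ('dog', 1), ('the', 1), ('end', 1)]
```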
3. Distributing Data

Divide the text file into chunks for processing by nodes.


from queue import Queue

# Split file into chunks
def split_file(file_path, num_chunks):
    with open(file_path, 'r') as f:
        lines = f.readlines()
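    # Ceiling division keeps the chunk count at or below num_chunks;
    # max(1, ...) avoids a zero step when the file is empty
    chunk_size = max(1, -(-len(lines) // num_chunks))
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]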

# Initialize task queue
task_queue = Queue()
chunks = split_file('input.txt', 4)
for chunk in chunks:
    task_queue.put(chunk)
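
Fixed-size chunking can leave the last chunk much smaller than the others. As an alternative, the sketch below spreads the remainder so that chunk sizes differ by at most one line. `split_evenly` is a hypothetical helper that works on an in-memory list of lines, so it can be tried without an input file.

```python
def split_evenly(lines, num_chunks):
    # The first `extra` chunks each get one more line than the rest
    base, extra = divmod(len(lines), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer lines than chunks: stop instead of emitting empties
        chunks.append(lines[start:start + size])
        start += size
    return chunks

lines = [f"line {n}" for n in range(10)]  # hypothetical data
print([len(c) for c in split_evenly(lines, 4)])
# [3, 3, 2, 2]
```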
4. Implementing the Reduce Function

The Reduce function aggregates word counts from all nodes.


from collections import defaultdict

def reduce_function(mapped_results):
    final_word_count = defaultdict(int)
    for result in mapped_results:
        for word, count in result:
            final_word_count[word] += count
    return final_word_count
5. Simulating Fault Tolerance

Randomly simulate node failure and reassign failed tasks to remaining nodes.


import random

def simulate_failure(node_list):
    # Randomly deactivate one node
    failed_node = random.choice(node_list)
    print(f"Node {failed_node.id} failed!")
    node_list.remove(failed_node)
    return node_list
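
Removing a node from the list does not by itself recover any task that node was holding. One possible way to model reassignment is sketched below: a crashing worker records the chunk it lost, and a coordinator re-runs the lost chunks on a surviving worker. The names `worker`, `run_round` and `crash_flags` are hypothetical, and the two-round retry is a simplification rather than how a production MapReduce scheduler works.

```python
import queue
import threading

def map_function(chunk):
    return [(word, 1) for line in chunk for word in line.split()]

def worker(task_queue, results, failed, crashes):
    # Drain the queue; a crashing worker gives up its first task and stops
    while True:
        try:
            chunk = task_queue.get_nowait()
        except queue.Empty:
            return
        if crashes:
            failed.append(chunk)  # record the lost task for the coordinator
            return
        results.append(map_function(chunk))

def run_round(chunks, crash_flags):
    # One thread per flag; returns (mapped results, chunks lost to crashes)
    task_queue = queue.Queue()
    for chunk in chunks:
        task_queue.put(chunk)
    results, failed = [], []
    threads = [threading.Thread(target=worker,
                                args=(task_queue, results, failed, flag))
               for flag in crash_flags]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failed

chunks = [["a b"], ["b c"], ["c d"]]  # hypothetical chunks
done, lost = run_round(chunks, crash_flags=[True, False])   # node 0 crashes
retried, still_lost = run_round(lost, crash_flags=[False])  # survivor retries
all_results = done + retried
print(len(all_results), len(still_lost))
# 3 0
```

Collecting lost chunks after the threads have joined avoids a race in which a healthy worker sees an empty queue and exits just before a failed worker hands its task back.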
6. Running the Simulation

Execute the MapReduce job with fault tolerance and aggregation.


if __name__ == "__main__":
    num_nodes = 4
    results = []
    nodes = [Node(i, task_queue, results) for i in range(num_nodes)]

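    # Simulate failure before the workers start: the failed node never runs,
    # and the remaining nodes drain every chunk from the shared queue
    nodes = simulate_failure(nodes)

    for node in nodes:
        node.start()

    for node in nodes:
        node.join()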

    # Aggregate results
    word_count = reduce_function(results)
    print(word_count)
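
A simple way to check that no chunk was lost during the simulation is to compare the distributed result with a single-process count of the same lines. The sketch below shows such a baseline; `sequential_word_count` is a hypothetical helper and the sample lines are made up.

```python
from collections import Counter

def sequential_word_count(lines):
    # Single-process reference: count whitespace-separated tokens
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

lines = ["the cat saw the dog", "the dog ran"]  # hypothetical input
baseline = sequential_word_count(lines)
print(baseline["the"], baseline["dog"])
# 3 2
```

If every chunk was processed, converting both the MapReduce output and this baseline to plain dictionaries should give equal results.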
7. Visualizing Results

Generate a bar chart of the word counts.


import matplotlib.pyplot as plt

def visualize_results(word_count):
    words = list(word_count.keys())
    counts = list(word_count.values())

    plt.bar(words, counts)
    plt.xlabel("Words")
    plt.ylabel("Frequency")
    plt.title("Word Count Visualization")
    plt.show()

visualize_results(word_count)
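
With a real text file the vocabulary is usually too large for one readable bar chart. A possible refinement, sketched here, is to keep only the most frequent words before plotting. `top_n` is a hypothetical helper, and the tie-breaking rule (alphabetical) is an assumption.

```python
def top_n(word_count, n=10):
    # Sort by count descending, then alphabetically for a stable order
    ranked = sorted(word_count.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n])

sample = {"the": 3, "dog": 2, "cat": 1, "ran": 1, "saw": 1}  # hypothetical counts
print(top_n(sample, n=3))
# {'the': 3, 'dog': 2, 'cat': 1}
```

The returned dictionary can be passed to visualize_results in place of the full word count.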

Encouragement

Learning comes from doing! If you attempted the assignment and then reviewed the solution, you’re on the right path. Reflect on the differences between your approach and the solution to deepen your understanding. Keep practicing and building your problem-solving skills!