Homework 2 - DMJCCLT - dmj.one

Homework 2: Multidisciplinary Grid and Gossip-Based Distributed Systems Design

Objective: Design a real-world distributed system that optimally integrates grid computing and gossip-based protocols to monitor, allocate resources, and detect failures in a federated infrastructure, like a healthcare system or smart city grid.

Task Description

Develop a prototype for a Smart City Infrastructure Management System (SCIMS). The system must:

Skills You Will Develop

Steps to Complete

1. Analyze the System Requirements

Identify key municipal tasks that require distributed computation (e.g., real-time traffic optimization). Determine potential failure scenarios for each task (e.g., IoT sensors losing connectivity).

2. Grid Computing Design

3. Gossip Protocol Implementation

4. Fault Tolerance Analysis

Simulate failure scenarios where up to 30% of nodes fail. Use statistical models to analyze how quickly the system recovers using gossip and grid computing.

5. Scalability Testing

Optimize the system to handle real-time operations with 10,000+ devices. Ensure that the communication overhead and latency remain within acceptable limits.

Deliverables

Evaluation Criteria

Bonus Challenge

Optimize your design to include suspicion mechanisms for false-positive reduction in failure detection (e.g., using SWIM's suspicion model).

Solution for Multidisciplinary Grid and Gossip-Based Distributed Systems Design

Below is a detailed solution to the Smart City Infrastructure Management System (SCIMS) design. The solution integrates grid computing for resource allocation and gossip protocols for failure detection and fault tolerance.

System Architecture

The architecture combines the following components:

Flow: The grid layer assigns tasks, while the gossip layer ensures system health and updates membership in case of failures.

Code Implementation

1. Grid Task Scheduling (HTCondor Example)

from htcondor import Schedd, ClassAd

def submit_task(task_script, requirements):
    schedd = Schedd()
    submit = ClassAd({
        'Executable': task_script,
        'Requirements': requirements,
        'Log': 'job.log',
        'Output': 'job.out',
        'Error': 'job.err'
    })
    with schedd.transaction() as txn:
        submit.queue(txn)
    print("Task submitted to grid scheduler.")
  
2. Gossip Protocol for Failure Detection

import random

class GossipProtocol:
    def __init__(self, membership_list):
        self.membership_list = membership_list
        self.failed_nodes = set()

    def gossip(self):
        for node in self.membership_list:
            if random.choice([True, False]):  # Simulate random message delivery
                self.send_heartbeat(node)

    def send_heartbeat(self, node):
        if node not in self.failed_nodes:
            print(f"Heartbeat sent to {node}")

    def detect_failure(self, node):
        print(f"Node {node} detected as failed")
        self.failed_nodes.add(node)
  
3. Simulation of Fault Tolerance

def simulate_failures(protocol, failure_rate):
    total_nodes = len(protocol.membership_list)
    failed = int(total_nodes * failure_rate)
    for i in range(failed):
        node = protocol.membership_list.pop(0)
        protocol.detect_failure(node)

    protocol.gossip()
    print(f"Active nodes: {len(protocol.membership_list)}")
    print(f"Failed nodes: {len(protocol.failed_nodes)}")
  

Simulation Results

For a system with 10,000 nodes and a failure rate of 30%:

Report Summary

The SCIMS effectively integrates grid computing and gossip protocols to ensure efficient resource utilization and failure resilience. Trade-offs include increased communication overhead in gossip but reduced failure detection time compared to traditional methods.

By using SWIM for membership updates and HTCondor for task scheduling, the system achieves scalability and robustness, essential for real-world applications in smart city infrastructure.

Encouragement

Learning comes from doing! If you attempted the assignment and then reviewed the solution, you’re on the right path. Reflect on the differences between your approach and the solution to deepen your understanding. Keep practicing and building your problem-solving skills!