Markov Decision Processes (MDP)

1. Prerequisites

Before diving into Markov Decision Processes (MDPs), understanding the following concepts is crucial:

1.1 Probability Theory

1.2 Linear Algebra

1.3 Dynamic Programming

1.4 Reinforcement Learning Basics

2. What is a Markov Decision Process (MDP)?

MDPs provide a mathematical framework for decision-making in stochastic environments where outcomes are partly random and partly under the control of a decision-maker.

2.1 Components of an MDP

An MDP is defined by the tuple \( (S, A, P, R, \gamma) \): a set of states \( S \), a set of actions \( A \), transition probabilities \( P(s'|s,a) \), a reward function \( R(s,a,s') \), and a discount factor \( \gamma \in [0, 1) \).

2.2 Bellman Equations

The value of a state is recursively defined using the Bellman equation:

$$ V(s) = \max_{a} \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V(s')] $$

This equation helps compute the optimal value function.
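
Once the optimal value function \( V^*(s) \) is known, an optimal policy can be recovered by acting greedily with respect to it; this standard extraction step is stated here for completeness:

$$ \pi^*(s) = \arg\max_{a} \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V^*(s')] $$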

2.3 Solving MDPs

3. Why Does This Algorithm Exist?

MDPs exist to model sequential decision-making problems where outcomes are uncertain and rewards are delayed. Key applications include:

3.1 Robotics

3.2 Finance

3.3 Healthcare

3.4 Reinforcement Learning

4. When Should You Use It?

MDPs are ideal for situations where:

4.1 Sequential Decision-Making is Required

4.2 Uncertainty in Outcomes Exists

4.3 Long-Term Reward Optimization is Necessary

4.4 Dynamic Environments Need to be Modeled

5. How Does It Compare to Alternatives?

5.1 Strengths

5.2 Weaknesses

5.3 Comparison with Alternatives

Method | Strength | Weakness
MDP | Handles uncertainty, optimal policy guarantee | Computationally expensive
Heuristic Methods | Fast, works well for specific tasks | No guarantee of optimality
Reinforcement Learning (Q-learning) | Can learn without a known model | Requires large amounts of data
Monte Carlo Methods | Useful for model-free environments | Slow convergence, high variance
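
To make the comparison with Q-learning concrete: a single tabular Q-learning update needs only an observed transition, not the transition model. The snippet below is a minimal sketch; the Q-table layout (a dict keyed by (state, action) tuples) and the learning rate alpha are illustrative assumptions, not part of this page's implementation.


def q_learning_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    # Model-free temporal-difference update: only the sampled transition
    # (state, action, reward, next_state) is required, never P(s'|s,a).
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])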

6. Basic Implementation

Below is a basic implementation of a Markov Decision Process (MDP) using Python. The implementation defines the states, actions, transition probabilities, rewards, and discount factor, and then solves for the optimal value function with Value Iteration.


import numpy as np

# Define MDP components
states = ['S1', 'S2', 'S3']
actions = ['A1', 'A2']
transition_prob = {
    'S1': {'A1': {'S1': 0.5, 'S2': 0.5}, 'A2': {'S2': 1.0}},
    'S2': {'A1': {'S1': 0.2, 'S3': 0.8}, 'A2': {'S3': 1.0}},
    'S3': {'A1': {'S3': 1.0}, 'A2': {'S3': 1.0}}
}
rewards = {
    'S1': {'A1': 5, 'A2': 10},
    'S2': {'A1': -1, 'A2': 2},
    'S3': {'A1': 0, 'A2': 0}
}

gamma = 0.9  # Discount factor
threshold = 0.001  # Convergence threshold

# Initialize value function
value_function = {s: 0 for s in states}

def value_iteration():
    while True:
        delta = 0
        new_value_function = value_function.copy()

        for state in states:
            max_value = float('-inf')
            for action in actions:
                expected_value = sum(
                    transition_prob[state][action][next_state] * (rewards[state][action] + gamma * value_function[next_state])
                    for next_state in transition_prob[state][action]
                )
                max_value = max(max_value, expected_value)
            
            new_value_function[state] = max_value
            delta = max(delta, abs(value_function[state] - max_value))

        value_function.update(new_value_function)

        if delta < threshold:
            break

    return value_function

# Run Value Iteration
optimal_values = value_iteration()

# Print the optimal value function
print("Optimal Value Function:")
for state, value in optimal_values.items():
    print(f"V({state}) = {value:.4f}")
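
The value function alone does not tell the agent what to do; a greedy policy can be read off by a one-step lookahead over the converged values. The helper below is a minimal sketch that reuses the dictionaries defined above (the function name extract_policy is illustrative):


def extract_policy(value_function):
    # For each state, pick the action with the highest one-step lookahead value.
    policy = {}
    for state in states:
        best_action, best_value = None, float('-inf')
        for action in actions:
            expected_value = sum(
                transition_prob[state][action][next_state] *
                (rewards[state][action] + gamma * value_function[next_state])
                for next_state in transition_prob[state][action]
            )
            if expected_value > best_value:
                best_action, best_value = action, expected_value
        policy[state] = best_action
    return policy

print("Greedy policy:", extract_policy(optimal_values))

For this example the greedy choice is A2 in both S1 and S2 (S3 is absorbing, so its action is irrelevant).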

7. Dry Run: Step-by-Step Execution

7.1 Initial State

At the beginning, all state values are initialized to 0:


V(S1) = 0.0000
V(S2) = 0.0000
V(S3) = 0.0000

7.2 First Iteration

In the first sweep, each state receives a Bellman backup. For S1, action A1 gives \( 0.5(5 + 0.9 \cdot 0) + 0.5(5 + 0.9 \cdot 0) = 5 \), while A2 gives \( 1.0(10 + 0.9 \cdot 0) = 10 \), so V(S1) = 10. Similar calculations for S2 and S3 give V(S2) = 2 and V(S3) = 0.

7.3 Intermediate Iterations

Each iteration updates the value function with a synchronous Bellman backup. For this small MDP the values settle within a few sweeps:


Iteration 1:
V(S1) = 10.0000
V(S2) = 2.0000
V(S3) = 0.0000

Iteration 2:
V(S1) = 11.8000
V(S2) = 2.0000
V(S3) = 0.0000

Iteration 3:
V(S1) = 11.8000
V(S2) = 2.0000
V(S3) = 0.0000

Nothing changes in the third sweep, so delta = 0 < 0.001 and the loop terminates.

7.4 Final Converged Values

After three iterations, the final value function is obtained:


V(S1) = 11.80
V(S2) = 2.00
V(S3) = 0.00

7.5 Interpretation

These values are the fixed point of the Bellman equation: V(S3) = 0 because S3 is absorbing with zero reward, V(S2) = 2 because its best action A2 yields reward 2 and moves to S3, and \( V(S1) = 10 + 0.9 \cdot 2 = 11.8 \) because its best action A2 yields reward 10 and moves to S2.

8. Time & Space Complexity Analysis

Markov Decision Processes (MDPs) are typically solved using algorithms like Value Iteration and Policy Iteration. The time and space complexity of these approaches depends on the number of states \( S \), the number of actions \( A \), and the number of iterations \( T \) needed for convergence.

8.1 Value Iteration Complexity

8.2 Policy Iteration Complexity

8.3 Comparison of Complexity

Algorithm | Time Complexity | Best Use Case
Value Iteration | \( O(S^2 A T) \) | Small state spaces, faster convergence required
Policy Iteration | \( O(S^3 A) \) | Larger state spaces, better stability
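
Since Policy Iteration is compared above but not shown elsewhere on this page, the sketch below gives a minimal version in the same dictionary-based representation used in Section 6. It uses iterative policy evaluation rather than solving the linear system exactly; the function name and the eval_threshold parameter are assumptions for illustration.


def policy_iteration(states, actions, transition_prob, rewards, gamma=0.9, eval_threshold=0.001):
    # Start from an arbitrary policy and alternate evaluation and improvement.
    policy = {s: actions[0] for s in states}
    value_function = {s: 0.0 for s in states}

    while True:
        # Policy evaluation: repeatedly apply the Bellman expectation backup
        # for the current policy until the values stop changing.
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                new_v = sum(p * (rewards[s][a] + gamma * value_function[s2])
                            for s2, p in transition_prob[s][a].items())
                delta = max(delta, abs(new_v - value_function[s]))
                value_function[s] = new_v
            if delta < eval_threshold:
                break

        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in states:
            best_a = max(actions, key=lambda a: sum(
                p * (rewards[s][a] + gamma * value_function[s2])
                for s2, p in transition_prob[s][a].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False

        if stable:
            return policy, value_function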

9. Space Complexity Analysis

MDPs require storage for the transition model \( P(s'|s,a) \) with up to \( S^2 A \) entries, the reward function with \( O(S A) \) entries, and the value function with \( O(S) \) entries.

Thus, the total space complexity is \( O(S^2 A) \) in the worst case, dominated by the transition model.

9.1 Space Consumption vs. Input Size

10. Trade-offs in MDP Algorithms

MDP solutions involve trade-offs between time, space, and accuracy.

10.1 Computational Trade-offs

10.2 Space vs. Accuracy

10.3 Discount Factor \( \gamma \)

10.4 Alternatives for Scalability

11. Optimizations & Variants

MDPs can be computationally expensive, but optimizations and algorithmic variants can significantly improve efficiency.

11.1 Common Optimizations

11.1.1 Adaptive Value Iteration
11.1.2 Prioritized Sweeping
11.1.3 Asynchronous Value Iteration
11.1.4 Function Approximation

11.2 Variants of MDP Algorithms

11.2.1 Partially Observable MDP (POMDP)
11.2.2 Average Reward MDP
11.2.3 Hierarchical MDP (H-MDP)
11.2.4 Factored MDP

12. Iterative vs. Recursive Implementations

MDP algorithms can be implemented iteratively or recursively, each with its trade-offs.

12.1 Iterative Approach

The standard approach in Value Iteration and Policy Iteration follows an iterative loop:


def value_iteration(states, actions, transition_prob, rewards, gamma=0.9, threshold=0.001):
    value_function = {s: 0 for s in states}

    while True:
        delta = 0
        new_value_function = value_function.copy()

        for state in states:
            max_value = float('-inf')
            for action in actions:
                expected_value = sum(
                    transition_prob[state][action][next_state] * 
                    (rewards[state][action] + gamma * value_function[next_state])
                    for next_state in transition_prob[state][action]
                )
                max_value = max(max_value, expected_value)

            new_value_function[state] = max_value
            delta = max(delta, abs(value_function[state] - max_value))

        value_function.update(new_value_function)

        if delta < threshold:
            break

    return value_function

Pros of Iterative Approach

Cons of Iterative Approach

12.2 Recursive Approach

Recursion offers an alternative way to compute value functions; a practical recursive version truncates the recursion at a finite horizon and memoizes intermediate results:


def value_iteration_recursive(state, actions, transition_prob, rewards, gamma=0.9, depth=100, memo=None):
    # Depth-limited recursion with memoization: truncating at a horizon where
    # gamma**depth is negligible approximates the infinite-horizon value, and the
    # (state, depth) memo keeps the recursion polynomial instead of exponential.
    if memo is None:
        memo = {}
    if depth == 0:
        return 0.0
    if (state, depth) in memo:
        return memo[(state, depth)]

    max_value = float('-inf')
    for action in actions:
        expected_value = sum(
            transition_prob[state][action][next_state] *
            (rewards[state][action] + gamma *
             value_iteration_recursive(next_state, actions, transition_prob, rewards, gamma, depth - 1, memo))
            for next_state in transition_prob[state][action]
        )
        max_value = max(max_value, expected_value)

    memo[(state, depth)] = max_value
    return max_value

Pros of Recursive Approach

Cons of Recursive Approach

12.3 Iterative vs. Recursive Comparison

Method | Time Complexity | Space Complexity | Best Use Case
Iterative | \( O(S^2 A T) \) | \( O(S) \) | Large state spaces, stable convergence
Recursive | \( O(S^2 A) \) (assuming memoization) | \( O(S) \) (stack depth) | Small state spaces, ease of implementation

13. Edge Cases & Failure Handling

MDPs can fail or behave unexpectedly in various scenarios. Recognizing edge cases and handling failures is crucial.

13.1 Common Pitfalls & Edge Cases

13.1.1 Infinite Loops in Value Iteration
13.1.2 Zero Transition Probability
13.1.3 Negative Rewards & Infinite Loops
13.1.4 High Discount Factor (\( \gamma \)) Impact
13.1.5 Unreachable Terminal States

14. Test Cases to Verify Correctness

To ensure correctness, MDP implementations should be tested with various input scenarios.

14.1 Basic Test Cases

14.1.1 Test Convergence

Ensure value iteration terminates within a reasonable number of iterations.


def test_convergence():
    states = ['S1', 'S2']
    actions = ['A1']
    transition_prob = {
        'S1': {'A1': {'S2': 1.0}},
        'S2': {'A1': {'S1': 1.0}}
    }
    rewards = {'S1': {'A1': 1}, 'S2': {'A1': 2}}
    
    optimal_values = value_iteration(states, actions, transition_prob, rewards)
    # Fixed point: V(S1) = 1 + 0.9*V(S2) and V(S2) = 2 + 0.9*V(S1), so V(S1) = 2.8 / 0.19 ≈ 14.74
    assert abs(optimal_values['S1'] - 14.74) < 0.1, "Convergence failed"

14.1.2 Test Unreachable State Handling

The solver should handle absorbing states correctly: once the agent reaches S2 it never leaves, so V(S2) is the discounted sum of its own reward.


def test_unreachable_state():
    states = ['S1', 'S2']
    actions = ['A1']
    transition_prob = {
        'S1': {'A1': {'S2': 1.0}},
        'S2': {'A1': {'S2': 1.0}}  # S2 is an absorbing state
    }
    rewards = {'S1': {'A1': 1}, 'S2': {'A1': 2}}

    optimal_values = value_iteration(states, actions, transition_prob, rewards)
    # Absorbing state with reward 2 each step: V(S2) = 2 / (1 - 0.9) = 20
    assert abs(optimal_values['S2'] - 20) < 0.1, "Absorbing state value miscomputed"

14.1.3 Test Large State Space

Ensure performance does not degrade in large state spaces.


def test_large_state_space():
    states = [f"S{i}" for i in range(100)]
    actions = ['A1']
    transition_prob = {s: {'A1': {s: 1.0}} for s in states}
    rewards = {s: {'A1': i} for i, s in enumerate(states)}

    optimal_values = value_iteration(states, actions, transition_prob, rewards)
    # Each state loops onto itself with reward i, so V(S99) = 99 / (1 - 0.9) = 990
    assert abs(optimal_values['S99'] - 990) < 0.1, "Large state space failed"

15. Real-World Failure Scenarios

15.1 Reinforcement Learning with Poor Reward Shaping

15.2 Stochastic Environments with Unmodeled Dynamics

15.3 High Computational Costs in Large-Scale Problems

15.4 Policy Instability in Real-Time Systems

16. Real-World Applications & Industry Use Cases

Markov Decision Processes (MDPs) are widely used in various industries to solve sequential decision-making problems under uncertainty.

16.1 Robotics & Automation

16.2 Finance & Economics

16.3 Healthcare

16.4 Artificial Intelligence & Gaming

16.5 Traffic Management & Smart Cities

17. Open-Source Implementations

Several open-source libraries and frameworks implement MDPs efficiently:

17.1 MDPToolbox (Python)

17.2 OpenAI Gym

17.3 RLlib (Ray Reinforcement Learning)

18. Practical Project: AI-Based Traffic Signal Control

This project simulates a smart traffic light system using MDPs.

18.1 Problem Statement

Design an MDP-based system to optimize traffic signal timing at an intersection, minimizing wait times.

18.2 State Space

18.3 Actions

18.4 Reward Function

18.5 Python Implementation


import numpy as np

# Define MDP components
states = ['LowTraffic', 'MediumTraffic', 'HighTraffic']
actions = ['ShortGreen', 'MediumGreen', 'LongGreen']

# Transition probabilities
transition_prob = {
    'LowTraffic': {'ShortGreen': {'LowTraffic': 0.8, 'MediumTraffic': 0.2},
                   'MediumGreen': {'LowTraffic': 0.9, 'MediumTraffic': 0.1},
                   'LongGreen': {'LowTraffic': 1.0}},
    'MediumTraffic': {'ShortGreen': {'MediumTraffic': 0.7, 'HighTraffic': 0.3},
                      'MediumGreen': {'MediumTraffic': 0.6, 'LowTraffic': 0.4},
                      'LongGreen': {'LowTraffic': 1.0}},
    'HighTraffic': {'ShortGreen': {'HighTraffic': 0.9, 'MediumTraffic': 0.1},
                    'MediumGreen': {'MediumTraffic': 0.8, 'LowTraffic': 0.2},
                    'LongGreen': {'LowTraffic': 1.0}}
}

# Reward function
rewards = {
    'LowTraffic': {'ShortGreen': 5, 'MediumGreen': 10, 'LongGreen': 15},
    'MediumTraffic': {'ShortGreen': -5, 'MediumGreen': 5, 'LongGreen': 10},
    'HighTraffic': {'ShortGreen': -10, 'MediumGreen': -5, 'LongGreen': 5}
}

gamma = 0.9  # Discount factor
threshold = 0.001  # Convergence threshold

# Initialize value function
value_function = {s: 0 for s in states}

def value_iteration():
    while True:
        delta = 0
        new_value_function = value_function.copy()

        for state in states:
            max_value = float('-inf')
            for action in actions:
                expected_value = sum(
                    transition_prob[state][action][next_state] * 
                    (rewards[state][action] + gamma * value_function[next_state])
                    for next_state in transition_prob[state][action]
                )
                max_value = max(max_value, expected_value)

            new_value_function[state] = max_value
            delta = max(delta, abs(value_function[state] - max_value))

        value_function.update(new_value_function)

        if delta < threshold:
            break

    return value_function

# Run Value Iteration
optimal_values = value_iteration()

# Print the optimal value function
print("Optimal value of each traffic state:")
for state, value in optimal_values.items():
    print(f"V({state}) = {value:.4f}")
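
The loop above prints state values rather than a timing policy. One possible follow-up step, not part of the original listing, is to extract the greedy green-phase choice for each traffic state by a one-step lookahead over the converged values:


# Derive the recommended green-phase duration for each traffic state (sketch).
policy = {}
for state in states:
    best_action = max(actions, key=lambda a: sum(
        transition_prob[state][a][next_state] *
        (rewards[state][a] + gamma * optimal_values[next_state])
        for next_state in transition_prob[state][a]))
    policy[state] = best_action

print("Recommended action per traffic state:")
for state, action in policy.items():
    print(f"{state}: {action}")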

18.6 Expected Outcome

18.7 Extensions

19. Competitive Programming & System Design Integration

19.1 Using MDP in Competitive Programming

MDPs are useful in coding contests for problems that involve decision-making over time with uncertain outcomes; a representative example is probabilistic pathfinding, shown below.

Example Problem: Probabilistic Pathfinding

Given an \( N \times N \) grid where each cell has a probability of being blocked, find the optimal path to reach the destination with the highest probability of success.
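
One way to model this as an MDP-style dynamic program is sketched below, under the following assumptions (which are illustrative, not part of the problem statement): moves between adjacent cells are deterministic, a move into a cell fails with that cell's block probability, and the value of a cell is the best achievable probability of reaching the goal from it.


def best_success_probability(block_prob, start, goal, sweeps=None):
    # block_prob[r][c] is the probability that cell (r, c) is blocked.
    n = len(block_prob)
    sweeps = sweeps or n * n               # enough sweeps for an n x n grid to settle
    V = [[0.0] * n for _ in range(n)]
    V[goal[0]][goal[1]] = 1.0              # reaching the goal counts as success

    for _ in range(sweeps):
        for r in range(n):
            for c in range(n):
                if (r, c) == goal:
                    continue
                best = 0.0
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < n and 0 <= nc < n:
                        # Bellman-style backup: P(move succeeds) * value of the next cell
                        best = max(best, (1.0 - block_prob[nr][nc]) * V[nr][nc])
                V[r][c] = best

    return V[start[0]][start[1]]

# Illustrative 3 x 3 grid of block probabilities
grid = [[0.0, 0.1, 0.3],
        [0.2, 0.5, 0.1],
        [0.4, 0.2, 0.0]]
print(best_success_probability(grid, (0, 0), (2, 2)))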

19.2 System Design Integration

MDPs play a crucial role in designing large-scale systems where long-term decision-making is required; a representative case is dynamic pricing, described below.

Example System Design Problem

Design a dynamic pricing model for an e-commerce platform using an MDP that adjusts prices over time to maximize long-term revenue under uncertain demand.

20. Assignments

20.1 Solve at Least 10 Problems Using MDP

Practice solving problems related to Markov Decision Processes. Recommended problems:

  1. Grid Navigation with Obstacles (Optimal Path Planning)
  2. Robot Movement in a Probabilistic Environment
  3. Game Strategy Optimization (Chess-like Move Selection)
  4. Stock Trading Policy Optimization
  5. Warehouse Inventory Management
  6. Optimal Stopping Problems (Casino Betting Strategies)
  7. Job Scheduling in Cloud Computing
  8. Traffic Signal Optimization
  9. Dynamic Pricing in E-commerce
  10. AI Learning to Play a Simple Game

20.2 Use MDP in a System Design Problem

Apply MDP principles to design a real-world system, such as one of the applications discussed above: traffic signal control, dynamic pricing, or job scheduling in cloud computing.

20.3 Implement MDP Under Time Constraints

Simulate a competitive programming environment: implement Value Iteration from scratch under a strict time limit and verify it against the test cases in Section 14.

Bonus: Compare execution time between Value Iteration and Policy Iteration for large state spaces.
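
A minimal timing harness for the bonus comparison might look like the sketch below. It assumes the value_iteration function from Section 12.1 and a policy_iteration implementation (for example the sketch in Section 8.3) are available in the same file; the random MDP generator, its sizes, and the seed are illustrative choices.


import random
import time

def random_mdp(num_states=200, num_actions=4, seed=0):
    # Build a random dense MDP in the dictionary format used throughout this page.
    rng = random.Random(seed)
    states = [f"S{i}" for i in range(num_states)]
    actions = [f"A{j}" for j in range(num_actions)]
    transition_prob, rewards = {}, {}
    for s in states:
        transition_prob[s], rewards[s] = {}, {}
        for a in actions:
            weights = [rng.random() for _ in states]
            total = sum(weights)
            transition_prob[s][a] = {s2: w / total for s2, w in zip(states, weights)}
            rewards[s][a] = rng.uniform(-1, 1)
    return states, actions, transition_prob, rewards

states, actions, P, R = random_mdp()

start = time.perf_counter()
value_iteration(states, actions, P, R)
print(f"Value Iteration:  {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
policy_iteration(states, actions, P, R)
print(f"Policy Iteration: {time.perf_counter() - start:.3f}s")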