1. Prerequisites
Before diving into Markov Decision Processes (MDPs), understanding the following concepts is crucial:
1.1 Probability Theory
- Random Variables: Variables that take different values based on probability distributions.
- Conditional Probability: The probability of an event occurring given another event has occurred.
- Markov Property: The future state depends only on the current state, not past states.
1.2 Linear Algebra
- Matrices and Vectors: Representing state transitions and policies.
- Eigenvalues and Eigenvectors: Useful for understanding stability in iterative processes.
1.3 Dynamic Programming
- Bellman Equation: A recursive formulation to find optimal solutions in decision-making problems.
- Value and Policy Iteration: Fundamental iterative approaches in MDPs.
1.4 Reinforcement Learning Basics
- Agent-Environment Interaction: Understanding how an agent interacts with an environment over time.
- Rewards and Policies: Mechanisms governing decision-making in sequential problems.
2. What is a Markov Decision Process (MDP)?
MDPs provide a mathematical framework for decision-making in stochastic environments where outcomes are partly random and partly under the control of a decision-maker.
2.1 Components of an MDP
- States (S): The set of all possible situations the system can be in.
- Actions (A): The choices available to an agent at any given state.
- Transition Probability (P): The probability of moving from one state to another given an action.
- Reward Function (R): The immediate feedback received after taking an action.
- Policy (π): A strategy mapping states to actions.
- Discount Factor (γ): A value between 0 and 1 that determines the importance of future rewards.
2.2 Bellman Equations
The optimal value of a state is defined recursively by the Bellman optimality equation:
$$ V(s) = \max_{a} \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V(s')] $$
This equation helps compute the optimal value function.
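Writing the bracketed one-step backup as an action-value function \( Q(s,a) \) makes the greedy policy explicit:
$$ Q(s,a) = \sum_{s'} P(s'|s,a)\,[R(s,a,s') + \gamma V(s')], \qquad V(s) = \max_{a} Q(s,a), \qquad \pi^*(s) = \arg\max_{a} Q(s,a) $$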
2.3 Solving MDPs
- Value Iteration: Iteratively updating state values using Bellman updates.
- Policy Iteration: Alternating between policy evaluation and improvement.
3. Why Does This Algorithm Exist?
MDPs exist to model sequential decision-making problems where outcomes are uncertain and rewards are delayed. Key applications include:
3.1 Robotics
- Path planning for autonomous robots in uncertain environments.
- Adaptive control systems for robotic arms.
3.2 Finance
- Portfolio optimization in uncertain markets.
- Risk management and investment decision-making.
3.3 Healthcare
- Optimizing treatment plans for chronic diseases.
- Decision support in clinical trials and drug scheduling.
3.4 Reinforcement Learning
- Used as the foundation for AI agents in games and autonomous systems.
- Training models in real-time dynamic environments like self-driving cars.
4. When Should You Use It?
MDPs are ideal for situations where:
4.1 Sequential Decision-Making is Required
- Problems involve a sequence of actions over time.
- Each decision impacts future possibilities.
4.2 Uncertainty in Outcomes Exists
- Actions lead to probabilistic state transitions.
- Full knowledge of future states is unavailable.
4.3 Long-Term Reward Optimization is Necessary
- Delayed rewards impact decision-making.
- The goal is to maximize cumulative rewards over time.
4.4 Dynamic Environments Need to be Modeled
- Environments where conditions evolve over time.
- Adapting policies in response to changes is crucial.
5. How Does It Compare to Alternatives?
5.1 Strengths
- Mathematically Rigorous: Provides a formal framework for decision-making under uncertainty.
- Guaranteed Convergence: For discounted, finite MDPs, value and policy iteration converge to the optimal value function.
- Handles Stochasticity Well: Accounts for uncertain state transitions naturally.
5.2 Weaknesses
- Computationally Expensive: Solving large-scale MDPs requires significant memory and computation.
- Requires a Known Transition Model: Not always feasible in real-world problems where the model is unknown.
- Limited Scalability: Struggles with high-dimensional state-action spaces (curse of dimensionality).
5.3 Comparison with Alternatives
| Method | Strength | Weakness |
|---|---|---|
| MDP | Handles uncertainty, optimal-policy guarantees | Computationally expensive |
| Heuristic Methods | Fast, works well for specific tasks | No guarantee of optimality |
| Reinforcement Learning (Q-learning) | Can learn without a known model | Requires large amounts of data |
| Monte Carlo Methods | Useful for model-free environments | Slow convergence, high variance |
6. Basic Implementation
Below is a basic implementation of a Markov Decision Process (MDP) using Python. The implementation includes:
- Defining states, actions, rewards, and transition probabilities.
- Value iteration to compute the optimal value function (the greedy policy can then be read off these values; see the sketch in Section 7.5).
import numpy as np
# Define MDP components
states = ['S1', 'S2', 'S3']
actions = ['A1', 'A2']
transition_prob = {
'S1': {'A1': {'S1': 0.5, 'S2': 0.5}, 'A2': {'S2': 1.0}},
'S2': {'A1': {'S1': 0.2, 'S3': 0.8}, 'A2': {'S3': 1.0}},
'S3': {'A1': {'S3': 1.0}, 'A2': {'S3': 1.0}}
}
rewards = {
'S1': {'A1': 5, 'A2': 10},
'S2': {'A1': -1, 'A2': 2},
'S3': {'A1': 0, 'A2': 0}
}
gamma = 0.9 # Discount factor
threshold = 0.001 # Convergence threshold
# Initialize value function
value_function = {s: 0 for s in states}
def value_iteration():
while True:
delta = 0
new_value_function = value_function.copy()
for state in states:
max_value = float('-inf')
for action in actions:
expected_value = sum(
transition_prob[state][action][next_state] * (rewards[state][action] + gamma * value_function[next_state])
for next_state in transition_prob[state][action]
)
max_value = max(max_value, expected_value)
new_value_function[state] = max_value
delta = max(delta, abs(value_function[state] - max_value))
value_function.update(new_value_function)
if delta < threshold:
break
return value_function
# Run Value Iteration
optimal_values = value_iteration()
# Print the optimal value function
print("Optimal Value Function:")
for state, value in optimal_values.items():
print(f"V({state}) = {value:.4f}")
7. Dry Run: Step-by-Step Execution
7.1 Initial State
At the beginning, all state values are initialized to 0:
V(S1) = 0.0000
V(S2) = 0.0000
V(S3) = 0.0000
7.2 First Iteration
- For S1, evaluating action A1:
- Expected value: \( 0.5 \times (5 + 0.9 \times 0) + 0.5 \times (5 + 0.9 \times 0) = 5 \)
- For S1, evaluating action A2:
- Expected value: \( 1.0 \times (10 + 0.9 \times 0) = 10 \)
- Max value for S1: \( V(S1) = 10 \)
Similarly, S2 evaluates to \( \max(-1, 2) = 2 \) and S3 stays at 0.
7.3 Intermediate Iterations
Each iteration updates the value function based on the Bellman equation. The values converge over multiple iterations:
Iteration 1:
V(S1) = 10.0000
V(S2) = 2.0000
V(S3) = 0.0000
Iteration 2:
V(S1) = 11.8000
V(S2) = 2.0000
V(S3) = 0.0000
Iteration 3:
V(S1) = 11.8000
V(S2) = 2.0000
V(S3) = 0.0000
At iteration 3 no value changes, so delta = 0 < 0.001 and the loop terminates.
7.4 Final Converged Values
After three iterations, the value function has converged:
V(S1) = 11.80
V(S2) = 2.00
V(S3) = 0.00
7.5 Interpretation
- S1 has the highest value, meaning the best expected future rewards start from here.
- S3 has a value of 0 since no further rewards exist.
- The agent will choose actions that maximize these values, leading to an optimal policy.
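As a sketch of that last step, the greedy policy can be read off the converged values by taking, in each state, the action with the largest one-step backup. This reuses the states, actions, transition_prob, rewards, gamma, and optimal_values defined in Section 6; the helper name extract_policy is purely illustrative.
def extract_policy(optimal_values):
    # Greedy policy: in each state, pick the action whose one-step
    # Bellman backup against the converged values is largest.
    policy = {}
    for state in states:
        best_action, best_value = None, float('-inf')
        for action in actions:
            backup = sum(
                prob * (rewards[state][action] + gamma * optimal_values[next_state])
                for next_state, prob in transition_prob[state][action].items()
            )
            if backup > best_value:
                best_action, best_value = action, backup
        policy[state] = best_action
    return policy

print(extract_policy(optimal_values))  # expected: {'S1': 'A2', 'S2': 'A2', 'S3': 'A1'}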
8. Time & Space Complexity Analysis
Markov Decision Processes (MDPs) are typically solved using algorithms like Value Iteration and Policy Iteration. The time and space complexity of these approaches depend on:
- \( S \) - Number of states
- \( A \) - Number of actions per state
- \( T \) - Number of iterations until convergence
8.1 Value Iteration Complexity
- Each sweep updates every state's value using its transition probabilities, costing \( O(S^2 A) \); over \( T \) sweeps this gives \( O(S^2 A T) \).
- With sparse transitions (each state-action pair reaching only a constant number of successor states), a sweep drops to \( O(S A) \), for \( O(S A T) \) overall.
- The number of sweeps \( T \) needed for a given accuracy grows as the discount factor \( \gamma \) approaches 1 (roughly in proportion to \( 1/(1-\gamma) \)).
8.2 Policy Iteration Complexity
- Each iteration solves a linear system for policy evaluation, costing \( O(S^3) \), plus \( O(S^2 A) \) for the greedy improvement step.
- Policy iteration typically needs far fewer iterations than value iteration, but each iteration is more expensive.
- Evaluating the policy iteratively instead of exactly trades per-iteration accuracy for speed; a minimal sketch of the full algorithm follows below.
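The sketch below uses the same dictionary-based model format as Section 6. The function and parameter names (policy_iteration, eval_threshold) are illustrative, and policy evaluation is done with iterative sweeps rather than by solving the linear system exactly.
def policy_iteration(states, actions, transition_prob, rewards,
                     gamma=0.9, eval_threshold=1e-4):
    # Start from an arbitrary policy and alternate evaluation / improvement.
    policy = {s: actions[0] for s in states}
    values = {s: 0.0 for s in states}
    while True:
        # Policy evaluation (iterative sweeps instead of solving the linear system).
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                new_v = sum(p * (rewards[s][a] + gamma * values[s2])
                            for s2, p in transition_prob[s][a].items())
                delta = max(delta, abs(new_v - values[s]))
                values[s] = new_v
            if delta < eval_threshold:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in states:
            best_a = max(actions, key=lambda a: sum(
                p * (rewards[s][a] + gamma * values[s2])
                for s2, p in transition_prob[s][a].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, values
On the three-state example from Section 6 this returns the same greedy policy as the value-iteration sketch in Section 7.5.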
8.3 Comparison of Complexity
| Algorithm | Time Complexity | Best Use Case |
|---|---|---|
| Value Iteration | \( O(S^2 A T) \) | Cheap, simple sweeps; large or sparse models where many iterations are acceptable |
| Policy Iteration | \( O(S^3 + S^2 A) \) per iteration | Few iterations; moderate state spaces where exact policy evaluation is affordable |
9. Space Complexity Analysis
MDPs require storage for:
- State Values: \( O(S) \), as each state stores a value.
- Policy Storage: \( O(S) \), storing an action for each state.
- Transition Matrix: \( O(S^2 A) \), storing probabilities for state-action pairs.
- Reward Function: \( O(S A) \), mapping each (state, action) pair to a reward.
Thus, total space complexity is:
\( O(S^2 A) \) worst-case.
9.1 Space Consumption vs. Input Size
- For small MDPs: Space is manageable as state-action pairs are limited.
- For large MDPs: Memory can explode due to transition matrices growing as \( S^2 \).
- Solution: Use sparse representations or approximation methods (e.g., function approximation in reinforcement learning).
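To make the dense-versus-sparse gap concrete, a rough back-of-the-envelope sketch (the state and action counts and the 5-successor assumption are hypothetical):
import numpy as np

S, A = 10_000, 10
dense_bytes = S * A * S * np.dtype(np.float32).itemsize  # full |S| x |A| x |S| tensor
print(f"Dense transition tensor: {dense_bytes / 1e9:.1f} GB")  # ~4.0 GB

# The dict-of-dicts layout used throughout this article stores only the
# non-zero entries; with ~5 reachable successors per (state, action) pair,
# that is S * A * 5 = 500,000 entries instead of S * A * S = 10**9.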
10. Trade-offs in MDP Algorithms
MDP solutions involve trade-offs between time, space, and accuracy.
10.1 Computational Trade-offs
- Value Iteration: Each sweep is cheap, but many sweeps may be needed, especially as \( \gamma \) approaches 1.
- Policy Iteration: Far fewer iterations, but each iteration (exact policy evaluation) is costly.
- Approximate Methods: Reduce complexity but may sacrifice accuracy.
10.2 Space vs. Accuracy
- Storing full transition matrices improves accuracy but consumes excessive memory.
- Sparse matrix representations reduce space usage at the cost of slower lookups.
10.3 Discount Factor \( \gamma \)
- Higher \( \gamma \) (e.g., 0.99) prioritizes long-term rewards but increases computation time.
- Lower \( \gamma \) (e.g., 0.5) converges faster but may ignore distant rewards.
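A useful rule of thumb is that \( \gamma \) sets an effective planning horizon of roughly \( 1/(1-\gamma) \) steps, which makes this trade-off concrete:
$$ \gamma = 0.99 \;\Rightarrow\; \frac{1}{1-\gamma} = 100 \text{ steps}, \qquad \gamma = 0.5 \;\Rightarrow\; \frac{1}{1-\gamma} = 2 \text{ steps} $$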
10.4 Alternatives for Scalability
- Monte Carlo Methods: Require less memory but have high variance.
- Deep Q-Networks (DQN): Use neural networks to approximate MDP solutions.
- Hierarchical MDPs: Reduce complexity by decomposing large MDPs into smaller subproblems.
11. Optimizations & Variants
MDPs can be computationally expensive, but optimizations and algorithmic variants can significantly improve efficiency.
11.1 Common Optimizations
11.1.1 Adaptive Value Iteration
- Instead of updating all states in each iteration, update only states with significant changes.
- Reduces unnecessary computations and speeds up convergence.
11.1.2 Prioritized Sweeping
- Processes states with the highest estimated impact on policy first.
- Uses a priority queue to focus on the most relevant state transitions.
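A minimal model-based sketch of this idea, using a heap keyed by the magnitude of each state's Bellman error; the function name, the max_updates cap, and the re-queueing rule are illustrative choices rather than a canonical implementation.
import heapq

def prioritized_sweeping(states, actions, transition_prob, rewards,
                         gamma=0.9, threshold=1e-3, max_updates=10_000):
    # Back up the state with the largest Bellman error first; when a state's
    # value changes, re-queue its predecessors, whose backups may now be stale.
    values = {s: 0.0 for s in states}

    def backup(s):
        # One Bellman optimality backup for state s against the current values.
        return max(
            sum(p * (rewards[s][a] + gamma * values[s2])
                for s2, p in transition_prob[s][a].items())
            for a in actions)

    # Predecessor map: which states can transition into a given state.
    predecessors = {s: set() for s in states}
    for s in states:
        for a in actions:
            for s2 in transition_prob[s][a]:
                predecessors[s2].add(s)

    # Max-heap (negated priorities) seeded with every state's Bellman error.
    heap = [(-abs(backup(s) - values[s]), s) for s in states]
    heapq.heapify(heap)

    for _ in range(max_updates):
        if not heap:
            break
        neg_priority, s = heapq.heappop(heap)
        if -neg_priority < threshold:
            break  # largest remaining error is already below the threshold
        values[s] = backup(s)
        for pred in predecessors[s]:
            error = abs(backup(pred) - values[pred])
            if error > threshold:
                heapq.heappush(heap, (-error, pred))
    return values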
11.1.3 Asynchronous Value Iteration
- Updates values of states independently rather than synchronously.
- Reduces computational overhead while maintaining convergence guarantees.
11.1.4 Function Approximation
- For large MDPs, store value functions using neural networks instead of tables.
- Common in Deep Reinforcement Learning (DQN, PPO).
11.2 Variants of MDP Algorithms
11.2.1 Partially Observable MDP (POMDP)
- Used when the agent does not have full knowledge of the state.
- Maintains a belief state, which is a probability distribution over possible states.
- Example: Robotics with noisy sensors.
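The belief update itself is just Bayes' rule over the hidden state. A minimal sketch follows; the observation model obs_prob[next_state][observation] is an added assumption for illustration and is not part of the MDP components defined earlier.
def update_belief(belief, action, observation, transition_prob, obs_prob):
    # Bayes filter over the hidden state:
    #   b'(s') is proportional to O(observation | s') * sum_s P(s' | s, action) * b(s)
    new_belief = {}
    for s2 in belief:
        predicted = sum(b_s * transition_prob[s][action].get(s2, 0.0)
                        for s, b_s in belief.items())
        new_belief[s2] = obs_prob[s2][observation] * predicted
    total = sum(new_belief.values())
    # If the observation has zero probability under the model, keep the old belief.
    return {s: v / total for s, v in new_belief.items()} if total > 0 else belief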
11.2.2 Average Reward MDP
- Focuses on maximizing long-term average rewards instead of discounted rewards.
- Useful in scenarios where discounting future rewards is undesirable.
- Example: Continuous production systems in manufacturing.
11.2.3 Hierarchical MDP (H-MDP)
- Breaks large problems into smaller sub-MDPs for efficiency.
- Improves scalability in large-scale decision-making.
- Example: Multi-level task planning in AI.
11.2.4 Factored MDP
- Uses structured representations for state variables instead of flat state spaces.
- Reduces computation for high-dimensional state-action spaces.
- Example: Multi-agent systems in traffic management.
12. Iterative vs. Recursive Implementations
MDP algorithms can be implemented iteratively or recursively, each with its trade-offs.
12.1 Iterative Approach
The standard approach in Value Iteration and Policy Iteration follows an iterative loop:
def value_iteration(states, actions, transition_prob, rewards, gamma=0.9, threshold=0.001):
value_function = {s: 0 for s in states}
while True:
delta = 0
new_value_function = value_function.copy()
for state in states:
max_value = float('-inf')
for action in actions:
expected_value = sum(
transition_prob[state][action][next_state] *
(rewards[state][action] + gamma * value_function[next_state])
for next_state in transition_prob[state][action]
)
max_value = max(max_value, expected_value)
new_value_function[state] = max_value
delta = max(delta, abs(value_function[state] - max_value))
value_function.update(new_value_function)
if delta < threshold:
break
return value_function
Pros of Iterative Approach
- Efficient in memory usage (\(O(S)\)).
- Explicit control over stopping conditions.
- Can be optimized with heuristics (e.g., prioritized sweeping).
Cons of Iterative Approach
- Slower convergence for large MDPs.
- May require many iterations to reach an optimal policy.
12.2 Recursive Approach
A recursion can express the same Bellman backup directly. Because the MDP may contain cycles, the version below limits the recursion depth (a finite-horizon approximation) and memoizes on (state, depth):
def value_recursive(state, depth, actions, transition_prob, rewards, gamma=0.9, memo=None):
    # Depth-limited Bellman backup with memoization on (state, depth).
    # Without the depth cap, any cycle in the MDP would recurse forever.
    if memo is None:
        memo = {}
    if depth == 0:
        return 0.0
    if (state, depth) in memo:
        return memo[(state, depth)]
    max_value = float('-inf')
    for action in actions:
        expected_value = sum(
            prob * (rewards[state][action] +
                    gamma * value_recursive(next_state, depth - 1, actions,
                                            transition_prob, rewards, gamma, memo))
            for next_state, prob in transition_prob[state][action].items()
        )
        max_value = max(max_value, expected_value)
    memo[(state, depth)] = max_value
    return max_value
Pros of Recursive Approach
- Mirrors the mathematical recurrence closely, which can make it easier to read.
- With memoization it becomes a standard dynamic-programming solution for finite horizons.
Cons of Recursive Approach
- Call-stack depth grows with the horizon \( D \), and Python's recursion limit caps how deep it can go.
- Without memoization the same subproblems are recomputed exponentially often; without a depth cap, cycles never terminate.
12.3 Iterative vs. Recursive Comparison
| Method | Time Complexity | Space Complexity | Best Use Case |
|---|---|---|---|
| Iterative | \( O(S^2 A T) \) | \( O(S) \) | Large state spaces, stable convergence |
| Recursive (depth-limited, memoized) | \( O(S^2 A D) \) for recursion depth \( D \) | \( O(S D) \) memo plus \( O(D) \) stack | Small state spaces or short horizons, closeness to the mathematical recurrence |
13. Edge Cases & Failure Handling
MDPs can fail or behave unexpectedly in various scenarios. Recognizing edge cases and handling failures is crucial.
13.1 Common Pitfalls & Edge Cases
13.1.1 Infinite Loops in Value Iteration
- If the stopping threshold is too small or improperly set, the algorithm may run indefinitely.
- Solution: Set a reasonable threshold (e.g., \( 10^{-3} \) to \( 10^{-6} \)) and limit iterations.
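A sketch of such a guard, in the same dictionary-based style as the earlier code; the max_iterations cap and the in-place updates are illustrative choices.
def value_iteration_capped(states, actions, transition_prob, rewards,
                           gamma=0.9, threshold=1e-3, max_iterations=10_000):
    # Same Bellman sweeps as before, but with a hard cap on the number of
    # sweeps so that a threshold set too tight cannot loop forever.
    values = {s: 0.0 for s in states}
    for sweep in range(max_iterations):
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (rewards[s][a] + gamma * values[s2])
                    for s2, p in transition_prob[s][a].items())
                for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update, no copy
        if delta < threshold:
            return values
    raise RuntimeError(f"No convergence after {max_iterations} sweeps")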
13.1.2 Zero Transition Probability
- If a transition distribution assigns zero probability to every successor (or omits states entirely), parts of the state space become unreachable and their values never influence the solution.
- Solution: Ensure each \( P(\cdot \mid s, a) \) is a proper probability distribution (non-negative, summing to 1) and that every state of interest is reachable from the start state.
13.1.3 Negative Rewards & Infinite Loops
- Poorly shaped rewards can create cycles the agent prefers to exploit forever instead of reaching a terminal state (for example, an undiscounted positive-reward loop, or a terminal penalty that is worse than the cost of looping).
- Solution: Introduce absorbing states and discounting (\( \gamma < 1 \)) so that looping cannot accumulate unbounded value.
13.1.4 High Discount Factor (\( \gamma \)) Impact
- A high \( \gamma \) (e.g., 0.99) places excessive importance on long-term rewards, slowing convergence.
- Solution: Use a balanced discount factor (e.g., 0.9).
13.1.5 Unreachable Terminal States
- If certain terminal states have no path, the algorithm may keep looping in non-terminal states.
- Solution: Ensure all terminal states have feasible transitions.
14. Test Cases to Verify Correctness
To ensure correctness, MDP implementations should be tested with various input scenarios.
14.1 Basic Test Cases
14.1.1 Test Convergence
Ensure value iteration terminates within a reasonable number of iterations.
def test_convergence():
states = ['S1', 'S2']
actions = ['A1']
transition_prob = {
'S1': {'A1': {'S2': 1.0}},
'S2': {'A1': {'S1': 1.0}}
}
rewards = {'S1': {'A1': 1}, 'S2': {'A1': 2}}
optimal_values = value_iteration(states, actions, transition_prob, rewards)
    assert abs(optimal_values['S1'] - 2.8 / 0.19) < 0.1, "Convergence failed"  # V(S1) = (1 + 0.9*2) / (1 - 0.81)
14.1.2 Test Absorbing-State Handling
The value of an absorbing state with a constant per-step reward should converge to reward / (1 - gamma).
def test_absorbing_state():
states = ['S1', 'S2']
actions = ['A1']
transition_prob = {
'S1': {'A1': {'S2': 1.0}},
'S2': {'A1': {'S2': 1.0}} # S2 is an absorbing state
}
rewards = {'S1': {'A1': 1}, 'S2': {'A1': 2}}
optimal_values = value_iteration(states, actions, transition_prob, rewards)
    assert abs(optimal_values['S2'] - 20) < 0.1, "Absorbing state miscomputed"  # 2 / (1 - 0.9)
14.1.3 Test Large State Space
Ensure performance does not degrade in large state spaces.
def test_large_state_space():
states = [f"S{i}" for i in range(100)]
actions = ['A1']
transition_prob = {s: {'A1': {s: 1.0}} for s in states}
rewards = {s: {'A1': i} for i, s in enumerate(states)}
optimal_values = value_iteration(states, actions, transition_prob, rewards)
    assert abs(optimal_values['S99'] - 990) < 0.1, "Large state space failed"  # 99 / (1 - 0.9)
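A minimal way to run these checks without a test framework, assuming the three test functions and the parameterized value_iteration from Section 12.1 live in the same module:
if __name__ == "__main__":
    test_convergence()
    test_absorbing_state()
    test_large_state_space()
    print("All MDP test cases passed.")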
15. Real-World Failure Scenarios
15.1 Reinforcement Learning with Poor Reward Shaping
- Problem: If rewards are sparse, the agent may fail to learn a meaningful policy.
- Example: A robot learning to walk with rewards only at the final step may never reach it.
- Solution: Use reward shaping to provide intermediate rewards.
15.2 Stochastic Environments with Unmodeled Dynamics
- Problem: If real-world transition probabilities differ from those modeled, the policy may fail.
- Example: Self-driving cars using MDPs may struggle if road conditions are not accounted for.
- Solution: Use adaptive MDPs that update transition probabilities dynamically.
15.3 High Computational Costs in Large-Scale Problems
- Problem: MDPs with large state spaces become computationally intractable.
- Example: Multi-agent traffic systems modeling millions of vehicles.
- Solution: Use hierarchical or function approximation methods (e.g., Deep Q-Networks).
15.4 Policy Instability in Real-Time Systems
- Problem: If the MDP environment changes dynamically, policies may become invalid.
- Example: Dynamic pricing models in e-commerce failing due to sudden demand spikes.
- Solution: Implement online MDP learning with real-time adjustments.
16. Real-World Applications & Industry Use Cases
Markov Decision Processes (MDPs) are widely used in various industries to solve sequential decision-making problems under uncertainty.
16.1 Robotics & Automation
- Path Planning: Autonomous robots use MDPs for navigation in uncertain environments.
- Task Scheduling: Industrial robots determine the optimal sequence of actions to maximize efficiency.
16.2 Finance & Economics
- Portfolio Optimization: MDPs help investors determine the best strategy for buying/selling assets over time.
- Risk Management: Used in fraud detection, credit risk analysis, and algorithmic trading.
16.3 Healthcare
- Treatment Planning: MDPs optimize personalized treatment plans for chronic diseases.
- Drug Discovery: Helps model the impact of new drugs through clinical trials and simulations.
16.4 Artificial Intelligence & Gaming
- Reinforcement Learning (RL): Deep Q-learning uses MDPs as the foundation for AI training.
- Game AI: Video game NPCs use MDP-based strategies for decision-making.
16.5 Traffic Management & Smart Cities
- Adaptive Traffic Signals: MDPs optimize traffic light timings to reduce congestion.
- Autonomous Vehicles: Self-driving cars use MDPs for decision-making in dynamic environments.
17. Open-Source Implementations
Several open-source libraries and frameworks implement MDPs efficiently:
17.1 MDPToolbox (Python)
- Repository: MDPToolbox
- Features: Implements value iteration, policy iteration, and Q-learning.
17.2 OpenAI Gym
- Repository: OpenAI Gym
- Features: Provides reinforcement learning environments built on MDPs.
17.3 RLlib (Ray Reinforcement Learning)
- Repository: Ray RLlib
- Features: Scalable reinforcement learning framework using MDP principles.
18. Practical Project: AI-Based Traffic Signal Control
This project simulates a smart traffic light system using MDPs.
18.1 Problem Statement
Design an MDP-based system to optimize traffic signal timing at an intersection, minimizing wait times.
18.2 State Space
- State = {LowTraffic, MediumTraffic, HighTraffic}
18.3 Actions
- Actions = {ShortGreen, MediumGreen, LongGreen}
18.4 Reward Function
- Minimizing waiting time is rewarded.
- Heavy congestion receives penalties.
18.5 Python Implementation
import numpy as np
# Define MDP components
states = ['LowTraffic', 'MediumTraffic', 'HighTraffic']
actions = ['ShortGreen', 'MediumGreen', 'LongGreen']
# Transition probabilities
transition_prob = {
'LowTraffic': {'ShortGreen': {'LowTraffic': 0.8, 'MediumTraffic': 0.2},
'MediumGreen': {'LowTraffic': 0.9, 'MediumTraffic': 0.1},
'LongGreen': {'LowTraffic': 1.0}},
'MediumTraffic': {'ShortGreen': {'MediumTraffic': 0.7, 'HighTraffic': 0.3},
'MediumGreen': {'MediumTraffic': 0.6, 'LowTraffic': 0.4},
'LongGreen': {'LowTraffic': 1.0}},
'HighTraffic': {'ShortGreen': {'HighTraffic': 0.9, 'MediumTraffic': 0.1},
'MediumGreen': {'MediumTraffic': 0.8, 'LowTraffic': 0.2},
'LongGreen': {'LowTraffic': 1.0}}
}
# Reward function
rewards = {
'LowTraffic': {'ShortGreen': 5, 'MediumGreen': 10, 'LongGreen': 15},
'MediumTraffic': {'ShortGreen': -5, 'MediumGreen': 5, 'LongGreen': 10},
'HighTraffic': {'ShortGreen': -10, 'MediumGreen': -5, 'LongGreen': 5}
}
gamma = 0.9 # Discount factor
threshold = 0.001 # Convergence threshold
# Initialize value function
value_function = {s: 0 for s in states}
def value_iteration():
while True:
delta = 0
new_value_function = value_function.copy()
for state in states:
max_value = float('-inf')
for action in actions:
expected_value = sum(
transition_prob[state][action][next_state] *
(rewards[state][action] + gamma * value_function[next_state])
for next_state in transition_prob[state][action]
)
max_value = max(max_value, expected_value)
new_value_function[state] = max_value
delta = max(delta, abs(value_function[state] - max_value))
value_function.update(new_value_function)
if delta < threshold:
break
return value_function
# Run Value Iteration
optimal_values = value_iteration()
# Print the optimal value function
print("Optimal Traffic Signal Timing Policy:")
for state, value in optimal_values.items():
print(f"V({state}) = {value:.4f}")
18.6 Expected Outcome
- Acting greedily on the converged values (picking, for each traffic level, the green-light duration with the highest one-step backup) gives the recommended signal timing.
- The aim is to reduce congestion and improve traffic flow at the intersection.
18.7 Extensions
- Integrate real-time traffic data.
- Use Reinforcement Learning to dynamically update MDP parameters.
19. Competitive Programming & System Design Integration
19.1 Using MDP in Competitive Programming
MDPs are useful in coding contests for problems that involve decision-making over time with uncertain outcomes. Key areas where MDPs are applicable:
- Grid Navigation Problems: Finding the optimal path in a probabilistic environment.
- Game Theory: Modeling moves in strategy games.
- Probability-based Optimization: Making the best decision when facing uncertain outcomes.
- AI Strategy Games: AI learning optimal policies for turn-based games.
Example Problem: Probabilistic Pathfinding
Given an \( N \times N \) grid where each cell has a probability of being blocked, find the optimal path to reach the destination with the highest probability of success.
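One simple way to formalize this, assuming movement is restricted to right/down steps and cell blockages are independent, is to maximize the product of per-cell "free" probabilities along the path. That reduces to a deterministic dynamic program, the special case of an MDP with deterministic transitions; the grid values below are only an illustration.
def best_success_probability(p_free):
    # Maximize the product of "cell is free" probabilities along a monotone
    # (right/down) path from (0, 0) to (N-1, N-1). With deterministic moves,
    # value iteration collapses to a single dynamic-programming pass.
    n = len(p_free)
    best = [[0.0] * n for _ in range(n)]
    best[0][0] = p_free[0][0]
    for i in range(n):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            from_up = best[i - 1][j] if i > 0 else 0.0
            from_left = best[i][j - 1] if j > 0 else 0.0
            best[i][j] = p_free[i][j] * max(from_up, from_left)
    return best[n - 1][n - 1]

grid = [[1.0, 0.9, 0.8],
        [0.7, 0.9, 0.6],
        [1.0, 0.8, 0.9]]
print(best_success_probability(grid))  # ~0.5832 for this 3x3 example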
19.2 System Design Integration
MDPs play a crucial role in designing large-scale systems where long-term decision-making is required. Some real-world use cases:
- Ad Recommendation Systems: Selecting ads to display based on user interactions over time.
- Autonomous Driving: Deciding lane changes, acceleration, and braking under uncertain traffic conditions.
- Cloud Resource Allocation: Allocating computing resources dynamically to minimize costs and maximize performance.
- Customer Retention Models: Predicting customer behavior and adjusting marketing strategies dynamically.
Example System Design Problem
Design a dynamic pricing model for an e-commerce platform using MDP. The system should:
- Determine the best price adjustments based on demand and competition.
- Maximize long-term revenue instead of immediate profits.
- Adapt dynamically as customer preferences evolve.
20. Assignments
20.1 Solve at Least 10 Problems Using MDP
Practice solving problems related to Markov Decision Processes. Recommended problems:
- Grid Navigation with Obstacles (Optimal Path Planning)
- Robot Movement in a Probabilistic Environment
- Game Strategy Optimization (Chess-like Move Selection)
- Stock Trading Policy Optimization
- Warehouse Inventory Management
- Optimal Stopping Problems (Casino Betting Strategies)
- Job Scheduling in Cloud Computing
- Traffic Signal Optimization
- Dynamic Pricing in E-commerce
- AI Learning to Play a Simple Game
20.2 Use MDP in a System Design Problem
Apply MDP principles to design a real-world system. Choose one of the following:
- Develop a chatbot that learns user preferences over time.
- Optimize ad display timing to maximize user engagement.
- Design a reinforcement learning model for self-driving cars.
- Implement a risk management system for investment portfolios.
20.3 Implement MDP Under Time Constraints
Simulate a competitive programming environment:
- Pick a problem from the list above.
- Set a strict time limit (e.g., 1 hour).
- Implement a working solution efficiently.
- Optimize for both accuracy and execution speed.
Bonus: Compare execution time between Value Iteration and Policy Iteration for large state spaces.