1. Prerequisites
Before diving into Markov Decision Processes (MDPs), understanding the following concepts is crucial:
1.1 Probability Theory
- Random Variables: Variables that take different values based on probability distributions.
- Conditional Probability: The probability of an event occurring given another event has occurred.
- Markov Property: The future state depends only on the current state, not past states.
1.2 Linear Algebra
- Matrices and Vectors: Representing state transitions and policies.
- Eigenvalues and Eigenvectors: Useful for understanding stability in iterative processes.
1.3 Dynamic Programming
- Bellman Equation: A recursive formulation to find optimal solutions in decision-making problems.
- Value and Policy Iteration: Fundamental iterative approaches in MDPs.
1.4 Reinforcement Learning Basics
- Agent-Environment Interaction: Understanding how an agent interacts with an environment over time.
- Rewards and Policies: Mechanisms governing decision-making in sequential problems.
2. What is a Markov Decision Process (MDP)?
MDPs provide a mathematical framework for decision-making in stochastic environments where outcomes are partly random and partly under the control of a decision-maker.
2.1 Components of an MDP
- States (S): The set of all possible situations the system can be in.
- Actions (A): The choices available to an agent at any given state.
- Transition Probability (P): The probability of moving from one state to another given an action.
- Reward Function (R): The immediate feedback received after taking an action.
- Policy (π): A strategy mapping states to actions.
- Discount Factor (γ): A value between 0 and 1 that determines the importance of future rewards.
2.2 Bellman Equations
The optimal value of a state is defined recursively by the Bellman optimality equation:
$$ V(s) = \max_{a} \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V(s')] $$
This equation helps compute the optimal value function.
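Writing the bracketed one-step backup as an action-value function \( Q(s,a) \) makes the greedy policy explicit:
$$ Q(s,a) = \sum_{s'} P(s'|s,a)\,[R(s,a,s') + \gamma V(s')], \qquad V(s) = \max_{a} Q(s,a), \qquad \pi^*(s) = \arg\max_{a} Q(s,a) $$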
2.3 Solving MDPs
- Value Iteration: Iteratively updating state values using Bellman updates.
- Policy Iteration: Alternating between policy evaluation and improvement.
3. Why Does This Algorithm Exist?
MDPs exist to model sequential decision-making problems where outcomes are uncertain and rewards are delayed. Key applications include:
3.1 Robotics
- Path planning for autonomous robots in uncertain environments.
- Adaptive control systems for robotic arms.
3.2 Finance
- Portfolio optimization in uncertain markets.
- Risk management and investment decision-making.
3.3 Healthcare
- Optimizing treatment plans for chronic diseases.
- Decision support in clinical trials and drug scheduling.
3.4 Reinforcement Learning
- Used as the foundation for AI agents in games and autonomous systems.
- Training models in real-time dynamic environments like self-driving cars.
4. When Should You Use It?
MDPs are ideal for situations where:
4.1 Sequential Decision-Making is Required
- Problems involve a sequence of actions over time.
- Each decision impacts future possibilities.
4.2 Uncertainty in Outcomes Exists
- Actions lead to probabilistic state transitions.
- Full knowledge of future states is unavailable.
4.3 Long-Term Reward Optimization is Necessary
- Delayed rewards impact decision-making.
- The goal is to maximize cumulative rewards over time.
4.4 Dynamic Environments Need to be Modeled
- Environments where conditions evolve over time.
- Adapting policies in response to changes is crucial.
5. How Does It Compare to Alternatives?
5.1 Strengths
- Mathematically Rigorous: Provides a formal framework for decision-making under uncertainty.
- Guaranteed Convergence: For discounted, finite MDPs, value and policy iteration converge to the optimal value function.
- Handles Stochasticity Well: Accounts for uncertain state transitions naturally.
5.2 Weaknesses
- Computationally Expensive: Solving large-scale MDPs requires significant memory and computation.
- Requires a Known Transition Model: Not always feasible in real-world problems where the model is unknown.
- Limited Scalability: Struggles with high-dimensional state-action spaces (curse of dimensionality).
5.3 Comparison with Alternatives
| Method | Strength | Weakness |
|---|---|---|
| MDP | Handles uncertainty, optimal-policy guarantees | Computationally expensive |
| Heuristic Methods | Fast, works well for specific tasks | No guarantee of optimality |
| Reinforcement Learning (Q-learning) | Can learn without a known model | Requires large amounts of data |
| Monte Carlo Methods | Useful for model-free environments | Slow convergence, high variance |
6. Basic Implementation
Below is a basic implementation of a Markov Decision Process (MDP) using Python. The implementation includes:
- Defining states, actions, rewards, and transition probabilities.
- Value iteration to compute the optimal value function (the greedy policy can then be read off these values; see the sketch in Section 7.5).
import numpy as np
# Define MDP components
states = ['S1', 'S2', 'S3']
actions = ['A1', 'A2']
transition_prob = {
'S1': {'A1': {'S1': 0.5, 'S2': 0.5}, 'A2': {'S2': 1.0}},
'S2': {'A1': {'S1': 0.2, 'S3': 0.8}, 'A2': {'S3': 1.0}},
'S3': {'A1': {'S3': 1.0}, 'A2': {'S3': 1.0}}
}
rewards = {
'S1': {'A1': 5, 'A2': 10},
'S2': {'A1': -1, 'A2': 2},
'S3': {'A1': 0, 'A2': 0}
}
gamma = 0.9 # Discount factor
threshold = 0.001 # Convergence threshold
# Initialize value function
value_function = {s: 0 for s in states}
def value_iteration():
while True:
delta = 0
new_value_function = value_function.copy()
for state in states:
max_value = float('-inf')
for action in actions:
expected_value = sum(
transition_prob[state][action][next_state] * (rewards[state][action] + gamma * value_function[next_state])
for next_state in transition_prob[state][action]
)
max_value = max(max_value, expected_value)
new_value_function[state] = max_value
delta = max(delta, abs(value_function[state] - max_value))
value_function.update(new_value_function)
if delta < threshold:
break
return value_function
# Run Value Iteration
optimal_values = value_iteration()
# Print the optimal value function
print("Optimal Value Function:")
for state, value in optimal_values.items():
print(f"V({state}) = {value:.4f}")
7. Dry Run: Step-by-Step Execution
7.1 Initial State
At the beginning, all state values are initialized to 0:
V(S1) = 0.0000
V(S2) = 0.0000
V(S3) = 0.0000
7.2 First Iteration
- For S1, evaluating action A1:
- Expected value: \( 0.5 \times (5 + 0.9 \times 0) + 0.5 \times (5 + 0.9 \times 0) = 5 \)
- For S1, evaluating action A2:
- Expected value: \( 1.0 \times (10 + 0.9 \times 0) = 10 \)
- Max value for S1: \( V(S1) = 10 \)
Similarly, S2 evaluates to \( \max(-1, 2) = 2 \) and S3 stays at 0.
7.3 Intermediate Iterations
Each iteration updates the value function based on the Bellman equation. The values converge over multiple iterations:
Iteration 1:
V(S1) = 10.0000
V(S2) = 2.0000
V(S3) = 0.0000
Iteration 2:
V(S1) = 11.8000
V(S2) = 2.0000
V(S3) = 0.0000
Iteration 3:
V(S1) = 11.8000
V(S2) = 2.0000
V(S3) = 0.0000
At iteration 3 no value changes, so delta = 0 < 0.001 and the loop terminates.
7.4 Final Converged Values
After three iterations, the value function has converged:
V(S1) = 11.80
V(S2) = 2.00
V(S3) = 0.00
7.5 Interpretation
- S1 has the highest value, meaning the best expected future rewards start from here.
- S3 has a value of 0 since no further rewards exist.
- The agent will choose actions that maximize these values, leading to an optimal policy.
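As a sketch of that last step, the greedy policy can be read off the converged values by taking, in each state, the action with the largest one-step backup. This reuses the states, actions, transition_prob, rewards, gamma, and optimal_values defined in Section 6; the helper name extract_policy is purely illustrative.
def extract_policy(optimal_values):
    # Greedy policy: in each state, pick the action whose one-step
    # Bellman backup against the converged values is largest.
    policy = {}
    for state in states:
        best_action, best_value = None, float('-inf')
        for action in actions:
            backup = sum(
                prob * (rewards[state][action] + gamma * optimal_values[next_state])
                for next_state, prob in transition_prob[state][action].items()
            )
            if backup > best_value:
                best_action, best_value = action, backup
        policy[state] = best_action
    return policy

print(extract_policy(optimal_values))  # expected: {'S1': 'A2', 'S2': 'A2', 'S3': 'A1'}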
8. Time & Space Complexity Analysis
Markov Decision Processes (MDPs) are typically solved using algorithms like Value Iteration and Policy Iteration. The time and space complexity of these approaches depend on:
- \( S \) - Number of states
- \( A \) - Number of actions per state
- \( T \) - Number of iterations until convergence
8.1 Value Iteration Complexity
- Each sweep updates every state's value using its transition probabilities, costing \( O(S^2 A) \); over \( T \) sweeps this gives \( O(S^2 A T) \).
- With sparse transitions (each state-action pair reaching only a constant number of successor states), a sweep drops to \( O(S A) \), for \( O(S A T) \) overall.
- The number of sweeps \( T \) needed for a given accuracy grows as the discount factor \( \gamma \) approaches 1 (roughly in proportion to \( 1/(1-\gamma) \)).
8.2 Policy Iteration Complexity
- Each iteration solves a linear system for policy evaluation, costing \( O(S^3) \), plus \( O(S^2 A) \) for the greedy improvement step.
- Policy iteration typically needs far fewer iterations than value iteration, but each iteration is more expensive.
- Evaluating the policy iteratively instead of exactly trades per-iteration accuracy for speed; a minimal sketch of the full algorithm follows below.
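The sketch below uses the same dictionary-based model format as Section 6. The function and parameter names (policy_iteration, eval_threshold) are illustrative, and policy evaluation is done with iterative sweeps rather than by solving the linear system exactly.
def policy_iteration(states, actions, transition_prob, rewards,
                     gamma=0.9, eval_threshold=1e-4):
    # Start from an arbitrary policy and alternate evaluation / improvement.
    policy = {s: actions[0] for s in states}
    values = {s: 0.0 for s in states}
    while True:
        # Policy evaluation (iterative sweeps instead of solving the linear system).
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                new_v = sum(p * (rewards[s][a] + gamma * values[s2])
                            for s2, p in transition_prob[s][a].items())
                delta = max(delta, abs(new_v - values[s]))
                values[s] = new_v
            if delta < eval_threshold:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in states:
            best_a = max(actions, key=lambda a: sum(
                p * (rewards[s][a] + gamma * values[s2])
                for s2, p in transition_prob[s][a].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:
            return policy, values
On the three-state example from Section 6 this returns the same greedy policy as the value-iteration sketch in Section 7.5.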
8.3 Comparison of Complexity
| Algorithm | Time Complexity | Best Use Case |
|---|---|---|
| Value Iteration | \( O(S^2 A T) \) | Cheap, simple sweeps; large or sparse models where many iterations are acceptable |
| Policy Iteration | \( O(S^3 + S^2 A) \) per iteration | Few iterations; moderate state spaces where exact policy evaluation is affordable |
9. Space Complexity Analysis
MDPs require storage for:
- State Values: \( O(S) \), as each state stores a value.
- Policy Storage: \( O(S) \), storing an action for each state.
- Transition Matrix: \( O(S^2 A) \), storing probabilities for state-action pairs.
- Reward Function: \( O(S A) \), mapping each (state, action) pair to a reward.
Thus, total space complexity is:
\( O(S^2 A) \) worst-case.
9.1 Space Consumption vs. Input Size
- For small MDPs: Space is manageable as state-action pairs are limited.
- For large MDPs: Memory can explode due to transition matrices growing as \( S^2 \).
- Solution: Use sparse representations or approximation methods (e.g., function approximation in reinforcement learning).
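To make the dense-versus-sparse gap concrete, a rough back-of-the-envelope sketch (the state and action counts and the 5-successor assumption are hypothetical):
import numpy as np

S, A = 10_000, 10
dense_bytes = S * A * S * np.dtype(np.float32).itemsize  # full |S| x |A| x |S| tensor
print(f"Dense transition tensor: {dense_bytes / 1e9:.1f} GB")  # ~4.0 GB

# The dict-of-dicts layout used throughout this article stores only the
# non-zero entries; with ~5 reachable successors per (state, action) pair,
# that is S * A * 5 = 500,000 entries instead of S * A * S = 10**9.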
10. Trade-offs in MDP Algorithms
MDP solutions involve trade-offs between time, space, and accuracy.
10.1 Computational Trade-offs
- Value Iteration: Each sweep is cheap, but many sweeps may be needed, especially as \( \gamma \) approaches 1.
- Policy Iteration: Far fewer iterations, but each iteration (exact policy evaluation) is costly.
- Approximate Methods: Reduce complexity but may sacrifice accuracy.
10.2 Space vs. Accuracy
- Storing full transition matrices improves accuracy but consumes excessive memory.
- Sparse matrix representations reduce space usage at the cost of slower lookups.
10.3 Discount Factor \( \gamma \)
- Higher \( \gamma \) (e.g., 0.99) prioritizes long-term rewards but increases computation time.
- Lower \( \gamma \) (e.g., 0.5) converges faster but may ignore distant rewards.
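A useful rule of thumb is that \( \gamma \) sets an effective planning horizon of roughly \( 1/(1-\gamma) \) steps, which makes this trade-off concrete:
$$ \gamma = 0.99 \;\Rightarrow\; \frac{1}{1-\gamma} = 100 \text{ steps}, \qquad \gamma = 0.5 \;\Rightarrow\; \frac{1}{1-\gamma} = 2 \text{ steps} $$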
10.4 Alternatives for Scalability
- Monte Carlo Methods: Require less memory but have high variance.
- Deep Q-Networks (DQN): Use neural networks to approximate MDP solutions.
- Hierarchical MDPs: Reduce complexity by decomposing large MDPs into smaller subproblems.
11. Optimizations & Variants
MDPs can be computationally expensive, but optimizations and algorithmic variants can significantly improve efficiency.
11.1 Common Optimizations
11.1.1 Adaptive Value Iteration
- Instead of updating all states in each iteration, update only states with significant changes.
- Reduces unnecessary computations and speeds up convergence.
11.1.2 Prioritized Sweeping
- Processes states with the highest estimated impact on policy first.
- Uses a priority queue to focus on the most relevant state transitions.
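A minimal model-based sketch of this idea, using a heap keyed by the magnitude of each state's Bellman error; the function name, the max_updates cap, and the re-queueing rule are illustrative choices rather than a canonical implementation.
import heapq

def prioritized_sweeping(states, actions, transition_prob, rewards,
                         gamma=0.9, threshold=1e-3, max_updates=10_000):
    # Back up the state with the largest Bellman error first; when a state's
    # value changes, re-queue its predecessors, whose backups may now be stale.
    values = {s: 0.0 for s in states}

    def backup(s):
        # One Bellman optimality backup for state s against the current values.
        return max(
            sum(p * (rewards[s][a] + gamma * values[s2])
                for s2, p in transition_prob[s][a].items())
            for a in actions)

    # Predecessor map: which states can transition into a given state.
    predecessors = {s: set() for s in states}
    for s in states:
        for a in actions:
            for s2 in transition_prob[s][a]:
                predecessors[s2].add(s)

    # Max-heap (negated priorities) seeded with every state's Bellman error.
    heap = [(-abs(backup(s) - values[s]), s) for s in states]
    heapq.heapify(heap)

    for _ in range(max_updates):
        if not heap:
            break
        neg_priority, s = heapq.heappop(heap)
        if -neg_priority < threshold:
            break  # largest remaining error is already below the threshold
        values[s] = backup(s)
        for pred in predecessors[s]:
            error = abs(backup(pred) - values[pred])
            if error > threshold:
                heapq.heappush(heap, (-error, pred))
    return values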
11.1.3 Asynchronous Value Iteration
- Updates values of states independently rather than synchronously.
- Reduces computational overhead while maintaining convergence guarantees.
11.1.4 Function Approximation
- For large MDPs, store value functions using neural networks instead of tables.
- Common in Deep Reinforcement Learning (DQN, PPO).
11.2 Variants of MDP Algorithms
11.2.1 Partially Observable MDP (POMDP)
- Used when the agent does not have full knowledge of the state.
- Maintains a belief state, which is a probability distribution over possible states.
- Example: Robotics with noisy sensors.
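The belief update itself is just Bayes' rule over the hidden state. A minimal sketch follows; the observation model obs_prob[next_state][observation] is an added assumption for illustration and is not part of the MDP components defined earlier.
def update_belief(belief, action, observation, transition_prob, obs_prob):
    # Bayes filter over the hidden state:
    #   b'(s') is proportional to O(observation | s') * sum_s P(s' | s, action) * b(s)
    new_belief = {}
    for s2 in belief:
        predicted = sum(b_s * transition_prob[s][action].get(s2, 0.0)
                        for s, b_s in belief.items())
        new_belief[s2] = obs_prob[s2][observation] * predicted
    total = sum(new_belief.values())
    # If the observation has zero probability under the model, keep the old belief.
    return {s: v / total for s, v in new_belief.items()} if total > 0 else belief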
11.2.2 Average Reward MDP
- Focuses on maximizing long-term average rewards instead of discounted rewards.
- Useful in scenarios where discounting future rewards is undesirable.
- Example: Continuous production systems in manufacturing.
11.2.3 Hierarchical MDP (H-MDP)
- Breaks large problems into smaller sub-MDPs for efficiency.
- Improves scalability in large-scale decision-making.
- Example: Multi-level task planning in AI.
11.2.4 Factored MDP
- Uses structured representations for state variables instead of flat state spaces.
- Reduces computation for high-dimensional state-action spaces.
- Example: Multi-agent systems in traffic management.
12. Iterative vs. Recursive Implementations
MDP algorithms can be implemented iteratively or recursively, each with its trade-offs.
12.1 Iterative Approach
The standard approach in Value Iteration and Policy Iteration follows an iterative loop:
def value_iteration(states, actions, transition_prob, rewards, gamma=0.9, threshold=0.001):
value_function = {s: 0 for s in states}
while True:
delta = 0
new_value_function = value_function.copy()
for state in states:
max_value = float('-inf')
for action in actions:
expected_value = sum(
transition_prob[state][action][next_state] *
(rewards[state][action] + gamma * value_function[next_state])
for next_state in transition_prob[state][action]
)
max_value = max(max_value, expected_value)
new_value_function[state] = max_value
delta = max(delta, abs(value_function[state] - max_value))
value_function.update(new_value_function)
if delta < threshold:
break
return value_function
Pros of Iterative Approach
- Efficient in memory usage (\(O(S)\)).
- Explicit control over stopping conditions.
- Can be optimized with heuristics (e.g., prioritized sweeping).
Cons of Iterative Approach
- Slower convergence for large MDPs.
- May require many iterations to reach an optimal policy.
12.2 Recursive Approach
A recursion can express the same Bellman backup directly. Because the MDP may contain cycles, the version below limits the recursion depth (a finite-horizon approximation) and memoizes on (state, depth):
def value_recursive(state, depth, actions, transition_prob, rewards, gamma=0.9, memo=None):
    # Depth-limited Bellman backup with memoization on (state, depth).
    # Without the depth cap, any cycle in the MDP would recurse forever.
    if memo is None:
        memo = {}
    if depth == 0:
        return 0.0
    if (state, depth) in memo:
        return memo[(state, depth)]
    max_value = float('-inf')
    for action in actions:
        expected_value = sum(
            prob * (rewards[state][action] +
                    gamma * value_recursive(next_state, depth - 1, actions,
                                            transition_prob, rewards, gamma, memo))
            for next_state, prob in transition_prob[state][action].items()
        )
        max_value = max(max_value, expected_value)
    memo[(state, depth)] = max_value
    return max_value
Pros of Recursive Approach
- Mirrors the mathematical recurrence closely, which can make it easier to read.
- With memoization it becomes a standard dynamic-programming solution for finite horizons.
Cons of Recursive Approach
- Call-stack depth grows with the horizon \( D \), and Python's recursion limit caps how deep it can go.
- Without memoization the same subproblems are recomputed exponentially often; without a depth cap, cycles never terminate.
12.3 Iterative vs. Recursive Comparison
| Method | Time Complexity | Space Complexity | Best Use Case |
|---|---|---|---|
| Iterative | \( O(S^2 A T) \) | \( O(S) \) | Large state spaces, stable convergence |
| Recursive (depth-limited, memoized) | \( O(S^2 A D) \) for recursion depth \( D \) | \( O(S D) \) memo plus \( O(D) \) stack | Small state spaces or short horizons, closeness to the mathematical recurrence |
13. Edge Cases & Failure Handling
MDPs can fail or behave unexpectedly in various scenarios. Recognizing edge cases and handling failures is crucial.
13.1 Common Pitfalls & Edge Cases
13.1.1 Infinite Loops in Value Iteration
- If the stopping threshold is too small or improperly set, the algorithm may run indefinitely.
- Solution: Set a reasonable threshold (e.g., \( 10^{-3} \) to \( 10^{-6} \)) and limit iterations.
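A sketch of such a guard, in the same dictionary-based style as the earlier code; the max_iterations cap and the in-place updates are illustrative choices.
def value_iteration_capped(states, actions, transition_prob, rewards,
                           gamma=0.9, threshold=1e-3, max_iterations=10_000):
    # Same Bellman sweeps as before, but with a hard cap on the number of
    # sweeps so that a threshold set too tight cannot loop forever.
    values = {s: 0.0 for s in states}
    for sweep in range(max_iterations):
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (rewards[s][a] + gamma * values[s2])
                    for s2, p in transition_prob[s][a].items())
                for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update, no copy
        if delta < threshold:
            return values
    raise RuntimeError(f"No convergence after {max_iterations} sweeps")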
13.1.2 Zero Transition Probability
- If a transition distribution assigns zero probability to every successor (or omits states entirely), parts of the state space become unreachable and their values never influence the solution.
- Solution: Ensure each \( P(\cdot \mid s, a) \) is a proper probability distribution (non-negative, summing to 1) and that every state of interest is reachable from the start state.
13.1.3 Negative Rewards & Infinite Loops
- Poorly shaped rewards can create cycles the agent prefers to exploit forever instead of reaching a terminal state (for example, an undiscounted positive-reward loop, or a terminal penalty that is worse than the cost of looping).
- Solution: Introduce absorbing states and discounting (\( \gamma < 1 \)) so that looping cannot accumulate unbounded value.
13.1.4 High Discount Factor (\( \gamma \)) Impact
- A high \( \gamma \) (e.g., 0.99) places excessive importance on long-term rewards, slowing convergence.
- Solution: Use a balanced discount factor (e.g., 0.9).
13.1.5 Unreachable Terminal States
- If certain terminal states have no path, the algorithm may keep looping in non-terminal states.
- Solution: Ensure all terminal states have feasible transitions.
14. Test Cases to Verify Correctness
To ensure correctness, MDP implementations should be tested with various input scenarios.
14.1 Basic Test Cases
14.1.1 Test Convergence
Ensure value iteration terminates within a reasonable number of iterations.
def test_convergence():
states = ['S1', 'S2']
actions = ['A1']
transition_prob = {
'S1': {'A1': {'S2': 1.0}},
'S2': {'A1': {'S1': 1.0}}
}
rewards = {'S1': {'A1': 1}, 'S2': {'A1': 2}}
optimal_values = value_iteration(states, actions, transition_prob, rewards)
    assert abs(optimal_values['S1'] - 2.8 / 0.19) < 0.1, "Convergence failed"  # V(S1) = (1 + 0.9*2) / (1 - 0.81)
14.1.2 Test Absorbing-State Handling
The value of an absorbing state with a constant per-step reward should converge to reward / (1 - gamma).
def test_absorbing_state():
states = ['S1', 'S2']
actions = ['A1']
transition_prob = {
'S1': {'A1': {'S2': 1.0}},
'S2': {'A1': {'S2': 1.0}} # S2 is an absorbing state
}
rewards = {'S1': {'A1': 1}, 'S2': {'A1': 2}}
optimal_values = value_iteration(states, actions, transition_prob, rewards)
    assert abs(optimal_values['S2'] - 20) < 0.1, "Absorbing state miscomputed"  # 2 / (1 - 0.9)
14.1.3 Test Large State Space
Ensure performance does not degrade in large state spaces.
def test_large_state_space():
states = [f"S{i}" for i in range(100)]
actions = ['A1']
transition_prob = {s: {'A1': {s: 1.0}} for s in states}
rewards = {s: {'A1': i} for i, s in enumerate(states)}
optimal_values = value_iteration(states, actions, transition_prob, rewards)
    assert abs(optimal_values['S99'] - 990) < 0.1, "Large state space failed"  # 99 / (1 - 0.9)
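A minimal way to run these checks without a test framework, assuming the three test functions and the parameterized value_iteration from Section 12.1 live in the same module:
if __name__ == "__main__":
    test_convergence()
    test_absorbing_state()
    test_large_state_space()
    print("All MDP test cases passed.")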
15. Real-World Failure Scenarios
15.1 Reinforcement Learning with Poor Reward Shaping
- Problem: If rewards are sparse, the agent may fail to learn a meaningful policy.
- Example: A robot learning to walk with rewards only at the final step may never reach it.
- Solution: Use reward shaping to provide intermediate rewards.
15.2 Stochastic Environments with Unmodeled Dynamics
- Problem: If real-world transition probabilities differ from those modeled, the policy may fail.
- Example: Self-driving cars using MDPs may struggle if road conditions are not accounted for.
- Solution: Use adaptive MDPs that update transition probabilities dynamically.
15.3 High Computational Costs in Large-Scale Problems
- Problem: MDPs with large state spaces become computationally intractable.
- Example: Multi-agent traffic systems modeling millions of vehicles.
- Solution: Use hierarchical or function approximation methods (e.g., Deep Q-Networks).
15.4 Policy Instability in Real-Time Systems
- Problem: If the MDP environment changes dynamically, policies may become invalid.
- Example: Dynamic pricing models in e-commerce failing due to sudden demand spikes.
- Solution: Implement online MDP learning with real-time adjustments.
16. Real-World Applications & Industry Use Cases
Markov Decision Processes (MDPs) are widely used in various industries to solve sequential decision-making problems under uncertainty.
16.1 Robotics & Automation
- Path Planning: Autonomous robots use MDPs for navigation in uncertain environments.
- Task Scheduling: Industrial robots determine the optimal sequence of actions to maximize efficiency.
16.2 Finance & Economics
- Portfolio Optimization: MDPs help investors determine the best strategy for buying/selling assets over time.
- Risk Management: Used in fraud detection, credit risk analysis, and algorithmic trading.
16.3 Healthcare
- Treatment Planning: MDPs optimize personalized treatment plans for chronic diseases.
- Drug Discovery: Helps model the impact of new drugs through clinical trials and simulations.
16.4 Artificial Intelligence & Gaming
- Reinforcement Learning (RL): Deep Q-learning uses MDPs as the foundation for AI training.
- Game AI: Video game NPCs use MDP-based strategies for decision-making.
16.5 Traffic Management & Smart Cities
- Adaptive Traffic Signals: MDPs optimize traffic light timings to reduce congestion.
- Autonomous Vehicles: Self-driving cars use MDPs for decision-making in dynamic environments.
17. Open-Source Implementations
Several open-source libraries and frameworks implement MDPs efficiently:
17.1 MDPToolbox (Python)
- Repository: MDPToolbox
- Features: Implements value iteration, policy iteration, and Q-learning.
17.2 OpenAI Gym
- Repository: OpenAI Gym
- Features: Provides reinforcement learning environments built on MDPs.
17.3 RLlib (Ray Reinforcement Learning)
- Repository: Ray RLlib
- Features: Scalable reinforcement learning framework using MDP principles.
18. Practical Project: AI-Based Traffic Signal Control
This project simulates a smart traffic light system using MDPs.
18.1 Problem Statement
Design an MDP-based system to optimize traffic signal timing at an intersection, minimizing wait times.
18.2 State Space
- State = {LowTraffic, MediumTraffic, HighTraffic}
18.3 Actions
- Actions = {ShortGreen, MediumGreen, LongGreen}
18.4 Reward Function
- Minimizing waiting time is rewarded.
- Heavy congestion receives penalties.
18.5 Python Implementation
import numpy as np
# Define MDP components
states = ['LowTraffic', 'MediumTraffic', 'HighTraffic']
actions = ['ShortGreen', 'MediumGreen', 'LongGreen']
# Transition probabilities
transition_prob = {
'LowTraffic': {'ShortGreen': {'LowTraffic': 0.8, 'MediumTraffic': 0.2},
'MediumGreen': {'LowTraffic': 0.9, 'MediumTraffic': 0.1},
'LongGreen': {'LowTraffic': 1.0}},
'MediumTraffic': {'ShortGreen': {'MediumTraffic': 0.7, 'HighTraffic': 0.3},
'MediumGreen': {'MediumTraffic': 0.6, 'LowTraffic': 0.4},
'LongGreen': {'LowTraffic': 1.0}},
'HighTraffic': {'ShortGreen': {'HighTraffic': 0.9, 'MediumTraffic': 0.1},
'MediumGreen': {'MediumTraffic': 0.8, 'LowTraffic': 0.2},
'LongGreen': {'LowTraffic': 1.0}}
}
# Reward function
rewards = {
'LowTraffic': {'ShortGreen': 5, 'MediumGreen': 10, 'LongGreen': 15},
'MediumTraffic': {'ShortGreen': -5, 'MediumGreen': 5, 'LongGreen': 10},
'HighTraffic': {'ShortGreen': -10, 'MediumGreen': -5, 'LongGreen': 5}
}
gamma = 0.9 # Discount factor
threshold = 0.001 # Convergence threshold
# Initialize value function
value_function = {s: 0 for s in states}
def value_iteration():
while True:
delta = 0
new_value_function = value_function.copy()
for state in states:
max_value = float('-inf')
for action in actions:
expected_value = sum(
transition_prob[state][action][next_state] *
(rewards[state][action] + gamma * value_function[next_state])
for next_state in transition_prob[state][action]
)
max_value = max(max_value, expected_value)
new_value_function[state] = max_value
delta = max(delta, abs(value_function[state] - max_value))
value_function.update(new_value_function)
if delta < threshold:
break
return value_function
# Run Value Iteration
optimal_values = value_iteration()
# Print the optimal value function
print("Optimal Traffic Signal Timing Policy:")
for state, value in optimal_values.items():
print(f"V({state}) = {value:.4f}")
18.6 Expected Outcome
- Acting greedily on the converged values (picking, for each traffic level, the green-light duration with the highest one-step backup) gives the recommended signal timing.
- The aim is to reduce congestion and improve traffic flow at the intersection.
18.7 Extensions
- Integrate real-time traffic data.
- Use Reinforcement Learning to dynamically update MDP parameters.
19. Competitive Programming & System Design Integration
19.1 Using MDP in Competitive Programming
MDPs are useful in coding contests for problems that involve decision-making over time with uncertain outcomes. Key areas where MDPs are applicable:
- Grid Navigation Problems: Finding the optimal path in a probabilistic environment.
- Game Theory: Modeling moves in strategy games.
- Probability-based Optimization: Making the best decision when facing uncertain outcomes.
- AI Strategy Games: AI learning optimal policies for turn-based games.
Example Problem: Probabilistic Pathfinding
Given an \( N \times N \) grid where each cell has a probability of being blocked, find the optimal path to reach the destination with the highest probability of success.
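One simple way to formalize this, assuming movement is restricted to right/down steps and cell blockages are independent, is to maximize the product of per-cell "free" probabilities along the path. That reduces to a deterministic dynamic program, the special case of an MDP with deterministic transitions; the grid values below are only an illustration.
def best_success_probability(p_free):
    # Maximize the product of "cell is free" probabilities along a monotone
    # (right/down) path from (0, 0) to (N-1, N-1). With deterministic moves,
    # value iteration collapses to a single dynamic-programming pass.
    n = len(p_free)
    best = [[0.0] * n for _ in range(n)]
    best[0][0] = p_free[0][0]
    for i in range(n):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            from_up = best[i - 1][j] if i > 0 else 0.0
            from_left = best[i][j - 1] if j > 0 else 0.0
            best[i][j] = p_free[i][j] * max(from_up, from_left)
    return best[n - 1][n - 1]

grid = [[1.0, 0.9, 0.8],
        [0.7, 0.9, 0.6],
        [1.0, 0.8, 0.9]]
print(best_success_probability(grid))  # ~0.5832 for this 3x3 example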
19.2 System Design Integration
MDPs play a crucial role in designing large-scale systems where long-term decision-making is required. Some real-world use cases:
- Ad Recommendation Systems: Selecting ads to display based on user interactions over time.
- Autonomous Driving: Deciding lane changes, acceleration, and braking under uncertain traffic conditions.
- Cloud Resource Allocation: Allocating computing resources dynamically to minimize costs and maximize performance.
- Customer Retention Models: Predicting customer behavior and adjusting marketing strategies dynamically.
Example System Design Problem
Design a dynamic pricing model for an e-commerce platform using MDP. The system should:
- Determine the best price adjustments based on demand and competition.
- Maximize long-term revenue instead of immediate profits.
- Adapt dynamically as customer preferences evolve.
20. Assignments
20.1 Solve at Least 10 Problems Using MDP
Practice solving problems related to Markov Decision Processes. Recommended problems:
- Grid Navigation with Obstacles (Optimal Path Planning)
- Robot Movement in a Probabilistic Environment
- Game Strategy Optimization (Chess-like Move Selection)
- Stock Trading Policy Optimization
- Warehouse Inventory Management
- Optimal Stopping Problems (Casino Betting Strategies)
- Job Scheduling in Cloud Computing
- Traffic Signal Optimization
- Dynamic Pricing in E-commerce
- AI Learning to Play a Simple Game
20.2 Use MDP in a System Design Problem
Apply MDP principles to design a real-world system. Choose one of the following:
- Develop a chatbot that learns user preferences over time.
- Optimize ad display timing to maximize user engagement.
- Design a reinforcement learning model for self-driving cars.
- Implement a risk management system for investment portfolios.
20.3 Implement MDP Under Time Constraints
Simulate a competitive programming environment:
- Pick a problem from the list above.
- Set a strict time limit (e.g., 1 hour).
- Implement a working solution efficiently.
- Optimize for both accuracy and execution speed.
Bonus: Compare execution time between Value Iteration and Policy Iteration for large state spaces.