# Markov Decision Process
A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It provides a formal way to describe sequential decision problems where an agent interacts with an environment.
## MDP Components
An MDP is defined by the tuple (S, A, P, R, γ):
- S (State Space): A finite or infinite set of possible states of the environment.
- A (Action Space): A finite or infinite set of possible actions that the agent can take.
- P (Transition Probability): A function $P(s' \mid s, a)$ that defines the probability of transitioning from state $s$ to state $s'$ given that action $a$ was taken.
- R (Reward Function): A function $R(s, a)$ that defines the immediate reward received after taking action $a$ in state $s$.
- γ (Discount Factor): A parameter $\gamma \in [0, 1]$ that determines how much future rewards are valued compared to immediate rewards.
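To make the tuple concrete, here is a minimal sketch of a two-state MDP represented with plain Python dictionaries; the state names, transition probabilities, and rewards are invented purely for illustration:

```python
# A minimal, illustrative MDP as plain Python dictionaries.
# States, actions, probabilities, and rewards below are made up for the example.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

# P[(s, a)] maps next states to transition probabilities P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}
```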
## MDP Process
- The agent starts in a state $s$.
- The agent chooses an action $a$ from the available action space.
- The environment responds by transitioning the agent to a new state $s'$ according to $P(s' \mid s, a)$.
- The agent receives a reward $R(s, a)$.
- The process repeats, forming a trajectory of states, actions, and rewards.
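This loop can be simulated directly. A small sketch, assuming the dictionary-based toy MDP from the previous example and a uniformly random policy (the function names are illustrative, not from any library):

```python
import random

def sample_next_state(P, s, a):
    """Sample s' from P(s' | s, a) for the dictionary-based MDP above."""
    next_states = list(P[(s, a)].keys())
    probs = list(P[(s, a)].values())
    return random.choices(next_states, weights=probs)[0]

def rollout(P, R, actions, start_state, horizon=10):
    """Generate a trajectory of (state, action, reward) under a random policy."""
    s = start_state
    trajectory = []
    for _ in range(horizon):
        a = random.choice(actions)           # a (uniformly) random policy
        r = R[(s, a)]                        # immediate reward R(s, a)
        s_next = sample_next_state(P, s, a)  # environment transition
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```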
## Goal of an MDP
The objective in an MDP is to find an optimal policy $\pi^*$, a mapping from states to actions that maximizes the expected cumulative reward over time:

$$
V^\pi(s) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s,\; a_t = \pi(s_t) \right]
$$

where $V^\pi(s)$ is the expected return when following policy $\pi$ from state $s$.
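For small, finite MDPs, one standard way to find such a policy is value iteration, which repeatedly applies the Bellman optimality update until the values converge. A hedged sketch, again assuming the dictionary-based toy representation introduced above (this is one common solution method, not the only one):

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-6):
    """Compute V*(s) and a greedy optimal policy for the dictionary-based MDP."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality update: max over actions of expected return.
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Extract a greedy policy from the converged values.
    policy = {
        s: max(
            actions,
            key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()),
        )
        for s in states
    }
    return V, policy
```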
## Applications
MDPs are widely used in:
- Reinforcement Learning (RL): Where agents learn optimal policies through trial and error.
- Robotics: For autonomous navigation and planning.
- Finance: To model decision-making under uncertainty.
- Healthcare: For optimal treatment planning.
- Game AI: To optimize strategies in games.
## How to Determine Whether a Problem Is an MDP
Here's how to determine if a problem is a Markov Decision Process (MDP) and whether Q-learning applies, with concrete examples:
### Key Criteria for MDP (Q-learning Applicability)
- Markov Property: The next state and reward depend only on the current state and action, not on previous states/actions.
- Fully Observable State: The agent has complete information about the current state.
- Discrete Time Steps: The problem can be divided into clear decision points.
- Defined Actions/States: There is a finite set of possible actions and identifiable states.
### Positive Examples (MDP Problems)
1. Grid World Navigation ✅ Why MDP:
   - Current position fully describes the state
   - Next position depends only on current location + chosen movement
   - Clear rewards (e.g., +10 for reaching the goal)
   - (A minimal Q-learning sketch for this setting appears after this list.)
2. Chess Game AI ✅ Why MDP:
   - Current board position fully defines the state
   - Next state depends only on current pieces + move chosen
   - Reward = checkmate (terminal state)
3. Inventory Management ✅ Why MDP:
   - State = (current stock levels, known demand)
   - Next state depends only on current stock + restocking action
   - Reward = profit from sales
4. Robot Path Planning ✅ Why MDP:
   - State = (current coordinates, sensor readings)
   - Next position depends on current location + movement command
   - Reward = reaching destination
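For the grid-world example above, tabular Q-learning is the textbook fit. Below is a minimal sketch assuming a 4x4 deterministic grid, a +10 goal reward with a -1 step penalty, and arbitrary hyperparameters; all of these choices are illustrative, not prescribed by the text:

```python
import random
from collections import defaultdict

# Hedged sketch: tabular Q-learning on a tiny deterministic grid world.
# Grid size, rewards, and hyperparameters are illustrative choices.
SIZE = 4                                        # 4x4 grid, states are (row, col)
GOAL = (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def step(state, action):
    """Deterministic transition: move within bounds; +10 at the goal, -1 per step."""
    r = min(max(state[0] + action[0], 0), SIZE - 1)
    c = min(max(state[1] + action[1], 0), SIZE - 1)
    next_state = (r, c)
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> value
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = step(state, action)
            # Q-learning update: bootstrap from the best next action.
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```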
### Negative Examples (Non-MDP Problems)
1. Poker Game ❌ Why Not MDP:
   - Hidden information (opponents' cards) breaks the Markov property
   - Need to infer hidden states from history (→ POMDP)
   - Requires memory of past actions/bets
2. Real-Time Strategy Games ❌ Why Not MDP:
   - Partial observability ("fog of war")
   - Next state depends on hidden enemy units/actions
   - Requires state estimation beyond current observations
3. Stock Trading ❌ Why Not MDP (usually):
   - True state includes hidden market factors (e.g., investor sentiment)
   - Reward depends on future prices (non-Markovian dynamics)
   - Often requires handling continuous states
4. Customer Service Chatbot ❌ Why Not MDP (basic implementations):
   - State would need to include the conversation history
   - User intent is often hidden (→ POMDP)
   - The next response depends on the entire dialogue context
### How to Test for MDP
1. State Completeness Test: Ask, "Does the current state contain ALL the information needed to predict the next state and reward?"
   - Yes → Markovian
   - No → Non-Markovian
2. History Independence Test: If two different histories lead to the same state, do they have:
   - The same transition probabilities?
   - The same expected rewards?
   If yes to both → Markovian (a small empirical version of this check is sketched after this list)
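The history independence test can also be approximated empirically from logged trajectories. The sketch below compares next-state frequencies conditioned on (state, action) with those conditioned on (previous state, state, action); the function name and data format are assumptions made for illustration:

```python
from collections import defaultdict

def markov_check(trajectories):
    """trajectories: list of [(state, action), ...] sequences of observed steps.

    Returns the largest gap between next-state frequencies estimated with and
    without conditioning on the previous state; a value near 0 suggests the
    chosen state representation looks Markovian in this data.
    """
    by_sa = defaultdict(lambda: defaultdict(int))       # (s, a) -> next-state counts
    by_hist_sa = defaultdict(lambda: defaultdict(int))  # (prev, s, a) -> next-state counts
    for traj in trajectories:
        for t in range(1, len(traj) - 1):
            prev_s, _ = traj[t - 1]
            s, a = traj[t]
            s_next, _ = traj[t + 1]
            by_sa[(s, a)][s_next] += 1
            by_hist_sa[(prev_s, s, a)][s_next] += 1
    worst = 0.0
    for (prev_s, s, a), counts in by_hist_sa.items():
        total_h = sum(counts.values())
        total = sum(by_sa[(s, a)].values())
        for s_next in by_sa[(s, a)]:
            p_hist = counts.get(s_next, 0) / total_h
            p_plain = by_sa[(s, a)][s_next] / total
            worst = max(worst, abs(p_hist - p_plain))
    return worst
```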
### When Q-Learning Might Still Work
Even if a problem isn't perfectly Markovian, Q-learning can sometimes be effective with:
- State Engineering: Add historical information to the state (e.g., include the last 3 sensor readings in robot navigation); a sketch of this idea follows this list
- Approximation: Use neural networks (Deep Q-Networks) to handle partial observability
- Domain Simplification: Treat partially observable problems as MDPs (common in practice)
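As a concrete example of state engineering, one common trick is to stack the last k observations into a single state. A minimal sketch, assuming a hypothetical environment object with reset() and step() methods (this interface is an assumption, not a specific library API):

```python
from collections import deque

class StackedStateWrapper:
    """Augment a partially observable environment with the last k observations,
    so the stacked tuple behaves more like a Markovian state."""

    def __init__(self, env, k=3):
        self.env = env            # hypothetical env with reset()/step(action)
        self.k = k
        self.buffer = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        self.buffer.clear()
        for _ in range(self.k):
            self.buffer.append(obs)   # pad with the first observation
        return tuple(self.buffer)     # stacked "state"

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.buffer.append(obs)
        return tuple(self.buffer), reward, done
```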
### Summary Table
| Scenario | MDP? | Why? | Q-learning Applicable? |
|---|---|---|---|
| Chess | Yes | Full board state observable | ✅ Yes |
| Poker | No | Hidden cards | ❌ Needs POMDP |
| Drone Navigation | Yes | GPS + sensors give full state | ✅ Yes |
| Weather Prediction | No | Depends on past trends | ❌ Use RNNs/LSTMs |
| Elevator Control | Yes | Current floor + requests | ✅ Yes |
| Medical Diagnosis | No | Hidden symptoms/patient history | ❌ Needs POMDP |
This framework helps you assess whether Q-learning is appropriate or if you need more advanced techniques (like Deep Q-Networks or POMDP solvers).