# Markov Decision Process
A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It provides a formal way to describe sequential decision problems where an agent interacts with an environment.
## MDP Components
An MDP is defined by the tuple (S, A, P, R, γ):
- S (State Space): A finite or infinite set of possible states of the environment.
- A (Action Space): A finite or infinite set of possible actions that the agent can take.
- P (Transition Probability): A function $P(s' \mid s, a)$ that defines the probability of transitioning from state $s$ to state $s'$ given that action $a$ was taken.
- R (Reward Function): A function $R(s, a)$ that defines the immediate reward received after taking action $a$ in state $s$.
- γ (Discount Factor): A parameter $\gamma \in [0, 1]$ that determines how much future rewards are valued compared to immediate rewards.
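To make the tuple concrete, here is a minimal sketch of a two-state MDP represented with plain Python dictionaries; the state names, transition probabilities, and rewards are invented purely for illustration:

```python
# A minimal, illustrative MDP as plain Python dictionaries.
# States, actions, probabilities, and rewards below are made up for the example.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor

# P[(s, a)] maps next states to transition probabilities P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}
```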
## MDP Process
- The agent starts in a state $s$.
- The agent chooses an action $a$ from the available action space.
- The environment responds by transitioning the agent to a new state $s'$ according to $P(s' \mid s, a)$.
- The agent receives a reward $R(s, a)$.
- The process repeats, forming a trajectory of states, actions, and rewards.
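This loop can be simulated directly. A small sketch, assuming the dictionary-based toy MDP from the previous example and a uniformly random policy (the function names are illustrative, not from any library):

```python
import random

def sample_next_state(P, s, a):
    """Sample s' from P(s' | s, a) for the dictionary-based MDP above."""
    next_states = list(P[(s, a)].keys())
    probs = list(P[(s, a)].values())
    return random.choices(next_states, weights=probs)[0]

def rollout(P, R, actions, start_state, horizon=10):
    """Generate a trajectory of (state, action, reward) under a random policy."""
    s = start_state
    trajectory = []
    for _ in range(horizon):
        a = random.choice(actions)           # a (uniformly) random policy
        r = R[(s, a)]                        # immediate reward R(s, a)
        s_next = sample_next_state(P, s, a)  # environment transition
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```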
## Goal of an MDP
The objective in an MDP is to find an optimal policy $\pi^*$, a mapping from states to actions that maximizes the expected cumulative reward over time:

$$
V^\pi(s) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \;\middle|\; s_0 = s,\; a_t = \pi(s_t) \right]
$$

where $V^\pi(s)$ is the expected return when following policy $\pi$ from state $s$.
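For small, finite MDPs, one standard way to find such a policy is value iteration, which repeatedly applies the Bellman optimality update until the values converge. A hedged sketch, again assuming the dictionary-based toy representation introduced above (this is one common solution method, not the only one):

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-6):
    """Compute V*(s) and a greedy optimal policy for the dictionary-based MDP."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality update: max over actions of expected return.
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Extract a greedy policy from the converged values.
    policy = {
        s: max(
            actions,
            key=lambda a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()),
        )
        for s in states
    }
    return V, policy
```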
## Applications
MDPs are widely used in:
- Reinforcement Learning (RL): Where agents learn optimal policies through trial and error.
- Robotics: For autonomous navigation and planning.
- Finance: To model decision-making under uncertainty.
- Healthcare: For optimal treatment planning.
- Game AI: To optimize strategies in games.
## How to Determine Whether a Problem Is an MDP
Here's how to determine if a problem is a Markov Decision Process (MDP) and whether Q-learning applies, with concrete examples:
### Key Criteria for MDP (Q-learning Applicability)
- Markov Property: The next state and reward depend only on the current state and action, not on previous states/actions.
- Fully Observable State: The agent has complete information about the current state.
- Discrete Time Steps: The problem can be divided into clear decision points.
- Defined Actions/States: There is a finite set of possible actions and identifiable states.
### Positive Examples (MDP Problems)
1. Grid World Navigation ✅ Why MDP:
   - Current position fully describes the state
   - Next position depends only on current location + chosen movement
   - Clear rewards (e.g., +10 for reaching the goal)
   - (A minimal Q-learning sketch for this setting appears after this list.)
2. Chess Game AI ✅ Why MDP:
   - Current board position fully defines the state
   - Next state depends only on current pieces + move chosen
   - Reward = checkmate (terminal state)
3. Inventory Management ✅ Why MDP:
   - State = (current stock levels, known demand)
   - Next state depends only on current stock + restocking action
   - Reward = profit from sales
4. Robot Path Planning ✅ Why MDP:
   - State = (current coordinates, sensor readings)
   - Next position depends on current location + movement command
   - Reward = reaching destination
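For the grid-world example above, tabular Q-learning is the textbook fit. Below is a minimal sketch assuming a 4x4 deterministic grid, a +10 goal reward with a -1 step penalty, and arbitrary hyperparameters; all of these choices are illustrative, not prescribed by the text:

```python
import random
from collections import defaultdict

# Hedged sketch: tabular Q-learning on a tiny deterministic grid world.
# Grid size, rewards, and hyperparameters are illustrative choices.
SIZE = 4                                        # 4x4 grid, states are (row, col)
GOAL = (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def step(state, action):
    """Deterministic transition: move within bounds; +10 at the goal, -1 per step."""
    r = min(max(state[0] + action[0], 0), SIZE - 1)
    c = min(max(state[1] + action[1], 0), SIZE - 1)
    next_state = (r, c)
    reward = 10.0 if next_state == GOAL else -1.0
    return next_state, reward, next_state == GOAL

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(state, action)] -> value
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = step(state, action)
            # Q-learning update: bootstrap from the best next action.
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```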
### Negative Examples (Non-MDP Problems)
1. Poker Game ❌ Why Not MDP:
   - Hidden information (opponents' cards) breaks the Markov property
   - Need to infer hidden states from history (→ POMDP)
   - Requires memory of past actions/bets
2. Real-Time Strategy Games ❌ Why Not MDP:
   - Partial observability ("fog of war")
   - Next state depends on hidden enemy units/actions
   - Requires state estimation beyond current observations
3. Stock Trading ❌ Why Not MDP (usually):
   - True state includes hidden market factors (e.g., investor sentiment)
   - Reward depends on future prices (non-Markovian dynamics)
   - Often requires handling continuous states
4. Customer Service Chatbot ❌ Why Not MDP (basic implementations):
   - State would need to include the conversation history
   - User intent is often hidden (→ POMDP)
   - The next response depends on the entire dialogue context
### How to Test for MDP
1. State Completeness Test: Ask, "Does the current state contain ALL the information needed to predict the next state and reward?"
   - Yes → Markovian
   - No → Non-Markovian
2. History Independence Test: If two different histories lead to the same state, do they have:
   - The same transition probabilities?
   - The same expected rewards?
   If yes to both → Markovian (a small empirical version of this check is sketched after this list)
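The history independence test can also be approximated empirically from logged trajectories. The sketch below compares next-state frequencies conditioned on (state, action) with those conditioned on (previous state, state, action); the function name and data format are assumptions made for illustration:

```python
from collections import defaultdict

def markov_check(trajectories):
    """trajectories: list of [(state, action), ...] sequences of observed steps.

    Returns the largest gap between next-state frequencies estimated with and
    without conditioning on the previous state; a value near 0 suggests the
    chosen state representation looks Markovian in this data.
    """
    by_sa = defaultdict(lambda: defaultdict(int))       # (s, a) -> next-state counts
    by_hist_sa = defaultdict(lambda: defaultdict(int))  # (prev, s, a) -> next-state counts
    for traj in trajectories:
        for t in range(1, len(traj) - 1):
            prev_s, _ = traj[t - 1]
            s, a = traj[t]
            s_next, _ = traj[t + 1]
            by_sa[(s, a)][s_next] += 1
            by_hist_sa[(prev_s, s, a)][s_next] += 1
    worst = 0.0
    for (prev_s, s, a), counts in by_hist_sa.items():
        total_h = sum(counts.values())
        total = sum(by_sa[(s, a)].values())
        for s_next in by_sa[(s, a)]:
            p_hist = counts.get(s_next, 0) / total_h
            p_plain = by_sa[(s, a)][s_next] / total
            worst = max(worst, abs(p_hist - p_plain))
    return worst
```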
### When Q-Learning Might Still Work
Even if a problem isn't perfectly Markovian, Q-learning can sometimes be effective with:
- State Engineering: Add historical information to the state (e.g., include the last 3 sensor readings in robot navigation); a sketch of this idea follows this list
- Approximation: Use neural networks (Deep Q-Networks) to handle partial observability
- Domain Simplification: Treat partially observable problems as MDPs (common in practice)
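As a concrete example of state engineering, one common trick is to stack the last k observations into a single state. A minimal sketch, assuming a hypothetical environment object with reset() and step() methods (this interface is an assumption, not a specific library API):

```python
from collections import deque

class StackedStateWrapper:
    """Augment a partially observable environment with the last k observations,
    so the stacked tuple behaves more like a Markovian state."""

    def __init__(self, env, k=3):
        self.env = env            # hypothetical env with reset()/step(action)
        self.k = k
        self.buffer = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        self.buffer.clear()
        for _ in range(self.k):
            self.buffer.append(obs)   # pad with the first observation
        return tuple(self.buffer)     # stacked "state"

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.buffer.append(obs)
        return tuple(self.buffer), reward, done
```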
### Summary Table
| Scenario | MDP? | Why? | Q-learning Applicable? |
|---|---|---|---|
| Chess | Yes | Full board state observable | ✅ Yes |
| Poker | No | Hidden cards | ❌ Needs POMDP |
| Drone Navigation | Yes | GPS + sensors give full state | ✅ Yes |
| Weather Prediction | No | Depends on past trends | ❌ Use RNNs/LSTMs |
| Elevator Control | Yes | Current floor + requests | ✅ Yes |
| Medical Diagnosis | No | Hidden symptoms/patient history | ❌ Needs POMDP |
This framework helps you assess whether Q-learning is appropriate or if you need more advanced techniques (like Deep Q-Networks or POMDP solvers).