Markov decision process - chunhualiao/public-docs GitHub Wiki

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. It provides a formal way to describe sequential decision problems where an agent interacts with an environment.

MDP Components

An MDP is defined by the tuple (S, A, P, R, γ):

  1. S (State Space): A finite or infinite set of possible states of the environment.
  2. A (Action Space): A finite or infinite set of possible actions that the agent can take.
  3. P (Transition Probability): A function ( P(s' | s, a) ) that defines the probability of transitioning from state ( s ) to state ( s' ) given that action ( a ) was taken.
  4. R (Reward Function): A function ( R(s, a) ) that defines the immediate reward received after taking action ( a ) in state ( s ).
  5. γ (Discount Factor): A parameter ( \gamma \in [0,1] ) that determines how much future rewards are valued compared to immediate rewards.
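
To make the tuple concrete, here is a minimal sketch of a tiny, made-up two-state MDP encoded as plain Python dictionaries; the state names, actions, transition probabilities, and rewards are invented purely for illustration.

```python
# A minimal, hypothetical two-state MDP encoded as plain Python data structures.
# States, actions, probabilities, and rewards are invented purely for illustration.

states = ["healthy", "broken"]          # S: state space
actions = ["maintain", "ignore"]        # A: action space
gamma = 0.9                             # γ: discount factor

# P[s][a] maps a next state s' to P(s' | s, a); each row sums to 1.
P = {
    "healthy": {
        "maintain": {"healthy": 0.95, "broken": 0.05},
        "ignore":   {"healthy": 0.70, "broken": 0.30},
    },
    "broken": {
        "maintain": {"healthy": 0.60, "broken": 0.40},
        "ignore":   {"healthy": 0.00, "broken": 1.00},
    },
}

# R[s][a] is the immediate expected reward R(s, a).
R = {
    "healthy": {"maintain": 8.0, "ignore": 10.0},
    "broken":  {"maintain": -5.0, "ignore": 0.0},
}
```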

MDP Process

  1. The agent starts in a state ( s ).
  2. The agent chooses an action ( a ) from the available action space.
  3. The environment responds by transitioning the agent to a new state ( s' ) according to ( P(s' | s, a) ).
  4. The agent receives a reward ( R(s, a) ).
  5. The process repeats, forming a trajectory of states, actions, and rewards.
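
A minimal simulation of this loop, reusing the hypothetical MDP sketched above and a uniformly random choice of actions, might look like the following; the `step` helper is an assumption for illustration, not part of any library.

```python
import random

# Reuses P, R, and actions from the MDP Components sketch above.
def step(state, action):
    """Sample the next state from P(s' | s, a) and return (s', R(s, a))."""
    next_states = list(P[state][action].keys())
    probs = list(P[state][action].values())
    next_state = random.choices(next_states, weights=probs, k=1)[0]
    return next_state, R[state][action]

state = "healthy"                                 # 1. start in some state s
trajectory = []
for t in range(5):
    action = random.choice(actions)               # 2. choose an action a (here: at random)
    next_state, reward = step(state, action)      # 3./4. environment transitions, emits reward
    trajectory.append((state, action, reward))
    state = next_state                            # 5. repeat from the new state

print(trajectory)
```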

Goal of MDP

The objective in an MDP is to find an optimal policy ( \pi^* ), which is a mapping from states to actions that maximizes the expected cumulative reward over time:

V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s \right]

where ( V^\pi(s) ) is the expected return when following policy ( \pi ) from state ( s ).
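
One standard way to compute ( \pi^* ) is value iteration, which repeatedly applies the Bellman optimality backup ( V(s) \leftarrow \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' | s, a) V(s') \right] ). Below is a short sketch over the hypothetical MDP defined earlier; the iteration limit and tolerance are arbitrary choices.

```python
# Value iteration sketch, reusing states, actions, gamma, P, and R from above.
V = {s: 0.0 for s in states}

for _ in range(200):                               # iterate until (approximately) converged
    new_V = {}
    for s in states:
        new_V[s] = max(
            R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
            for a in actions
        )
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-8:
        V = new_V
        break
    V = new_V

# Greedy policy extraction: pi*(s) = argmax_a [ R(s, a) + γ Σ_s' P(s'|s, a) V(s') ]
policy = {
    s: max(actions, key=lambda a: R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states))
    for s in states
}
print(V, policy)
```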

Applications

MDPs are widely used in:

  • Reinforcement Learning (RL): Where agents learn optimal policies through trial and error.
  • Robotics: For autonomous navigation and planning.
  • Finance: To model decision-making under uncertainty.
  • Healthcare: For optimal treatment planning.
  • Game AI: To optimize strategies in games.


How to Determine Whether a Problem Is an MDP

Here's how to determine if a problem is a Markov Decision Process (MDP) and whether Q-learning applies, with concrete examples:


Key Criteria for MDP (Q-learning applicability)

  1. Markov Property:
    The next state and reward depend only on the current state and action, not on previous states/actions.

  2. Fully Observable State:
    The agent has complete information about the current state.

  3. Discrete Time Steps:
    The problem can be divided into clear decision points.

  4. Defined Actions/States:
    There is an identifiable set of states and a finite set of possible actions (tabular Q-learning additionally requires both to be discrete and small enough to enumerate).
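
When these criteria hold, tabular Q-learning keeps a table of estimates ( Q(s, a) ) and, after each observed transition, applies the update ( Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] ). A minimal sketch of that update and an ε-greedy behaviour policy follows; the hyperparameter values are placeholders.

```python
from collections import defaultdict
import random

alpha, gamma, epsilon = 0.1, 0.9, 0.1        # learning rate, discount, exploration rate
Q = defaultdict(float)                       # Q[(state, action)] -> estimated return

def q_learning_update(state, action, reward, next_state, actions):
    """One tabular Q-learning backup: Q(s,a) += α [ r + γ max_a' Q(s',a') - Q(s,a) ]."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def epsilon_greedy(state, actions):
    """Behaviour policy: explore with probability ε, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```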


Positive Examples (MDP Problems)

  1. Grid World Navigation
    Why MDP:

    • Current position fully describes the state
    • Next position depends only on current location + chosen movement
    • Clear rewards (e.g., +10 for reaching goal)
  2. Chess Game AI
    Why MDP:

    • Current board position fully defines the state
    • Next state depends only on the current position + move chosen (treating the opponent's reply as part of the environment's dynamics)
    • Reward = checkmate (terminal state)
  3. Inventory Management
    Why MDP:

    • State = (current stock levels, known demand)
    • Next state depends only on current stock + restocking action
    • Reward = profit from sales
  4. Robot Path Planning
    Why MDP:

    • State = (current coordinates, sensor readings)
    • Next position depends on current location + movement command
    • Reward = reaching destination
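
To make the grid-world example (item 1 above) concrete, here is a hypothetical 4×4 grid whose step function depends only on the current cell and the chosen move, which is exactly the Markov property; the grid size, goal cell, and reward values are invented for illustration.

```python
# Hypothetical 4x4 grid world: the next state is a pure function of (state, action),
# so the current cell alone is a sufficient (Markovian) state.
GRID_SIZE = 4
GOAL = (3, 3)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def grid_step(state, action):
    """Return (next_state, reward, done) using only the current state and action."""
    dr, dc = MOVES[action]
    r = min(max(state[0] + dr, 0), GRID_SIZE - 1)   # clip to stay inside the grid
    c = min(max(state[1] + dc, 0), GRID_SIZE - 1)
    next_state = (r, c)
    if next_state == GOAL:
        return next_state, 10.0, True               # e.g., +10 for reaching the goal
    return next_state, -1.0, False                  # small step cost otherwise

print(grid_step((0, 0), "right"))   # ((0, 1), -1.0, False)
```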

Negative Examples (Non-MDP Problems)

  1. Poker Game
    Why Not MDP:

    • Hidden information (opponents' cards) breaks Markov property
    • Need to infer hidden states from history (→ POMDP)
    • Requires memory of past actions/bets
  2. Real-Time Strategy Games
    Why Not MDP:

    • Partial observability ("fog of war")
    • Next state depends on hidden enemy units/actions
    • Requires state estimation beyond current observations
  3. Stock Trading
    Why Not MDP (usually):

    • True state includes hidden market factors (e.g., investor sentiment)
    • Returns depend on price movements that the observable state does not determine (dynamics look non-Markovian)
    • Often requires handling continuous states
  4. Customer Service Chatbot
    Why Not MDP (basic implementations):

    • State should include conversation history
    • User intent often hidden (→ POMDP)
    • Next response depends on entire dialogue context

How to Test for MDP

  1. State Completeness Test:
    Ask: "Does the current state contain ALL information needed to predict the next state and reward?"

    • Yes → Markovian
    • No → Non-Markovian
  2. History Independence Test:
    If two different histories lead to the same state, do they have:

    • Same transition probabilities?
    • Same expected rewards?
    If yes to both → Markovian; if no → non-Markovian.
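
Given logged trajectories, a rough empirical version of the history independence test is to compare the next-state distribution conditioned on (state, action) with the one conditioned on (previous state, state, action); if the extra history barely changes the distribution, the state looks approximately Markovian. A hypothetical sketch, assuming transitions have been logged as (prev_state, state, action, next_state) tuples:

```python
from collections import Counter, defaultdict

def next_state_distributions(transitions):
    """transitions: iterable of (prev_state, state, action, next_state) tuples from
    logged episodes. Returns next-state distributions conditioned on (state, action)
    and on (prev_state, state, action)."""
    by_sa = defaultdict(Counter)
    by_history_sa = defaultdict(Counter)
    for prev, s, a, s_next in transitions:
        by_sa[(s, a)][s_next] += 1
        by_history_sa[(prev, s, a)][s_next] += 1

    def normalize(counts):
        total = sum(counts.values())
        return {s_next: n / total for s_next, n in counts.items()}

    return ({key: normalize(c) for key, c in by_sa.items()},
            {key: normalize(c) for key, c in by_history_sa.items()})

# If, for every prefix `prev`, the (prev, s, a) distribution is close to the plain
# (s, a) distribution, the extra history adds nothing and the state looks Markovian.
```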

When Q-Learning Might Still Work

Even if a problem isn't perfectly Markovian, Q-learning can sometimes be effective with:

  1. State Engineering: Add historical information to the state
    (e.g., include last 3 sensor readings in robot navigation)
  2. Function Approximation: Use neural networks (Deep Q-Networks) to generalize over large state spaces; recurrent variants can also absorb some partial observability
  3. Domain Simplification: Treat partially observable problems as MDPs (common in practice)
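
A common way to implement the state-engineering idea from item 1 is to stack the last few observations into one augmented state, as in this hypothetical wrapper; the window length and observation format are assumptions for illustration.

```python
from collections import deque

class HistoryAugmentedState:
    """Builds an augmented state from the last `history_len` raw observations.
    Stacking recent observations can restore an approximately Markovian state
    for mildly non-Markovian problems."""

    def __init__(self, history_len=3):
        self.history = deque(maxlen=history_len)

    def update(self, observation):
        """Record the newest observation and return the stacked state."""
        self.history.append(observation)
        # A tuple of observations is hashable, so it can index a tabular Q function.
        return tuple(self.history)

# Hypothetical usage: feed each raw sensor reading in and use the returned tuple
# as the Q-learning state.
state_builder = HistoryAugmentedState(history_len=3)
augmented_state = state_builder.update((0.4, 1.2))   # made-up sensor reading
```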

Summary Table

| Scenario | MDP? | Why? | Q-learning Applicable? |
|---|---|---|---|
| Chess | Yes | Full board state observable | ✅ Yes |
| Poker | No | Hidden cards | ❌ Needs POMDP |
| Drone Navigation | Yes | GPS + sensors give full state | ✅ Yes |
| Weather Prediction | No | Depends on past trends | ❌ Use RNNs/LSTMs |
| Elevator Control | Yes | Current floor + requests | ✅ Yes |
| Medical Diagnosis | No | Hidden symptoms/patient history | ❌ Needs POMDP |

This framework helps you assess whether Q-learning is appropriate or if you need more advanced techniques (like Deep Q-Networks or POMDP solvers).