SARSA
What is SARSA?
SARSA is an on-policy reinforcement learning (RL) algorithm used to learn an optimal policy for decision-making problems. The name SARSA stands for the sequence of events it uses in its update rule:
- S → Current state (s)
- A → Current action (a)
- R → Reward received after taking action (r)
- S' → Next state (s')
- A' → Next action (a') chosen according to the current policy
Thus, SARSA follows the pattern:
(S, A, R, S', A')
What Does SARSA Do?
SARSA helps an agent learn a good policy (a mapping from states to actions) by iteratively updating Q-values based on the agent's actual experience while interacting with the environment.
How Does SARSA Work?
SARSA learns a Q-value function Q(s, a) that estimates the expected cumulative reward of taking action a in state s and then following the current policy.
The process follows these steps:
- Initialize the Q-table Q(s, a) with arbitrary values.
- Choose an action A in state S using an ε-greedy policy (exploration vs. exploitation).
- Take action A, observe reward R and next state S'.
- Choose next action A' using the same ε-greedy policy.
- Update the Q-value for Q(s, a) using the SARSA update rule:
Q(s, a) = Q(s, a) + α[R + γQ(s', a') - Q(s, a)]
where:
- α (learning rate) controls how much new information overrides the old value.
- γ (discount factor) determines the importance of future rewards.
- Q(s', a') is the estimated future value based on the next action a'.
- Repeat until convergence (i.e., until the Q-values stabilize); see the code sketch just below.
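The loop above maps almost line-for-line onto code. Below is a minimal tabular SARSA sketch in Python; the environment interface (`env.reset()` returning a state index and `env.step(action)` returning `(next_state, reward, done)`), along with the names `n_states` and `n_actions`, is an assumed Gym-style convention, not something prescribed by SARSA itself.

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))                      # 1. initialize Q-table
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, epsilon, rng)    # 2. choose A (epsilon-greedy)
        done = False
        while not done:
            s_next, r, done = env.step(a)                    # 3. take A, observe R and S'
            a_next = epsilon_greedy(Q, s_next, n_actions, epsilon, rng)  # 4. choose A'
            # 5. SARSA update: Q(s,a) += alpha * [R + gamma*Q(s',a') - Q(s,a)]
            td_target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s, a = s_next, a_next
    return Q
```

Multiplying the bootstrap term by `(not done)` zeroes out Q(s', a') at terminal states, a common convention that the step list above leaves implicit.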
The Math Behind SARSA
SARSA updates Q-values with a temporal-difference (TD) update based on the Bellman equation:
Q(s, a) = Q(s, a) + α[R + γQ(s', a') - Q(s, a)]
Breaking it Down:
- Q(s, a): Current Q-value (expected return for taking action a in state s).
- R: Immediate reward received after taking action a.
- Q(s', a'): Estimated Q-value for the next state-action pair, following the same policy.
- R + γQ(s', a'): The total expected reward considering both immediate and future rewards.
- R + γQ(s', a') - Q(s, a): The TD (temporal-difference) error, i.e., the gap between the new estimate and the old one.
- α scales the size of the update.
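Written as a function, the update is just a few lines; this is a sketch with illustrative argument names (`q_sa` for Q(s, a), `q_next` for Q(s', a')):

```python
def sarsa_update(q_sa, reward, q_next, alpha=0.1, gamma=0.9):
    """Return the new Q(s, a) given the reward R and Q(s', a')
    for the action actually chosen in the next state."""
    td_target = reward + gamma * q_next   # R + gamma * Q(s', a')
    td_error = td_target - q_sa           # TD error
    return q_sa + alpha * td_error        # move Q(s, a) a fraction alpha toward the target
```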
Example: SARSA in a Grid World
Imagine a 4×4 grid world where an agent navigates from Start (S) to Goal (G) while avoiding obstacles.
S . . .
. X . .
. . X .
. . . G
- S: Start state
- G: Goal state (reward = +1)
- X: Obstacles (reward = -1)
- .: Empty spaces (reward = 0)
- Agent can move Up, Down, Left, Right (a small environment sketch follows this list).
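One way to encode this grid is a tiny environment class, sketched below in Python. The class name `GridWorld4x4`, the action encoding, and the choice to end an episode on reaching either G or an obstacle are assumptions for illustration; the description above does not pin these down.

```python
class GridWorld4x4:
    """4x4 grid: 'S' start, 'G' goal (+1), 'X' obstacle (-1), '.' empty (0)."""
    LAYOUT = ["S...",
              ".X..",
              "..X.",
              "...G"]
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # Up, Down, Left, Right

    def reset(self):
        self.row, self.col = 0, 0              # start at S in the top-left corner
        return self.row * 4 + self.col         # flatten (row, col) to a state index 0..15

    def step(self, action):
        dr, dc = self.MOVES[action]
        # a move that would leave the grid keeps the agent in place
        self.row = min(max(self.row + dr, 0), 3)
        self.col = min(max(self.col + dc, 0), 3)
        cell = self.LAYOUT[self.row][self.col]
        reward = {"G": 1.0, "X": -1.0}.get(cell, 0.0)
        done = cell in "GX"                    # assumption: obstacles also end the episode
        return self.row * 4 + self.col, reward, done
```

This matches the interface assumed by the training-loop sketch earlier, so `sarsa(GridWorld4x4(), n_states=16, n_actions=4)` would learn a Q-table for this grid.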
1. Initialization
- The agent initializes the Q-table Q(s, a) with arbitrary values.
- It selects actions using an ε-greedy policy.
2. Agent's First Step
- The agent starts at S and selects an action A (e.g., Right).
- It moves Right, landing in a new state S'.
- The environment gives a reward R = 0 (no obstacle, not the goal).
- The agent chooses the next action A' (e.g., Down).
3. Q-value Update
Using the SARSA update rule:
Q(S, Right) = Q(S, Right) + α[0 + γQ(S', Down) - Q(S, Right)]
- If α = 0.1 and γ = 0.9, then:
- Suppose Q(S, Right) = 0.5
- Suppose Q(S', Down) = 0.6
- Then:
Q(S, Right) = 0.5 + 0.1 × (0 + 0.9 × 0.6 - 0.5) = 0.5 + 0.1 × (0.54 - 0.5) = 0.5 + 0.1 × 0.04 = 0.504
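The same arithmetic in a few lines of Python, using the numbers assumed above:

```python
alpha, gamma = 0.1, 0.9
q_s_right, q_sprime_down, reward = 0.5, 0.6, 0.0        # current estimates and observed reward
td_error = reward + gamma * q_sprime_down - q_s_right   # 0.54 - 0.5 = 0.04
q_s_right += alpha * td_error
print(round(q_s_right, 3))                              # 0.504
```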
Over many episodes, the Q-values converge and the agent learns a policy that is optimal given its exploration strategy.
Key Differences Between SARSA and Q-Learning
Feature | SARSA (On-Policy) | Q-Learning (Off-Policy) |
---|---|---|
Update Rule | Uses the Q-value of the next action a' actually chosen by the policy: Q(s', a'). | Uses the maximum Q-value over next actions: max_a' Q(s', a'). |
Exploration Impact | Considers the effect of exploration. | Assumes optimal actions, ignoring exploration. |
Behavior | Learns a safer policy, avoiding risky actions. | Learns the absolute optimal policy, even if risky. |
Best For | Stochastic environments (uncertainty, real-world applications). | Deterministic environments (clear optimal strategy). |
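In code, the two algorithms differ only in the bootstrap term of the TD target. The sketch below assumes a NumPy Q-table and that `a_next` is the action the ε-greedy behavior policy actually selected; everything else in the update is identical.

```python
import numpy as np

Q = np.zeros((16, 4))                 # example table: 16 states, 4 actions
s_next, a_next = 5, 1                 # next state and the action actually chosen there
r, gamma = 0.0, 0.9

sarsa_target     = r + gamma * Q[s_next, a_next]   # on-policy: the action actually taken
qlearning_target = r + gamma * np.max(Q[s_next])   # off-policy: the greedy (max) action
```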
Conclusion
SARSA is a reinforcement learning algorithm that:
- Learns on-policy, meaning it considers the actual policy being followed.
- Uses a TD-learning approach to update Q-values based on the next action it actually takes.
- Tends to learn safer policies compared to Q-learning because it accounts for exploration.
- Works well in stochastic environments where future actions may not always be optimal.
In short, SARSA is best when the environment is uncertain and exploration matters, whereas Q-learning is better when aiming for the absolute optimal policy in deterministic settings.
Comparing Q-learning and SARSA for Grid World Navigation
Comparison Table
Feature | Q-learning (Off-Policy) | SARSA (On-Policy) |
---|---|---|
Update Rule | $Q(s, a) = Q(s, a) + \alpha [R + \gamma \max_{a'} Q(s', a') - Q(s, a)]$ | $Q(s, a) = Q(s, a) + \alpha [R + \gamma Q(s', a') - Q(s, a)]$ |
Learning Type | Off-Policy | On-Policy |
Next Action (for update) | Uses the maximum Q-value in the next state, $\max_{a'} Q(s', a')$, regardless of the action actually taken. | Uses the Q-value of the action actually taken in the next state, $Q(s', a')$, following the current policy. |
Policy Learned | Learns the optimal (greedy) policy directly, $\arg\max_a Q(s, a)$. | Learns a policy that is close to optimal while following the exploration policy. |
Convergence | Generally converges to the optimal policy in deterministic environments. | Converges to a policy that is optimal given the exploration policy. May not always reach the absolute optimal policy if exploration is significant. |
Behavior in Stochastic Environments | Can be more aggressive and may learn a policy that is optimal in expectation but risky in stochastic environments. | Tends to be more cautious and learns a safer policy, considering the exploration policy. |
Use Case | When you want to find the absolute optimal policy, regardless of the exploration strategy. | When you want to learn a policy that is safe and reflects the agent's actual behavior, especially when exploration is important or the environment is stochastic. |
Explanation of Key Differences
Off-Policy vs. On-Policy
- Q-learning (Off-Policy):
  - It’s called "off-policy" because it learns about the optimal policy regardless of the actions the agent actually takes.
  - It looks ahead to the best possible action in the next state, $\max_{a'} Q(s', a')$, to update its Q-value, assuming it will always take the optimal action in the future.
- SARSA (On-Policy):
  - It’s "on-policy" because it learns about the policy it is currently following.
  - It uses the Q-value of the action actually taken in the next state, $Q(s', a')$, to update its Q-value.
  - This means SARSA’s learning is influenced by its exploration strategy.
Impact on Learning
- Q-learning:
  - The agent always aims for the best possible future reward, which can make it learn the theoretically optimal policy faster in deterministic environments.
  - However, in stochastic environments or when exploration is noisy, it might learn a policy that is too aggressive or risky because it assumes optimal future actions.
- SARSA:
  - The agent learns to account for the exploration policy.
  - If the exploration policy sometimes takes random actions, SARSA will learn a safer policy, avoiding actions that might lead to bad outcomes during exploration.
  - It might converge to a slightly less optimal but safer policy than Q-learning, especially in scenarios with uncertainty.
In Grid World Navigation
- For a deterministic grid world like in our example:
- Both Q-learning and SARSA will likely find a path to the goal.
- Q-learning might learn to take more direct routes, assuming it will always be able to execute the optimal action.
- SARSA might learn a slightly more conservative route if its exploration policy leads it to sometimes take suboptimal actions. It learns to account for the possibility of not always taking the best action due to exploration.
Summary
- Q-learning aims for the absolute best policy, assuming optimal future actions.
- SARSA learns a policy that is optimal given the agent’s actual behavior (including exploration).
- SARSA is often considered more suitable for real-world scenarios where exploration and safety are important.
- Q-learning can be more efficient in finding the theoretical optimum in simpler, deterministic environments.