rl_gridworld.py
A much more complete version of this program is available at:
https://gist.github.com/chunhualiao/34df94d1f1f8bc33a2745f6e58205753
Q-Learning Algorithm: How It Works
Q-Table
The agent maintains a Q-table: a lookup table that stores the estimated "quality" (Q-value) of taking each action in each state. Initially, all values in the Q-table are zero.
Think of it like a cheat sheet where the agent will write down what it learns about each situation (state) and action.
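As a minimal sketch (the grid size and action count below are illustrative, not taken from the gist), the Q-table can be a NumPy array of zeros indexed by state and action:

```python
import numpy as np

# Hypothetical sizes: a 5x5 grid flattened to 25 states, 4 actions (up, down, left, right)
n_states = 5 * 5
n_actions = 4

# All Q-values start at zero: the agent knows nothing yet
q_table = np.zeros((n_states, n_actions))
```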
Exploration vs. Exploitation
Exploration
- At the beginning, the agent explores the grid world by randomly choosing actions (up, down, left, right), even though it doesn't yet know whether they are good or bad.
- This is controlled by the exploration rate ($\epsilon$), which starts high (1.0) and gradually decreases.
- Imagine the agent is trying to figure out the maze by randomly trying different paths.
Exploitation
- As the agent learns, it starts to exploit its knowledge.
- It looks at its Q-table and chooses the action that it currently believes is the best in the current state (the action with the highest Q-value).
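The usual way to balance these two behaviors is an epsilon-greedy rule. Here is a minimal sketch (the function name and the `q_table` layout follow the earlier example and are illustrative, not the exact code from the gist):

```python
import random
import numpy as np

def choose_action(q_table, state, epsilon):
    """Pick a random action with probability epsilon (explore),
    otherwise the highest-valued action for this state (exploit)."""
    n_actions = q_table.shape[1]
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore
    return int(np.argmax(q_table[state]))    # exploit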
Learning Process (Q-Learning Update)
In each step, the agent takes an action in a state and observes:
- reward: It gets a reward of 1 if it reaches the goal, and 0 otherwise.
- next_state: It moves to a new state.
Then, it updates its Q-table using the Q-learning formula:
$$Q(\text{state}, \text{action}) = Q(\text{state}, \text{action}) + \alpha \times \left( \text{reward} + \gamma \times \max(Q(\text{next state}, \text{all actions})) - Q(\text{state}, \text{action}) \right)$$
Breaking Down the Formula
- $Q(\text{state}, \text{action})$: The current Q-value for the state-action pair we just took.
- $\alpha$ (learning rate): Controls how much we update the Q-value in each step. A smaller value means slower learning.
- reward: The immediate reward received (0 or 1 in this case).
- $\gamma$ (discount factor): Determines how much importance we give to future rewards.
- A value close to 1 means we consider future rewards heavily.
- $\max(Q(\text{next state}, \text{all actions}))$: The maximum Q-value for all possible actions in the next state.
- Represents the best expected future reward from the next state.
- predict: The current guess of how good this action is in this state.
$$\text{predict} = Q(\text{state}, \text{action})$$
- target: The improved guess of how good this action should be, considering the immediate reward and the best future reward from the next state.
$$\text{target} = \text{reward} + \gamma \times \max(Q(\text{next state}, \text{all actions}))$$
- Q-value update: Adjusts the Q-value to move it closer to the target.
  `self.q_table[state, action] = self.q_table[state, action] + alpha * (target - predict)`
- The difference ($\text{target} - \text{predict}$) is the "error" in our current guess.
- We adjust our guess in the direction of this error, scaled by the learning rate.
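Putting the predict/target pieces together, a single update step could look like the sketch below. This is an illustration under the assumptions above (NumPy Q-table, default `alpha` and `gamma` chosen arbitrarily), not the exact code from the gist:

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Move Q(state, action) toward reward + gamma * best future value."""
    predict = q_table[state, action]                       # current guess
    target = reward + gamma * np.max(q_table[next_state])  # improved guess
    q_table[state, action] = predict + alpha * (target - predict)
```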
Exploration Decay
Over time, the exploration rate decreases (using `exploration_decay_rate`).
This means the agent starts to explore less and exploit its learned Q-values more as it becomes more confident in its knowledge.
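One common way to apply this is to shrink epsilon a little after every episode, down to some minimum. The schedule below is a sketch with illustrative values; the gist may use a different decay formula:

```python
epsilon = 1.0                     # start fully exploratory
min_epsilon = 0.01                # keep a small amount of exploration
exploration_decay_rate = 0.995    # illustrative value

for episode in range(1000):
    # ... run one episode with epsilon-greedy action selection ...
    epsilon = max(min_epsilon, epsilon * exploration_decay_rate)
```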
In Simple Terms
Imagine the agent is trying to find the best path in a maze:
- Exploration Phase:
  - It starts by randomly trying different directions.
  - If it gets closer to the goal (receives a reward), it remembers that direction was good and increases the "goodness" value (Q-value) for that action in that location (state) in its cheat sheet (Q-table).
- Updating Knowledge:
  - If it moves in a direction and doesn't get a reward, or moves further away, it slightly decreases the "goodness" value for that direction.
- Exploitation Phase:
  - As it keeps trying, it updates its cheat sheet based on the rewards it gets.
  - Eventually, it starts using its cheat sheet to choose the directions it believes are best to reach the goal (exploitation), because those directions have higher "goodness" values in its cheat sheet.
- Optimal Path Learning:
  - Over time, it becomes more and more likely to choose the best path as its cheat sheet (Q-table) becomes more accurate.
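For completeness, here is how the pieces above could fit into one training loop. The environment interface (`env.reset()`, `env.step(action)` returning `next_state, reward, done`) is an assumption in the style of simple gridworld environments, not taken verbatim from the gist:

```python
import random
import numpy as np

def train(env, n_states, n_actions, episodes=1000,
          alpha=0.1, gamma=0.99, epsilon=1.0,
          min_epsilon=0.01, exploration_decay_rate=0.995):
    q_table = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = int(np.argmax(q_table[state]))
            next_state, reward, done = env.step(action)
            # Q-learning update toward reward + discounted best future value
            target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] += alpha * (target - q_table[state, action])
            state = next_state
        # Explore less as training progresses
        epsilon = max(min_epsilon, epsilon * exploration_decay_rate)
    return q_table
```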