rl_gridworld.py

Related pages: reinforcement learning, rl_gridworld.py:understanding

A much more complete program:
https://gist.github.com/chunhualiao/34df94d1f1f8bc33a2745f6e58205753

Q-Learning Algorithm: How It Works

Q-Table

The agent maintains a Q-table: a lookup table that stores the "quality" (expected long-term value) of taking each action in each state. Initially, all values in the Q-table are zero.

Think of it like a cheat sheet where the agent will write down what it learns about each situation (state) and action.
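
A minimal sketch of this setup, assuming a NumPy-based tabular agent on a 5x5 grid (the actual grid size and variable names in the script may differ):

```python
import numpy as np

n_states = 5 * 5    # assumed: a 5x5 grid flattened to 25 discrete states
n_actions = 4       # up, down, left, right

# The agent's "cheat sheet": one row per state, one column per action, all zeros to start
q_table = np.zeros((n_states, n_actions))
```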


Exploration vs. Exploitation

Exploration

  • At the beginning, the agent explores the grid world by randomly choosing actions (up, down, left, right), even if it doesn't know if they are good or bad.
  • This is controlled by the exploration rate ($\epsilon$), which starts high (1.0) and gradually decreases.
  • Imagine the agent is trying to figure out the maze by randomly trying different paths.

Exploitation

  • As the agent learns, it starts to exploit its knowledge.
  • It looks at its Q-table and chooses the action that it currently believes is the best in the current state (the action with the highest Q-value).
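
The choice between exploring and exploiting is typically implemented with an epsilon-greedy rule. A sketch, assuming a NumPy Q-table and the exploration rate described above (the function name and signature are illustrative, not necessarily the script's):

```python
import numpy as np

def choose_action(q_table, state, exploration_rate, n_actions=4):
    """Explore with probability exploration_rate, otherwise exploit the best known action."""
    if np.random.random() < exploration_rate:
        return np.random.randint(n_actions)      # explore: pick a random action
    return int(np.argmax(q_table[state]))        # exploit: pick the action with the highest Q-value
```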

Learning Process (Q-Learning Update)

At each step, the agent takes an action in the current state and observes:

  • reward: It gets a reward of 1 if it reaches the goal, and 0 otherwise.
  • next_state: It moves to a new state.
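
For context, a gridworld step consistent with this reward scheme might look like the sketch below (the grid size, state encoding, action order, and goal location are assumptions, not necessarily what the gist uses):

```python
def step(state, action, grid_size=5):
    """Apply an action to a flattened (row, col) state; the reward is 1 only at the goal cell."""
    row, col = divmod(state, grid_size)
    if action == 0:
        row = max(row - 1, 0)                  # up
    elif action == 1:
        row = min(row + 1, grid_size - 1)      # down
    elif action == 2:
        col = max(col - 1, 0)                  # left
    else:
        col = min(col + 1, grid_size - 1)      # right
    next_state = row * grid_size + col
    goal_state = grid_size * grid_size - 1     # assumed: goal in the bottom-right cell
    reward = 1 if next_state == goal_state else 0
    return next_state, reward, next_state == goal_state
```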

Then, it updates its Q-table using the Q-learning formula:

$$Q(\text{state}, \text{action}) = Q(\text{state}, \text{action}) + \alpha \times \left( \text{reward} + \gamma \times \max(Q(\text{next state}, \text{all actions})) - Q(\text{state}, \text{action}) \right)$$
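
For example, suppose $\alpha = 0.1$, $\gamma = 0.9$, the agent just reached the goal (reward = 1), and all Q-values are still zero (these numbers are illustrative, not necessarily the script's settings). The update for that state-action pair is:

$$Q(\text{state}, \text{action}) = 0 + 0.1 \times \left(1 + 0.9 \times 0 - 0\right) = 0.1$$

So the Q-value rises from 0 to 0.1, and repeated successful visits pull it further toward its true value.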

Breaking Down the Formula

  • $Q(\text{state}, \text{action})$: The current Q-value for the state-action pair we just took.
  • $\alpha$ (learning rate): Controls how much we update the Q-value in each step. A smaller value means slower learning.
  • reward: The immediate reward received (0 or 1 in this case).
  • $\gamma$ (discount factor): Determines how much importance we give to future rewards.
    • A value close to 1 means we consider future rewards heavily.
  • $\max(Q(\text{next state}, \text{all actions}))$: The maximum Q-value for all possible actions in the next state.
    • Represents the best expected future reward from the next state.
  • predict: The current guess of how good this action is in this state.

$$\text{predict} = Q(\text{state}, \text{action})$$

  • target: The improved guess of how good this action should be, considering the immediate reward and the best future reward from the next state.

$$\text{target} = \text{reward} + \gamma \times \max(Q(\text{next state}, \text{all actions}))$$

  • Q-value update: Adjusts the Q-value to move it closer to the target.

`self.q_table[state, action] = self.q_table[state, action] + alpha * (target - predict)`

  • The difference ($\text{target} - \text{predict}$) is the "error" in our current guess.
  • We adjust our guess in the direction of this error, scaled by the learning rate.
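
Putting predict, target, and the update together, one step of learning might be written as the following sketch (the function signature and default hyperparameters are assumptions; the gist's class-based code stores the table as self.q_table, as shown above):

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      learning_rate=0.1, discount_factor=0.9):
    """One Q-learning update: move Q(state, action) toward the target."""
    predict = q_table[state, action]                                  # current guess
    target = reward + discount_factor * np.max(q_table[next_state])   # improved guess
    q_table[state, action] = predict + learning_rate * (target - predict)
```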

Exploration Decay

Over time, the exploration rate decreases (using exploration_decay_rate).
This means the agent starts to explore less and exploit its learned Q-values more as it becomes more confident in its knowledge.
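
One common way to implement this decay is an exponential schedule applied after each episode (the formula and the specific rates below are assumptions; the script's exploration_decay_rate may be applied differently):

```python
import numpy as np

max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001   # assumed value; the script defines its own

for episode in range(1000):
    # ... run one episode of interaction and Q-updates here ...
    # then shrink the exploration rate exponentially toward its minimum
    exploration_rate = min_exploration_rate + (
        max_exploration_rate - min_exploration_rate
    ) * np.exp(-exploration_decay_rate * episode)
```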


In Simple Terms

Imagine the agent is trying to find the best path in a maze:

  1. Exploration Phase:

    • It starts by randomly trying different directions.
    • If a move pays off (it reaches the goal and earns a reward), it increases the "goodness" value (Q-value) for that action in that location (state) in its cheat sheet (Q-table).
  2. Updating Knowledge:

    • If a move earns no reward and doesn't lead toward locations it already values, the "goodness" value for that direction stays low or is nudged downward.
  3. Exploitation Phase:

    • As it keeps trying, it updates its cheat sheet based on the rewards it gets.
    • Eventually, it starts using its cheat sheet to choose the directions it believes are best to reach the goal (exploitation), because those directions have higher "goodness" values in its cheat sheet.
  4. Optimal Path Learning:

    • Over time, it becomes more and more likely to choose the best path as its cheat sheet (Q-table) becomes more accurate.
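
Tying all of the above together, a complete (and deliberately small) training loop might look like the sketch below. The grid size, start/goal cells, hyperparameters, and variable names are assumptions for illustration; the gist linked at the top of this page is the authoritative version.

```python
import numpy as np

grid_size = 5                          # assumed 5x5 grid; the script may differ
n_states, n_actions = grid_size * grid_size, 4
goal = n_states - 1                    # assumed: goal in the bottom-right cell
alpha, gamma = 0.1, 0.9                # illustrative learning rate and discount factor
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.001

q_table = np.zeros((n_states, n_actions))

def step(state, action):
    """Move within the grid; the reward is 1 only when the goal is reached."""
    row, col = divmod(state, grid_size)
    if action == 0:
        row = max(row - 1, 0)                  # up
    elif action == 1:
        row = min(row + 1, grid_size - 1)      # down
    elif action == 2:
        col = max(col - 1, 0)                  # left
    else:
        col = min(col + 1, grid_size - 1)      # right
    next_state = row * grid_size + col
    reward = 1 if next_state == goal else 0
    return next_state, reward, next_state == goal

for episode in range(2000):
    state = 0                                  # start in the top-left cell (assumed)
    for _ in range(200):                       # cap the number of steps per episode
        # epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(q_table[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update
        target = reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state
        if done:
            break
    # decay exploration after each episode
    epsilon = eps_min + (1.0 - eps_min) * np.exp(-eps_decay * episode)

# The learned greedy policy: the best action index for each cell
print(np.argmax(q_table, axis=1).reshape(grid_size, grid_size))
```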