ForagerRL_step5 - gama-platform/gama GitHub Wiki
By Killian Trouillet
This is the key step. We replace random movement with the Q-Learning algorithm. The agent now makes decisions based on its Q-Table and improves its policy after every step using the Bellman equation.
- Definition of RL hyperparameters: learning rate (α), discount factor (γ), epsilon (ε)
- Implementation of epsilon-greedy action selection
- Implementation of the Q-Learning update rule (Bellman equation)
- Decay of epsilon over episodes (exploration → exploitation)
We define the RL parameters as global variables so they can be adjusted via the experiment interface:
global {
// ... (previous variables)
float alpha <- 0.1; // Learning rate: how fast the agent learns
float gamma_rl <- 0.95; // Discount factor: importance of future rewards
float epsilon <- 1.0; // Exploration rate: probability of random action
float epsilon_min <- 0.01; // Minimum exploration rate
float epsilon_decay <- 0.995; // Multiply epsilon by this after each episode
}
- α (alpha): Controls how much new information overrides old information. Higher = faster learning but less stable.
- γ (gamma_rl): How much the agent values future rewards vs. immediate ones. 0.95 means future rewards are almost as important as immediate ones.
- ε (epsilon): The probability of taking a random action instead of the best known action. Starts at 1.0 (100% random) and decays over time.
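To build intuition for alpha before wiring it into the model: every Q update moves the old estimate a fraction alpha of the way toward a new target. A standalone Python illustration (the fixed target of 10.0 is arbitrary, not part of the model):

```python
def track(alpha, target=10.0, steps=20):
    """Repeatedly nudge an estimate toward a fixed target, as the Q update does."""
    q = 0.0
    for _ in range(steps):
        q = q + alpha * (target - q)  # same shape as the Q-Learning update rule
    return q

print(round(track(0.1), 2))  # ~8.78: alpha = 0.1 is still converging after 20 updates
print(round(track(0.5), 2))  # ~10.0: alpha = 0.5 has essentially locked on
```

A higher alpha reacts faster but also chases noise in the rewards, which is why a small value like 0.1 is a common default for tabular Q-Learning.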
This is the core decision-making function. With probability ε, the agent explores (picks a random action). Otherwise, it exploits (picks the action with the highest Q-value):
string choose_action {
string s <- get_state();
// Explore: random action
if (flip(epsilon)) {
return action_list[rnd(3)]; // rnd(3) draws an integer in 0..3 (inclusive)
}
// Exploit: best known action
string best_action <- action_list[0];
float best_q <- get_q(s, action_list[0]);
loop a over: action_list {
float q_val <- get_q(s, a);
if (q_val > best_q) {
best_q <- q_val;
best_action <- a;
}
}
return best_action;
}
- `flip(epsilon)`: returns `true` with probability `epsilon`.
- `rnd(3)`: returns a random integer between 0 and 3 inclusive, so it picks one of the four actions uniformly.
After every action, we update the Q-value for the (state, action) pair we just used. The formula is:
Q(s,a) ← Q(s,a) + α × [r + γ × max Q(s', a') − Q(s,a)]
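Plugging in the model's defaults makes the formula concrete. A quick numeric check in Python (the inputs are illustrative: a fresh Q of 0, the -1 step penalty, and a best next-state Q of 2.0):

```python
def q_update(old_q, r, max_next_q, alpha=0.1, gamma=0.95):
    """One Q-Learning update: Q(s,a) += alpha * (r + gamma * max Q(s',a') - Q(s,a))."""
    return old_q + alpha * (r + gamma * max_next_q - old_q)

# A plain move (reward -1) toward a state whose best action is worth 2.0:
print(q_update(0.0, -1.0, 2.0))   # ~0.09: the future value outweighs the step cost
# Reaching the food (reward +100) from a fresh Q of 0:
print(q_update(0.0, 100.0, 0.0))  # ~10.0: one tenth of the reward, because alpha = 0.1
```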
In GAML:
action update_q_value (string s, string a, float r, string s_next) {
float old_q <- get_q(s, a);
// Find the max Q-value achievable from the new state
float max_next_q <- get_q(s_next, action_list[0]);
loop act over: action_list {
float q_val <- get_q(s_next, act);
if (q_val > max_next_q) {
max_next_q <- q_val;
}
}
// Apply the Bellman equation
float new_q <- old_q + alpha * (r + gamma_rl * max_next_q - old_q);
q_table[s + "::" + a] <- new_q;
}
The `random_move` reflex from the previous step becomes `act`, which runs the full Q-Learning pipeline on every simulation cycle:
reflex act {
string current_state <- get_state();
string action_taken <- choose_action();
// ... (movement and reward logic as before)
string new_state <- get_state();
do update_q_value(current_state, action_taken, step_reward, new_state);
episode_reward <- episode_reward + step_reward;
}
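For readers who want to run the whole pipeline outside GAMA, here is a compact Python sketch of the same tabular Q-Learning loop. The grid layout, rewards, and hyperparameters mirror this tutorial; the function name, episode count, and seed are illustrative choices, not part of the GAML model:

```python
import random

ACTIONS = ["up", "right", "down", "left"]
MOVES = {"up": (0, -1), "right": (1, 0), "down": (0, 1), "left": (-1, 0)}
OBSTACLES = {(2, 2), (3, 2), (2, 3), (6, 4), (7, 4), (7, 5)}
FOOD, SIZE = (9, 9), 10

def train(episodes=300, alpha=0.1, gamma=0.95,
          eps=1.0, eps_min=0.01, eps_decay=0.995, seed=42):
    rng = random.Random(seed)
    q = {}  # (state, action) -> value, like the "s::a" keys in q_table
    for _ in range(episodes):
        s, found, steps = (0, 0), False, 0
        while not found and steps < 200:
            steps += 1
            # Epsilon-greedy: explore with probability eps, else exploit.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q.get((s, act), 0.0))
            dx, dy = MOVES[a]
            nx, ny = s[0] + dx, s[1] + dy
            r, s_next = -1.0, s
            if not (0 <= nx < SIZE and 0 <= ny < SIZE) or (nx, ny) in OBSTACLES:
                r = -5.0  # bumped into a wall or obstacle: stay put
            else:
                s_next = (nx, ny)
                if s_next == FOOD:
                    r, found = 100.0, True
            # Q-Learning update
            max_next = max(q.get((s_next, act), 0.0) for act in ACTIONS)
            q[(s, a)] = q.get((s, a), 0.0) + alpha * (r + gamma * max_next - q.get((s, a), 0.0))
            s = s_next
        eps = max(eps_min, eps * eps_decay)  # decay once per episode
    return q
```

Because finding the food ends the episode, the food cell is never used as a source state, so its Q-values stay at 0 and every learned value remains strictly below the +100 terminal reward.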
At the end of each episode, we reduce epsilon to make the agent explore less and exploit more:
// In end_episode action:
if (epsilon > epsilon_min) {
epsilon <- epsilon * epsilon_decay;
}
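With the defaults above (start 1.0, decay 0.995, floor 0.01), the schedule takes on the order of nine hundred episodes to reach the floor. A quick back-of-the-envelope check in Python:

```python
import math

epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

# Smallest whole number of episodes n with epsilon * decay^n <= epsilon_min:
episodes = math.ceil(math.log(epsilon_min / epsilon) / math.log(epsilon_decay))
print(episodes)  # about 920 episodes before epsilon bottoms out
```

If you want the agent to settle into exploitation sooner, lower `epsilon_decay` slightly; the GUI slider added below ranges from 0.9 to 1.0 for exactly this kind of experimentation.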
We expose the hyperparameters in the GUI so users can experiment:
experiment smart_forager type: gui {
parameter "Learning Rate (α)" var: alpha min: 0.01 max: 1.0 category: "RL";
parameter "Discount Factor (γ)" var: gamma_rl min: 0.0 max: 1.0 category: "RL";
parameter "Initial Epsilon (ε)" var: epsilon min: 0.0 max: 1.0 category: "RL";
parameter "Epsilon Decay" var: epsilon_decay min: 0.9 max: 1.0 category: "RL";
// ...
}

Below is the complete model for this step:
/**
* Name: SmartForager - Step 5: Q-Learning Algorithm
* Author: Killian Trouillet
* Description: This fifth step replaces random movement with the Q-Learning algorithm.
* The forager now uses epsilon-greedy action selection and updates its Q-Table
* after each step using the Bellman equation.
* Tags: reinforcement-learning, q-learning, epsilon-greedy, tutorial
*/
model SmartForager
global {
int grid_size <- 10;
int food_x <- 9;
int food_y <- 9;
list<point> obstacle_positions <- [{2,2}, {3,2}, {2,3}, {6,4}, {7,4}, {7,5}];
int max_steps_per_episode <- 200;
int step_count <- 0;
int episode <- 0;
float episode_reward <- 0.0;
float last_episode_reward <- 0.0;
bool food_found <- false;
// === NEW: RL Hyperparameters ===
float alpha <- 0.1; // Learning rate
float gamma_rl <- 0.95; // Discount factor
float epsilon <- 1.0; // Exploration rate (starts high)
float epsilon_min <- 0.01; // Minimum exploration rate
float epsilon_decay <- 0.995; // Decay factor per episode
init {
ask world_cell grid_at {food_x, food_y} {
is_food <- true;
}
loop pos over: obstacle_positions {
ask world_cell grid_at pos {
is_obstacle <- true;
}
}
create forager number: 1 {
my_cell <- world_cell grid_at {0, 0};
location <- my_cell.location;
}
}
reflex manage_episode {
step_count <- step_count + 1;
if (food_found or step_count >= max_steps_per_episode) {
do end_episode;
}
}
action end_episode {
episode <- episode + 1;
last_episode_reward <- episode_reward;
write "Ep " + episode + " | Steps: " + step_count
+ " | Reward: " + round(episode_reward)
+ " | Eps: " + (epsilon with_precision 3)
+ " | Q-size: " + length(forager[0].q_table);
episode_reward <- 0.0;
step_count <- 0;
food_found <- false;
// Decay epsilon
if (epsilon > epsilon_min) {
epsilon <- epsilon * epsilon_decay;
}
ask world_cell grid_at {food_x, food_y} {
is_food <- true;
}
ask forager[0] {
my_cell <- world_cell grid_at {0, 0};
location <- my_cell.location;
}
}
}
grid world_cell width: 10 height: 10 neighbors: 4 {
bool is_food <- false;
bool is_obstacle <- false;
rgb color <- #white update: is_obstacle ? rgb(60, 60, 60) : #white;
}
species forager {
world_cell my_cell;
map<string, float> q_table;
list<string> action_list <- ["up", "right", "down", "left"];
string get_state {
return string(my_cell.grid_x) + "_" + string(my_cell.grid_y);
}
float get_q (string s, string a) {
string key <- s + "::" + a;
if (q_table contains_key key) {
return float(q_table[key]);
}
return 0.0;
}
// === NEW: Epsilon-greedy action selection ===
string choose_action {
string s <- get_state();
// With probability epsilon, explore (random action)
if (flip(epsilon)) {
return action_list[rnd(3)];
}
// Otherwise, exploit (pick best known action)
string best_action <- action_list[0];
float best_q <- get_q(s, action_list[0]);
loop a over: action_list {
float q_val <- get_q(s, a);
if (q_val > best_q) {
best_q <- q_val;
best_action <- a;
}
}
return best_action;
}
// === NEW: Q-value update using Bellman equation ===
action update_q_value (string s, string a, float r, string s_next) {
float old_q <- get_q(s, a);
// Find max Q-value for the next state
float max_next_q <- get_q(s_next, action_list[0]);
loop act over: action_list {
float q_val <- get_q(s_next, act);
if (q_val > max_next_q) {
max_next_q <- q_val;
}
}
// Bellman equation: Q(s,a) = Q(s,a) + α * [r + γ * max Q(s',a') - Q(s,a)]
float new_q <- old_q + alpha * (r + gamma_rl * max_next_q - old_q);
q_table[s + "::" + a] <- new_q;
}
// === MODIFIED: Use Q-Learning instead of random walk ===
reflex act {
string current_state <- get_state();
string action_taken <- choose_action();
// Translate action string to grid movement
int new_x <- my_cell.grid_x;
int new_y <- my_cell.grid_y;
switch action_taken {
match "up" { new_y <- new_y - 1; }
match "right" { new_x <- new_x + 1; }
match "down" { new_y <- new_y + 1; }
match "left" { new_x <- new_x - 1; }
}
float step_reward <- -1.0;
if (new_x >= 0 and new_x < grid_size and new_y >= 0 and new_y < grid_size) {
world_cell target <- world_cell grid_at {new_x, new_y};
if (not target.is_obstacle) {
my_cell <- target;
location <- my_cell.location;
if (my_cell.is_food) {
my_cell.is_food <- false;
step_reward <- 100.0;
food_found <- true;
}
} else {
step_reward <- -5.0;
}
} else {
step_reward <- -5.0;
}
// Q-Learning update
string new_state <- get_state();
do update_q_value(current_state, action_taken, step_reward, new_state);
episode_reward <- episode_reward + step_reward;
}
aspect default {
draw circle(0.8) color: #blue;
}
}
experiment smart_forager type: gui {
parameter "Learning Rate (α)" var: alpha min: 0.01 max: 1.0 category: "RL";
parameter "Discount Factor (γ)" var: gamma_rl min: 0.0 max: 1.0 category: "RL";
parameter "Initial Epsilon (ε)" var: epsilon min: 0.0 max: 1.0 category: "RL";
parameter "Epsilon Decay" var: epsilon_decay min: 0.9 max: 1.0 category: "RL";
parameter "Max steps per episode" var: max_steps_per_episode min: 50 max: 1000 category: "Simulation";
output {
display "Grid World" {
grid world_cell border: #lightgray;
species forager;
graphics "food" {
ask world_cell where each.is_food {
draw circle(5) color: rgb(50, 180, 50);
}
}
}
monitor "Episode" value: episode;
monitor "Step" value: step_count;
monitor "Current Reward" value: episode_reward;
monitor "Last Episode Reward" value: last_episode_reward;
monitor "Epsilon" value: epsilon with_precision 4;
monitor "Q-Table Size" value: length(forager[0].q_table);
}
}