ForagerRL_step6 - gama-platform/gama GitHub Wiki
By Killian Trouillet
This final step adds visual feedback to monitor the training process, a training stop condition, and an automatic test mode.
- A reward chart showing the reward per episode over time
- An epsilon decay chart showing how exploration decreases
- A Q-value heatmap on the grid, coloring cells based on learned values
- A training stop after a configurable number of episodes
- An automatic test phase that triggers after training
- A mode indicator and success rate monitor
- Addition of statistics tracking: reward history, food found counter, best reward
- Addition of training control: `max_episodes`, `training_done` flag, `test_episode` counter
- Definition of a heatmap update action that colors grid cells based on Q-values
- Addition of `chart` displays (series type) for reward and epsilon
- Addition of comprehensive `monitor` elements
We add global variables to control training length and perform slow-motion evaluation:
// Training & Test control
int max_episodes <- 500; // Training stops after this many episodes
bool training_done <- false; // Flag: true when training is complete
int test_step_delay <- 5; // In test mode, act every N cycles (slower pace)
int test_episode <- 0; // Counter for test episodes
- `max_episodes`: the agent trains for this many episodes, then the simulation pauses.
- `training_done`: set to `true` automatically when the episode count reaches `max_episodes`.
- `test_step_delay`: in test mode, the forager only acts every N cycles, making the movement slow enough to follow visually.
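To get a feel for how fast exploration fades with these defaults, here is a small Python sketch (not GAML) of the multiplicative epsilon decay used in the model, with the tutorial's values `epsilon = 1.0`, `epsilon_decay = 0.995`, and `epsilon_min = 0.01`:

```python
# Epsilon-greedy decay schedule from the tutorial, sketched in Python:
# epsilon starts at 1.0 and is multiplied by 0.995 after each episode,
# but is never decayed below epsilon_min = 0.01.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

history = []
for episode in range(500):
    history.append(epsilon)
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

print(round(history[0], 3))    # 1.0   -> full exploration at the start
print(round(history[100], 3))  # 0.606 -> still exploring a lot at episode 100
print(round(history[499], 3))  # 0.082 -> mostly exploiting by episode 500
```

With 500 episodes, epsilon never actually reaches the 0.01 floor (0.995^499 ≈ 0.08); the test phase therefore forces epsilon to 0 explicitly rather than relying on the decay.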
To avoid GAMA's parameter reset issue, we split the logic into two automatic phases. Training runs at full speed, while the test phase uses cycle throttling for visualization.
reflex manage_training when: not training_done {
step_count <- step_count + 1;
if (food_found or step_count >= max_steps_per_episode) {
do end_training_episode;
}
}
reflex manage_test when: training_done {
// Slow down for visualization
if (mod(cycle, test_step_delay) != 0) { return; }
step_count <- step_count + 1;
if (food_found or step_count >= max_steps_per_episode) {
do end_test_episode;
}
}
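The throttling guard is just a modulo test on the cycle counter. A minimal Python sketch (for illustration only; the model does this with GAML's `mod(cycle, test_step_delay)`) shows which cycles the agent actually acts on:

```python
# Cycle throttling used in test mode: the agent acts only on cycles
# where cycle % test_step_delay == 0, i.e. once every N cycles.
test_step_delay = 5
acting_cycles = [c for c in range(20) if c % test_step_delay == 0]
print(acting_cycles)  # [0, 5, 10, 15]
```

During training this guard is skipped entirely, so the agent acts on every cycle and learning runs at full speed.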
When training ends, we print a summary, prepare the first test run, and then call `do pause` to freeze GAMA so the user can read the console.
if (episode >= max_episodes) {
training_done <- true;
// ... summary ...
do prepare_test;
do pause;
}
Pressing Play again resumes the simulation and runs one slow-motion test episode.
After each episode, we color the grid cells based on the maximum Q-value the agent has learned for that cell.
action update_heatmap {
ask world_cell where (not each.is_obstacle and not each.is_food) {
string state_key <- string(grid_x) + "_" + string(grid_y);
float max_q <- 0.0;
loop a over: forager[0].action_list {
float q_val <- forager[0].get_q(state_key, a);
if (q_val > max_q) {
max_q <- q_val;
}
}
if (max_q > 0) {
int intensity <- min([255, int(max_q * 3)]);
color <- rgb(255 - intensity, 255 - intensity, 255);
} else {
color <- #white;
}
}
}
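The color rule maps a cell's best Q-value to a shade of blue, saturating once the value gets large. Here is the same mapping as a Python sketch (the function name `cell_color` is ours, not part of the model):

```python
# Python sketch of the heatmap color rule used in update_heatmap:
# a cell's best Q-value is mapped to a blue intensity, capped at 255.
def cell_color(max_q):
    """Return an (r, g, b) tuple: white for unvisited cells, deeper blue for higher Q."""
    if max_q <= 0:
        return (255, 255, 255)            # white: nothing learned yet
    intensity = min(255, int(max_q * 3))  # scale factor 3: q >= 85 saturates to pure blue
    return (255 - intensity, 255 - intensity, 255)

print(cell_color(0.0))    # (255, 255, 255) -> white
print(cell_color(10.0))   # (225, 225, 255) -> light blue
print(cell_color(100.0))  # (0, 0, 255)     -> fully saturated blue
```

With a food reward of 100 and discounting at 0.95, cells near the food end up deep blue while distant or rarely visited cells stay pale, which makes the learned value gradient visible at a glance.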
display "Training Progress" type: 2d {
chart "Episode Reward" type: series size: {1.0, 0.5} position: {0, 0} {
data "Last Reward" value: last_episode_reward color: #blue marker: false;
}
chart "Epsilon Decay" type: series size: {1.0, 0.5} position: {0, 0.5} {
data "Epsilon" value: epsilon color: #red marker: false;
}
}
/**
* Name: SmartForager - Step 6: Charts, Visualization & Test Mode
* Author: Killian Trouillet
* Description: This final step adds training charts, colored Q-value heatmap on the grid,
* a training stop condition, and an AUTOMATIC TEST MODE to evaluate the learned policy.
* After training completes, press Play to watch the agent use its learned policy.
* Tags: reinforcement-learning, q-learning, chart, visualization, test, tutorial
*/
model SmartForager
global {
int grid_size <- 10;
int food_x <- 9;
int food_y <- 9;
list<point> obstacle_positions <- [{2,2}, {3,2}, {2,3}, {6,4}, {7,4}, {7,5}];
int max_steps_per_episode <- 200;
int step_count <- 0;
int episode <- 0;
float episode_reward <- 0.0;
float last_episode_reward <- 0.0;
bool food_found <- false;
// RL Hyperparameters
float alpha <- 0.1;
float gamma_rl <- 0.95;
float epsilon <- 1.0;
float epsilon_min <- 0.01;
float epsilon_decay <- 0.995;
// === Training & Test control ===
int max_episodes <- 500; // Training stops after this many episodes
bool training_done <- false; // Flag: true when training is complete
int test_step_delay <- 5; // In test mode, act every N cycles (slower pace)
int test_episode <- 0; // Counter for test episodes
// === Statistics tracking ===
list<float> reward_history;
int total_food_found <- 0;
float best_reward <- -1000.0;
init {
ask world_cell grid_at {food_x, food_y} {
is_food <- true;
}
loop pos over: obstacle_positions {
ask world_cell grid_at pos {
is_obstacle <- true;
}
}
create forager number: 1 {
my_cell <- world_cell grid_at {0, 0};
location <- my_cell.location;
}
}
// === TRAINING PHASE ===
reflex manage_training when: not training_done {
step_count <- step_count + 1;
if (food_found or step_count >= max_steps_per_episode) {
do end_training_episode;
}
}
action end_training_episode {
episode <- episode + 1;
last_episode_reward <- episode_reward;
add episode_reward to: reward_history;
if (episode_reward > best_reward) {
best_reward <- episode_reward;
}
if (food_found) {
total_food_found <- total_food_found + 1;
}
write "Ep " + episode + " | Steps: " + step_count
+ " | Reward: " + round(episode_reward)
+ " | Eps: " + (epsilon with_precision 3)
+ " | Found: " + food_found;
// Reset counters
episode_reward <- 0.0;
step_count <- 0;
food_found <- false;
// Decay epsilon
if (epsilon > epsilon_min) {
epsilon <- epsilon * epsilon_decay;
}
// Check if training is complete
if (episode >= max_episodes) {
training_done <- true;
write "";
write "========================================";
write " TRAINING COMPLETE after " + episode + " episodes";
write " Best reward: " + round(best_reward);
write " Success rate: " + round(total_food_found / episode * 100) + "%";
write " Q-Table size: " + length(forager[0].q_table);
write "========================================";
write " Press PLAY to watch the learned policy!";
write " (each Play runs one slow test episode)";
write "========================================";
// Prepare for first test
do prepare_test;
do update_heatmap;
do pause;
return;
}
// Reset for next training episode
ask world_cell grid_at {food_x, food_y} {
is_food <- true;
}
ask forager[0] {
my_cell <- world_cell grid_at {0, 0};
location <- my_cell.location;
}
do update_heatmap;
}
// === TEST PHASE ===
reflex manage_test when: training_done {
// Only act every N cycles for slow visualization
if (mod(cycle, test_step_delay) != 0) {
return;
}
step_count <- step_count + 1;
if (food_found or step_count >= max_steps_per_episode) {
do end_test_episode;
}
}
action prepare_test {
step_count <- 0;
episode_reward <- 0.0;
food_found <- false;
ask world_cell grid_at {food_x, food_y} {
is_food <- true;
}
ask forager[0] {
my_cell <- world_cell grid_at {0, 0};
location <- my_cell.location;
}
}
action end_test_episode {
test_episode <- test_episode + 1;
write "";
write "===== TEST " + test_episode + " FINISHED =====";
write " Steps: " + step_count + " | Reward: " + round(episode_reward);
write " Food found: " + food_found;
write " Press PLAY for another test.";
write "====================================";
// Prepare for next test (reset position)
do prepare_test;
do pause;
}
// === Q-value heatmap ===
action update_heatmap {
ask world_cell where (not each.is_obstacle and not each.is_food) {
string state_key <- string(grid_x) + "_" + string(grid_y);
float max_q <- 0.0;
loop a over: forager[0].action_list {
float q_val <- forager[0].get_q(state_key, a);
if (q_val > max_q) {
max_q <- q_val;
}
}
if (max_q > 0) {
int intensity <- min([255, int(max_q * 3)]);
color <- rgb(255 - intensity, 255 - intensity, 255);
} else {
color <- #white;
}
}
}
}
grid world_cell width: 10 height: 10 neighbors: 4 {
bool is_food <- false;
bool is_obstacle <- false;
rgb color <- #white update: is_obstacle ? rgb(60, 60, 60) : color;
}
species forager {
world_cell my_cell;
map<string, float> q_table;
list<string> action_list <- ["up", "right", "down", "left"];
string get_state {
return string(my_cell.grid_x) + "_" + string(my_cell.grid_y);
}
float get_q (string s, string a) {
string key <- s + "::" + a;
if (q_table contains_key key) {
return float(q_table[key]);
}
return 0.0;
}
string choose_action {
string s <- get_state();
// After training, always exploit (epsilon forced to 0)
float effective_epsilon <- training_done ? 0.0 : epsilon;
if (flip(effective_epsilon)) {
return action_list[rnd(3)];
}
string best_action <- action_list[0];
float best_q <- get_q(s, action_list[0]);
loop a over: action_list {
float q_val <- get_q(s, a);
if (q_val > best_q) {
best_q <- q_val;
best_action <- a;
}
}
return best_action;
}
action update_q_value (string s, string a, float r, string s_next) {
float old_q <- get_q(s, a);
float max_next_q <- get_q(s_next, action_list[0]);
loop act over: action_list {
float q_val <- get_q(s_next, act);
if (q_val > max_next_q) {
max_next_q <- q_val;
}
}
float new_q <- old_q + alpha * (r + gamma_rl * max_next_q - old_q);
q_table[s + "::" + a] <- new_q;
}
// The act reflex is throttled in test mode via the global manage_test reflex
reflex act when: (not training_done) or (training_done and mod(cycle, test_step_delay) = 0) {
string current_state <- get_state();
string action_taken <- choose_action();
int new_x <- my_cell.grid_x;
int new_y <- my_cell.grid_y;
switch action_taken {
match "up" { new_y <- new_y - 1; }
match "right" { new_x <- new_x + 1; }
match "down" { new_y <- new_y + 1; }
match "left" { new_x <- new_x - 1; }
}
float step_reward <- -1.0;
if (new_x >= 0 and new_x < grid_size and new_y >= 0 and new_y < grid_size) {
world_cell target <- world_cell grid_at {new_x, new_y};
if (not target.is_obstacle) {
my_cell <- target;
location <- my_cell.location;
if (my_cell.is_food) {
my_cell.is_food <- false;
step_reward <- 100.0;
food_found <- true;
}
} else {
step_reward <- -5.0;
}
} else {
step_reward <- -5.0;
}
string new_state <- get_state();
// Only update Q-values during training
if (not training_done) {
do update_q_value(current_state, action_taken, step_reward, new_state);
}
episode_reward <- episode_reward + step_reward;
}
aspect default {
draw circle(0.8) color: training_done ? #orange : #blue;
}
}
experiment smart_forager type: gui {
parameter "Learning Rate (α)" var: alpha min: 0.01 max: 1.0 category: "RL";
parameter "Discount Factor (γ)" var: gamma_rl min: 0.0 max: 1.0 category: "RL";
parameter "Initial Epsilon (ε)" var: epsilon min: 0.0 max: 1.0 category: "RL";
parameter "Epsilon Decay" var: epsilon_decay min: 0.9 max: 1.0 category: "RL";
parameter "Epsilon Min" var: epsilon_min min: 0.0 max: 0.5 category: "RL";
parameter "Max Training Episodes" var: max_episodes min: 100 max: 5000 category: "Training";
parameter "Max steps per episode" var: max_steps_per_episode min: 50 max: 1000 category: "Training";
output {
display "Grid World" {
grid world_cell border: #lightgray;
species forager;
graphics "food" {
ask world_cell where each.is_food {
draw circle(5) color: rgb(50, 180, 50);
}
}
}
display "Training Progress" type: 2d {
chart "Episode Reward" type: series size: {1.0, 0.5} position: {0, 0} {
data "Last Reward" value: last_episode_reward color: #blue marker: false;
}
chart "Epsilon Decay" type: series size: {1.0, 0.5} position: {0, 0.5} {
data "Epsilon" value: epsilon color: #red marker: false;
}
}
monitor "Mode" value: training_done ? "TEST (" + test_episode + ")" : "TRAINING";
monitor "Episode" value: episode;
monitor "Step" value: step_count;
monitor "Current Reward" value: episode_reward;
monitor "Last Episode Reward" value: last_episode_reward;
monitor "Best Reward" value: best_reward;
monitor "Epsilon" value: epsilon with_precision 4;
monitor "Q-Table Size" value: length(forager[0].q_table);
monitor "Food Found (total)" value: total_food_found;
monitor "Success Rate (%)" value: episode > 0 ? round(total_food_found / episode * 100) : 0;
}
}
- **Train**: Launch the experiment and press ▶️. The forager (blue) trains for 500 episodes.
- **Wait**: When the console prints `TRAINING COMPLETE`, the simulation pauses automatically.
- **Test**: Press ▶️ again. The forager turns orange and moves slowly, following the learned policy.
- **Observe**: Once the forager reaches the food, the simulation pauses and prints the results. Press ▶️ for more tests.

| Concept | GAML Implementation |
|---|---|
| Grid environment | `grid` species with dynamic color `update:` |
| Agent on a grid | `species` with a `my_cell` attribute |
| Reward function | Conditional rewards in a `reflex` |
| Episode management | Global counters, flags, and a reset `action` |
| Q-Table | `map<string, float>` with `"::"` key concatenation |
| State representation | `string(grid_x) + "_" + string(grid_y)` |
| Epsilon-greedy | `flip(epsilon)` for explore vs exploit |
| Q-Learning update | Bellman equation in an `action` |
| Training stop | `max_episodes` counter + `training_done` flag |
| Visualization | `chart type: series`, `monitor`, heatmap |
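As a sanity check on the "Q-Learning update" row, the Bellman update performed by `update_q_value` can be reproduced numerically in a few lines of Python (the values below are illustrative, not taken from a real run):

```python
# Worked example of the tabular Q-update from update_q_value,
# with the tutorial's hyperparameters alpha = 0.1 and gamma = 0.95.
alpha, gamma = 0.1, 0.95

old_q = 2.0        # current estimate Q(s, a)
reward = -1.0      # per-step penalty
max_next_q = 40.0  # best Q-value reachable from the next state s'

new_q = old_q + alpha * (reward + gamma * max_next_q - old_q)
print(new_q)  # 2.0 + 0.1 * (-1.0 + 0.95 * 40.0 - 2.0) = 5.5
```

The estimate moves only a fraction `alpha` of the way toward the discounted target, which is why many episodes are needed before the values near the food propagate back to the start cell.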
GAML keywords and operators used: `global`, `grid`, `species`, `reflex`, `aspect`, `init`, `action`, `experiment`, `parameter`, `monitor`, `chart`, `display`, `map`, `list`, `switch`, `match`, `loop`, `ask`, `create`, `flip`, `rnd`, `one_of`, `contains_key`, `grid_at`, `update:`, `when:`, `do pause`
The tabular Q-Learning used in these steps has two major limitations:
- Scaling: It only works on small grids. For a 1000x1000 grid or high-definition vision, the table would be too big.
- Discrete Space: It requires a grid. It cannot handle continuous movement or actions.
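The scaling problem is easy to quantify: the table needs one entry per state-action pair. A quick back-of-the-envelope calculation in Python (the helper `q_table_entries` is ours, for illustration):

```python
# Size of a tabular Q-table: one entry per (state, action) pair,
# with 4 actions per grid cell as in this tutorial.
def q_table_entries(width, height, n_actions=4):
    return width * height * n_actions

print(q_table_entries(10, 10))      # 400 entries -- trivial to store
print(q_table_entries(1000, 1000))  # 4,000,000 entries -- already unwieldy
```

And this only counts position-based states; as soon as the state includes anything continuous (headings, distances, raw vision), the table becomes infinite, which is exactly what the neural-network approaches of the next parts address.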
In the next part, we will solve these problems using Deep Reinforcement Learning:
- **Part 2**: Connect this model to Python using `gama-gymnasium` to handle continuous movement and complex observations with neural networks (PPO algorithm).
- **Part 3**: Extend to multiple foragers using `gama-pettingzoo` for multi-agent learning.