ForagerRL_step6 - gama-platform/gama GitHub Wiki

The Smart Forager — Step 6: Charts, Automatic Test and Visualization

By Killian Trouillet


Step 6: Charts, Test Mode and Visualization

Content

This final step adds visual feedback to monitor the training process, a training stop condition, and an automatic test mode.

  • A reward chart showing the reward per episode over time
  • An epsilon decay chart showing how exploration decreases
  • A Q-value heatmap on the grid, coloring cells based on learned values
  • A training stop after a configurable number of episodes
  • An automatic test phase that triggers after training
  • A mode indicator and success rate monitor

Formulation

  • Addition of statistics tracking: reward history, food found counter, best reward
  • Addition of training control: max_episodes, training_done flag, test_episode counter
  • Definition of a heatmap update action that colors grid cells based on Q-values
  • Addition of chart displays (series type) for reward and epsilon
  • Addition of comprehensive monitor elements

Model Definition

Training & Test control variables

We add global variables to control training length and perform slow-motion evaluation:

    // Training & Test control
    int max_episodes <- 500;       // Training stops after this many episodes
    bool training_done <- false;   // Flag: true when training is complete
    int test_step_delay <- 5;      // In test mode, act every N cycles (slower pace)
    int test_episode <- 0;         // Counter for test episodes
  • max_episodes: The agent trains for this many episodes, then the simulation pauses.
  • training_done: Set to true automatically when the episode count reaches max_episodes.
  • test_step_delay: In test mode, the forager only acts every N cycles, making the movement slow enough to follow visually.
  • test_episode: Counts completed test runs; it is shown in the Mode monitor.
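The throttling rule behind test_step_delay can be sketched outside GAML. The following Python snippet (illustrative only, not part of the model) mirrors the guard mod(cycle, test_step_delay) = 0 and shows which cycles the forager would act on:

```python
# Python sketch of the cycle-throttling rule used in test mode.
# With test_step_delay = 5, the forager acts only on cycles 0, 5, 10, ...
test_step_delay = 5

def acts_on(cycle: int) -> bool:
    """Mirror of GAML's `mod(cycle, test_step_delay) = 0` guard."""
    return cycle % test_step_delay == 0

acting_cycles = [c for c in range(0, 16) if acts_on(c)]
print(acting_cycles)  # [0, 5, 10, 15]
```

Larger values of test_step_delay slow the test replay further without changing the policy being followed.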

Separate Training and Test reflexes

Rather than reloading the simulation between phases (which would reset the parameters and discard the learned Q-table), we split the logic into two reflexes guarded by the training_done flag. Training runs at full speed, while the test phase uses cycle throttling for visualization.

Training Phase

    reflex manage_training when: not training_done {
        step_count <- step_count + 1;
        if (food_found or step_count >= max_steps_per_episode) {
            do end_training_episode;
        }
    }

Test Phase

    reflex manage_test when: training_done {
        // Slow down for visualization
        if (mod(cycle, test_step_delay) != 0) { return; }
        
        step_count <- step_count + 1;
        if (food_found or step_count >= max_steps_per_episode) {
            do end_test_episode;
        }
    }
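Both reflexes share the same termination condition: an episode ends either when the food is found or when the step budget runs out. As a minimal Python sketch (hypothetical helper, illustrative only):

```python
# Python sketch of the episode-termination check shared by
# manage_training and manage_test in the GAML model.
def episode_done(food_found: bool, step_count: int, max_steps: int = 200) -> bool:
    """An episode ends on success or when the step budget is exhausted."""
    return food_found or step_count >= max_steps

print(episode_done(False, 199))  # False: still running
print(episode_done(False, 200))  # True: step budget exhausted
print(episode_done(True, 3))     # True: food found early
```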

Training transition and Pause

When training ends, we print a summary and prepare the first test run, then use do pause to freeze GAMA so the user can read the console.

    if (episode >= max_episodes) {
        training_done <- true;
        // ... summary ...
        do prepare_test;
        do pause;
    }

Pressing Play ▶️ after this point will run a test episode.

Q-value heatmap

After each episode, we color the grid cells based on the maximum Q-value the agent has learned for that cell.

    action update_heatmap {
        ask world_cell where (not each.is_obstacle and not each.is_food) {
            string state_key <- string(grid_x) + "_" + string(grid_y);
            float max_q <- 0.0;
            loop a over: forager[0].action_list {
                float q_val <- forager[0].get_q(state_key, a);
                if (q_val > max_q) {
                    max_q <- q_val;
                }
            }
            if (max_q > 0) {
                int intensity <- min([255, int(max_q * 3)]);
                color <- rgb(255 - intensity, 255 - intensity, 255);
            } else {
                color <- #white;
            }
        }
    }
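The color mapping inside update_heatmap can be checked in isolation. This Python sketch (illustrative, not part of the GAML model) reproduces the same arithmetic: the maximum Q-value is scaled by 3, clamped to 255, and subtracted from the red and green channels so that higher values appear as deeper blue:

```python
# Python mirror of the heatmap colour mapping in update_heatmap.
# Higher max Q-value -> deeper blue; zero or negative values stay white.
def heatmap_color(max_q: float) -> tuple[int, int, int]:
    if max_q > 0:
        intensity = min(255, int(max_q * 3))
        return (255 - intensity, 255 - intensity, 255)
    return (255, 255, 255)  # white

print(heatmap_color(0.0))    # (255, 255, 255) -- white
print(heatmap_color(10.0))   # (225, 225, 255) -- light blue
print(heatmap_color(100.0))  # (0, 0, 255)     -- saturated blue
```

The factor 3 means the blue channel saturates for Q-values of 85 and above, which fits a reward scale where reaching the food is worth 100.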

Charts

Two stacked series charts track learning progress: the reward obtained in each episode and the decay of epsilon.

    display "Training Progress" type: 2d {
        chart "Episode Reward" type: series size: {1.0, 0.5} position: {0, 0} {
            data "Last Reward" value: last_episode_reward color: #blue marker: false;
        }
        chart "Epsilon Decay" type: series size: {1.0, 0.5} position: {0, 0.5} {
            data "Epsilon" value: epsilon color: #red marker: false;
        }
    }
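The shape of the epsilon-decay curve follows directly from the hyperparameters (epsilon_decay = 0.995, epsilon_min = 0.01). A small Python sketch (illustrative only) shows how much exploration remains at the end of the 500-episode run, and how many episodes it would take to reach the floor:

```python
import math

# Sketch of the epsilon-decay curve plotted in the "Epsilon Decay" chart,
# using the hyperparameters from the model.
def epsilon_after(episodes: int, start: float = 1.0,
                  decay: float = 0.995, floor: float = 0.01) -> float:
    return max(floor, start * decay ** episodes)

print(round(epsilon_after(500), 3))  # ~0.082 exploration left after training
episodes_to_floor = math.ceil(math.log(0.01) / math.log(0.995))
print(episodes_to_floor)             # ~919 episodes needed to hit epsilon_min
```

With 500 training episodes the agent therefore never reaches epsilon_min; roughly 8% of its actions are still random in the last episode, which is one reason the test phase forces epsilon to 0.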

Complete Model

/**
* Name: SmartForager - Step 6: Charts, Visualization & Test Mode
* Author: Killian Trouillet
* Description: This final step adds training charts, colored Q-value heatmap on the grid,
*              a training stop condition, and an AUTOMATIC TEST MODE to evaluate the learned policy.
*              After training completes, press Play to watch the agent use its learned policy.
* Tags: reinforcement-learning, q-learning, chart, visualization, test, tutorial
*/

model SmartForager

global {
	int grid_size <- 10;
	int food_x <- 9;
	int food_y <- 9;
	list<point> obstacle_positions <- [{2,2}, {3,2}, {2,3}, {6,4}, {7,4}, {7,5}];
	
	int max_steps_per_episode <- 200;
	int step_count <- 0;
	int episode <- 0;
	float episode_reward <- 0.0;
	float last_episode_reward <- 0.0;
	bool food_found <- false;
	
	// RL Hyperparameters
	float alpha <- 0.1;
	float gamma_rl <- 0.95;
	float epsilon <- 1.0;
	float epsilon_min <- 0.01;
	float epsilon_decay <- 0.995;
	
	// === Training & Test control ===
	int max_episodes <- 500;       // Training stops after this many episodes
	bool training_done <- false;   // Flag: true when training is complete
	int test_step_delay <- 5;      // In test mode, act every N cycles (slower pace)
	int test_episode <- 0;         // Counter for test episodes
	
	// === Statistics tracking ===
	list<float> reward_history;
	int total_food_found <- 0;
	float best_reward <- -1000.0;
	
	init {
		ask world_cell grid_at {food_x, food_y} {
			is_food <- true;
		}
		loop pos over: obstacle_positions {
			ask world_cell grid_at pos {
				is_obstacle <- true;
			}
		}
		create forager number: 1 {
			my_cell <- world_cell grid_at {0, 0};
			location <- my_cell.location;
		}
	}
	
	// === TRAINING PHASE ===
	reflex manage_training when: not training_done {
		step_count <- step_count + 1;
		if (food_found or step_count >= max_steps_per_episode) {
			do end_training_episode;
		}
	}
	
	action end_training_episode {
		episode <- episode + 1;
		last_episode_reward <- episode_reward;
		add episode_reward to: reward_history;
		if (episode_reward > best_reward) {
			best_reward <- episode_reward;
		}
		if (food_found) {
			total_food_found <- total_food_found + 1;
		}
		
		write "Ep " + episode + " | Steps: " + step_count 
		      + " | Reward: " + round(episode_reward) 
		      + " | Eps: " + (epsilon with_precision 3)
		      + " | Found: " + food_found;
		
		// Reset counters
		episode_reward <- 0.0;
		step_count <- 0;
		food_found <- false;
		
		// Decay epsilon
		if (epsilon > epsilon_min) {
			epsilon <- epsilon * epsilon_decay;
		}
		
		// Check if training is complete
		if (episode >= max_episodes) {
			training_done <- true;
			write "";
			write "========================================";
			write "  TRAINING COMPLETE after " + episode + " episodes";
			write "  Best reward: " + round(best_reward);
			write "  Success rate: " + round(total_food_found / episode * 100) + "%";
			write "  Q-Table size: " + length(forager[0].q_table);
			write "========================================";
			write "  Press PLAY to watch the learned policy!";
			write "  (each Play runs one slow test episode)";
			write "========================================";
			
			// Prepare for first test
			do prepare_test;
			do update_heatmap;
			do pause;
			return;
		}
		
		// Reset for next training episode
		ask world_cell grid_at {food_x, food_y} {
			is_food <- true;
		}
		ask forager[0] {
			my_cell <- world_cell grid_at {0, 0};
			location <- my_cell.location;
		}
		
		do update_heatmap;
	}
	
	// === TEST PHASE ===
	reflex manage_test when: training_done {
		// Only act every N cycles for slow visualization
		if (mod(cycle, test_step_delay) != 0) {
			return;
		}
		
		step_count <- step_count + 1;
		if (food_found or step_count >= max_steps_per_episode) {
			do end_test_episode;
		}
	}
	
	action prepare_test {
		step_count <- 0;
		episode_reward <- 0.0;
		food_found <- false;
		ask world_cell grid_at {food_x, food_y} {
			is_food <- true;
		}
		ask forager[0] {
			my_cell <- world_cell grid_at {0, 0};
			location <- my_cell.location;
		}
	}
	
	action end_test_episode {
		test_episode <- test_episode + 1;
		write "";
		write "===== TEST " + test_episode + " FINISHED =====";
		write "  Steps: " + step_count + " | Reward: " + round(episode_reward);
		write "  Food found: " + food_found;
		write "  Press PLAY for another test.";
		write "====================================";
		
		// Prepare for next test (reset position)
		do prepare_test;
		do pause;
	}
	
	// === Q-value heatmap ===
	action update_heatmap {
		ask world_cell where (not each.is_obstacle and not each.is_food) {
			string state_key <- string(grid_x) + "_" + string(grid_y);
			float max_q <- 0.0;
			loop a over: forager[0].action_list {
				float q_val <- forager[0].get_q(state_key, a);
				if (q_val > max_q) {
					max_q <- q_val;
				}
			}
			if (max_q > 0) {
				int intensity <- min([255, int(max_q * 3)]);
				color <- rgb(255 - intensity, 255 - intensity, 255);
			} else {
				color <- #white;
			}
		}
	}
}

grid world_cell width: 10 height: 10 neighbors: 4 {
	bool is_food <- false;
	bool is_obstacle <- false;
	rgb color <- #white update: is_obstacle ? rgb(60, 60, 60) : color;
}

species forager {
	world_cell my_cell;
	
	map<string, float> q_table;
	list<string> action_list <- ["up", "right", "down", "left"];
	
	string get_state {
		return string(my_cell.grid_x) + "_" + string(my_cell.grid_y);
	}
	
	float get_q (string s, string a) {
		string key <- s + "::" + a;
		if (q_table contains_key key) {
			return float(q_table[key]);
		}
		return 0.0;
	}
	
	string choose_action {
		string s <- get_state();
		// After training, always exploit (epsilon forced to 0)
		float effective_epsilon <- training_done ? 0.0 : epsilon;
		if (flip(effective_epsilon)) {
			return action_list[rnd(3)];
		}
		string best_action <- action_list[0];
		float best_q <- get_q(s, action_list[0]);
		loop a over: action_list {
			float q_val <- get_q(s, a);
			if (q_val > best_q) {
				best_q <- q_val;
				best_action <- a;
			}
		}
		return best_action;
	}
	
	action update_q_value (string s, string a, float r, string s_next) {
		float old_q <- get_q(s, a);
		float max_next_q <- get_q(s_next, action_list[0]);
		loop act over: action_list {
			float q_val <- get_q(s_next, act);
			if (q_val > max_next_q) {
				max_next_q <- q_val;
			}
		}
		float new_q <- old_q + alpha * (r + gamma_rl * max_next_q - old_q);
		q_table[s + "::" + a] <- new_q;
	}
	
	// The act reflex is throttled in test mode via the global manage_test reflex
	reflex act when: (not training_done) or (training_done and mod(cycle, test_step_delay) = 0) {
		string current_state <- get_state();
		string action_taken <- choose_action();
		
		int new_x <- my_cell.grid_x;
		int new_y <- my_cell.grid_y;
		
		switch action_taken {
			match "up"    { new_y <- new_y - 1; }
			match "right" { new_x <- new_x + 1; }
			match "down"  { new_y <- new_y + 1; }
			match "left"  { new_x <- new_x - 1; }
		}
		
		float step_reward <- -1.0;
		
		if (new_x >= 0 and new_x < grid_size and new_y >= 0 and new_y < grid_size) {
			world_cell target <- world_cell grid_at {new_x, new_y};
			if (not target.is_obstacle) {
				my_cell <- target;
				location <- my_cell.location;
				if (my_cell.is_food) {
					my_cell.is_food <- false;
					step_reward <- 100.0;
					food_found <- true;
				}
			} else {
				step_reward <- -5.0;
			}
		} else {
			step_reward <- -5.0;
		}
		
		string new_state <- get_state();
		// Only update Q-values during training
		if (not training_done) {
			do update_q_value(current_state, action_taken, step_reward, new_state);
		}
		
		episode_reward <- episode_reward + step_reward;
	}
	
	aspect default {
		draw circle(0.8) color: training_done ? #orange : #blue;
	}
}

experiment smart_forager type: gui {
	parameter "Learning Rate (α)" var: alpha min: 0.01 max: 1.0 category: "RL";
	parameter "Discount Factor (γ)" var: gamma_rl min: 0.0 max: 1.0 category: "RL";
	parameter "Initial Epsilon (ε)" var: epsilon min: 0.0 max: 1.0 category: "RL";
	parameter "Epsilon Decay" var: epsilon_decay min: 0.9 max: 1.0 category: "RL";
	parameter "Epsilon Min" var: epsilon_min min: 0.0 max: 0.5 category: "RL";
	parameter "Max Training Episodes" var: max_episodes min: 100 max: 5000 category: "Training";
	parameter "Max steps per episode" var: max_steps_per_episode min: 50 max: 1000 category: "Training";
	
	output {
		display "Grid World" {
			grid world_cell border: #lightgray;
			species forager;
			graphics "food" {
				ask world_cell where each.is_food {
					draw circle(5) color: rgb(50, 180, 50);
				}
			}
		}
		
		display "Training Progress" type: 2d {
			chart "Episode Reward" type: series size: {1.0, 0.5} position: {0, 0} {
				data "Last Reward" value: last_episode_reward color: #blue marker: false;
			}
			chart "Epsilon Decay" type: series size: {1.0, 0.5} position: {0, 0.5} {
				data "Epsilon" value: epsilon color: #red marker: false;
			}
		}
		
		monitor "Mode" value: training_done ? "TEST (" + test_episode + ")" : "TRAINING";
		monitor "Episode" value: episode;
		monitor "Step" value: step_count;
		monitor "Current Reward" value: episode_reward;
		monitor "Last Episode Reward" value: last_episode_reward;
		monitor "Best Reward" value: best_reward;
		monitor "Epsilon" value: epsilon with_precision 4;
		monitor "Q-Table Size" value: length(forager[0].q_table);
		monitor "Food Found (total)" value: total_food_found;
		monitor "Success Rate (%)" value: episode > 0 ? round(total_food_found / episode * 100) : 0;
	}
}
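The learning rule embedded in update_q_value above can be checked numerically. This Python sketch (illustrative, using the model's alpha = 0.1 and gamma = 0.95) applies the same Bellman update to two typical situations:

```python
# Numeric check of the Q-update performed in update_q_value:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
alpha, gamma = 0.1, 0.95

def q_update(old_q: float, reward: float, max_next_q: float) -> float:
    return old_q + alpha * (reward + gamma * max_next_q - old_q)

# Ordinary step (reward -1) from an unexplored state to an unexplored successor:
print(round(q_update(0.0, -1.0, 0.0), 6))   # -0.1
# Step that reaches the food (reward +100) from an unexplored state:
print(round(q_update(0.0, 100.0, 0.0), 6))  # 10.0
```

Repeated episodes then propagate the large positive value at the food backwards through the grid via the gamma * max_next_q term, which is exactly what the heatmap visualizes.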

How to use

  1. Train: Launch the experiment and press ▶️. The forager (blue) trains for 500 episodes.
  2. Wait: When the console prints TRAINING COMPLETE, the simulation pauses automatically.
  3. Test: Press ▶️ again. The forager turns orange and moves slowly, following its learned policy.
  4. Observe: When the forager reaches the food, the simulation pauses and prints the results. Press ▶️ to run additional test episodes.

(Screenshot: console output showing training completion and test results.)


Summary

| Concept | GAML Implementation |
| --- | --- |
| Grid environment | grid species with dynamic color update |
| Agent on a grid | species with a my_cell attribute |
| Reward function | Conditional rewards in a reflex |
| Episode management | Global counters, flags, and reset action |
| Q-Table | map<string, float> with "::" key concatenation |
| State representation | string(grid_x) + "_" + string(grid_y) |
| Epsilon-greedy | flip(epsilon) for explore vs exploit |
| Q-Learning update | Bellman equation in an action |
| Training stop | max_episodes counter + training_done flag |
| Visualization | chart type: series, monitor, heatmap |

Key GAML Keywords Used

global, grid, species, reflex, aspect, init, action, experiment, parameter, monitor, chart, display, map, list, switch, match, loop, ask, create, flip, rnd, one_of, contains_key, grid_at, update:, when:, do pause

Next Steps

The tabular Q-learning used in these steps has two major limitations:

  1. Scaling: It only works on small state spaces. For a 1000x1000 grid, or for richer observations such as high-definition vision, the table would grow far too large.
  2. Discrete spaces: It requires discrete states and actions. It cannot directly handle continuous movement or continuous actions.
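The scaling problem is easy to quantify: a tabular agent needs one Q-value per state-action pair. A back-of-the-envelope Python sketch (illustrative only):

```python
# Back-of-the-envelope sketch of why tabular Q-learning stops scaling.
# States = grid cells, actions = 4 moves; the table needs one entry per pair.
def q_table_entries(width: int, height: int, n_actions: int = 4) -> int:
    return width * height * n_actions

print(q_table_entries(10, 10))       # this tutorial's grid: 400 entries
print(q_table_entries(1000, 1000))   # the 1000x1000 grid above: 4000000 entries
```

And this still assumes the state is just the agent's position; adding food locations, other agents, or sensor readings multiplies the state count further.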

In the next part, we will solve these problems using Deep Reinforcement Learning:

  • Part 2: Connect this model to Python using gama-gymnasium to handle continuous movement and complex observations with neural networks (PPO algorithm).
  • Part 3: Extend to multiple foragers using gama-pettingzoo for Multi-Agent learning.