ForagerRL_step5 - gama-platform/gama GitHub Wiki

The Smart Forager — Step 5: Q-Learning Algorithm

By Killian Trouillet


Step 5: Q-Learning Algorithm

Content

This is the key step. We replace random movement with the Q-Learning algorithm. The agent now makes decisions based on its Q-Table and improves its policy after every step using the Bellman equation.

Formulation

  • Definition of RL hyperparameters: learning rate (α), discount factor (γ), epsilon (ε)
  • Implementation of epsilon-greedy action selection
  • Implementation of the Q-Learning update rule (Bellman equation)
  • Decay of epsilon over episodes (exploration → exploitation)

Model Definition

Hyperparameters

We define the RL parameters as global variables so they can be adjusted via the experiment interface:

global {
    // ... (previous variables)
    float alpha <- 0.1;           // Learning rate: how fast the agent learns
    float gamma_rl <- 0.95;       // Discount factor: importance of future rewards
    float epsilon <- 1.0;         // Exploration rate: probability of a random action
    float epsilon_min <- 0.01;    // Minimum exploration rate
    float epsilon_decay <- 0.995; // Multiply epsilon by this after each episode
}

  • α (alpha): Controls how much new information overrides old information. Higher = faster learning but less stable.
  • γ (gamma_rl): How much the agent values future rewards vs. immediate ones. 0.95 means future rewards are almost as important as immediate ones.
  • ε (epsilon): The probability of taking a random action instead of the best known action. Starts at 1.0 (100% random) and decays over time.
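To build intuition for α, note that the update rule used later in this step can be rewritten as a weighted average: new_q = (1 − α) · old_q + α · target. A small Python sketch (hypothetical values, not part of the GAML model) shows that a larger α pulls the estimate toward a fixed target faster, at the cost of reacting more strongly to any single noisy reward:

```python
def blended_update(old_q, target, alpha):
    """One Q-value update, written as a weighted average of the old estimate and the target."""
    return (1 - alpha) * old_q + alpha * target

def run(alpha, steps=20, target=10.0):
    q = 0.0
    for _ in range(steps):
        q = blended_update(q, target, alpha)
    return q

q_slow = run(alpha=0.1)  # cautious: still noticeably short of the target after 20 updates
q_fast = run(alpha=0.5)  # aggressive: almost converged after 20 updates
print(q_slow, q_fast)
```

With α = 0.1 the estimate reaches only about 8.78 after 20 updates, while α = 0.5 is already at ~10.0 — the trade-off the bullet above describes.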

Epsilon-greedy action selection

This is the core decision-making function. With probability ε, the agent explores (picks a random action). Otherwise, it exploits (picks the action with the highest Q-value):

    string choose_action {
        string s <- get_state();
        // Explore: random action
        if (flip(epsilon)) {
            return action_list[rnd(3)];
        }
        // Exploit: best known action
        string best_action <- action_list[0];
        float best_q <- get_q(s, action_list[0]);
        loop a over: action_list {
            float q_val <- get_q(s, a);
            if (q_val > best_q) {
                best_q <- q_val;
                best_action <- a;
            }
        }
        return best_action;
    }
  • flip(epsilon): Returns true with probability epsilon.
  • rnd(3): Returns a uniform integer between 0 and 3 inclusive, so it covers all four indices of action_list.
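For reference, the same selection logic can be sketched in Python (the dictionary-based Q-table, the "state::action" keys, and the tie-breaking toward the first action mirror the GAML version above; names are illustrative):

```python
import random

ACTIONS = ["up", "right", "down", "left"]

def choose_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise take the argmax action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    # Exploit: scan the actions, keeping the first one with the highest Q-value
    best_action = ACTIONS[0]
    best_q = q_table.get(state + "::" + ACTIONS[0], 0.0)
    for a in ACTIONS:
        q_val = q_table.get(state + "::" + a, 0.0)
        if q_val > best_q:
            best_q, best_action = q_val, a
    return best_action
```

Note that with an empty Q-table (all values default to 0.0) exploitation always returns "up", the first action — identical to the GAML code, which is why early exploration (high ε) matters.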

Q-Learning update (Bellman equation)

After every action, we update the Q-value for the (state, action) pair we just used. The formula is:

Q(s,a) ← Q(s,a) + α × [r + γ × max Q(s', a') − Q(s,a)]

In GAML:

    action update_q_value (string s, string a, float r, string s_next) {
        float old_q <- get_q(s, a);
        // Find the max Q-value achievable from the new state
        float max_next_q <- get_q(s_next, action_list[0]);
        loop act over: action_list {
            float q_val <- get_q(s_next, act);
            if (q_val > max_next_q) {
                max_next_q <- q_val;
            }
        }
        // Apply the Bellman equation
        float new_q <- old_q + alpha * (r + gamma_rl * max_next_q - old_q);
        q_table[s + "::" + a] <- new_q;
    }
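As a sanity check, the same update can be computed numerically with this tutorial's hyperparameters (α = 0.1, γ = 0.95) and rewards. Starting from an empty Q-Table (all zeros), an ordinary step (r = −1) nudges Q(s,a) to −0.1, while the step that reaches the food (r = 100) jumps to 10.0 — a Python sketch, not part of the model:

```python
def update_q(old_q, r, max_next_q, alpha=0.1, gamma=0.95):
    """Bellman update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    return old_q + alpha * (r + gamma * max_next_q - old_q)

step_q = update_q(old_q=0.0, r=-1.0, max_next_q=0.0)   # plain move: -0.1
food_q = update_q(old_q=0.0, r=100.0, max_next_q=0.0)  # food found: 10.0
print(step_q, food_q)
```

On later visits, the large positive value at the food cell propagates backward through max Q(s',a'), which is how the agent gradually learns a path rather than just the final step.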

Modified movement reflex

The random_move reflex from the previous step becomes act and runs the full Q-Learning pipeline: observe the current state, choose an action, move, collect the reward, then update the Q-Table:

    reflex act {
        string current_state <- get_state();
        string action_taken <- choose_action();

        // ... (movement and reward logic as before)

        string new_state <- get_state();
        do update_q_value(current_state, action_taken, step_reward, new_state);
        episode_reward <- episode_reward + step_reward;
    }
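To see the whole pipeline converge end to end, here is a compact Python stand-in for the same loop on a smaller 4×4 grid (illustrative only, not the GAML model: it reuses this tutorial's rewards of −1 per step, −5 for hitting the border, +100 for the food, and the same ε-decay schedule):

```python
import random

SIZE, FOOD = 4, (3, 3)
ACTIONS = {"up": (0, -1), "right": (1, 0), "down": (0, 1), "left": (-1, 0)}
alpha, gamma, eps, eps_min, eps_decay = 0.1, 0.95, 1.0, 0.01, 0.995

rng = random.Random(42)
q = {}  # (state, action) -> Q-value, defaulting to 0.0

def best_q(s):
    return max(q.get((s, a), 0.0) for a in ACTIONS)

def step(pos, a):
    """Apply an action; return (new_position, reward, episode_done)."""
    dx, dy = ACTIONS[a]
    nx, ny = pos[0] + dx, pos[1] + dy
    if not (0 <= nx < SIZE and 0 <= ny < SIZE):
        return pos, -5.0, False          # bumped into the border
    if (nx, ny) == FOOD:
        return (nx, ny), 100.0, True     # found the food
    return (nx, ny), -1.0, False

for episode in range(1000):
    pos, done = (0, 0), False
    for _ in range(100):                 # max steps per episode
        if done:
            break
        if rng.random() < eps:           # explore
            a = rng.choice(list(ACTIONS))
        else:                            # exploit
            a = max(ACTIONS, key=lambda x: q.get((pos, x), 0.0))
        nxt, r, done = step(pos, a)
        old = q.get((pos, a), 0.0)
        q[(pos, a)] = old + alpha * (r + gamma * best_q(nxt) - old)
        pos = nxt
    if eps > eps_min:                    # decay exploration between episodes
        eps *= eps_decay

# Greedy rollout: the learned policy should reach the food quickly
pos, steps = (0, 0), 0
while pos != FOOD and steps < 20:
    a = max(ACTIONS, key=lambda x: q.get((pos, x), 0.0))
    pos, _, _ = step(pos, a)
    steps += 1
print(pos, steps)
```

After training, a purely greedy rollout reaches the food in a handful of steps — the same behavior you should observe in GAMA once epsilon has decayed.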

Epsilon decay

At the end of each episode, we reduce epsilon to make the agent explore less and exploit more:

    // In end_episode action:
    if (epsilon > epsilon_min) {
        epsilon <- epsilon * epsilon_decay;
    }
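With the tutorial's values (start at 1.0, multiply by 0.995 per episode, floor at 0.01), a quick Python calculation (not part of the model) shows how long exploration lasts:

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
episodes = 0
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    episodes += 1
print(episodes)  # prints 919: episodes until epsilon reaches its floor
```

So the agent keeps a meaningful amount of exploration for hundreds of episodes, which is why the reward curve typically only stabilizes late in a run.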

Experiment parameters

We expose the hyperparameters in the GUI so users can experiment:

experiment smart_forager type: gui {
    parameter "Learning Rate (α)" var: alpha min: 0.01 max: 1.0 category: "RL";
    parameter "Discount Factor (γ)" var: gamma_rl min: 0.0 max: 1.0 category: "RL";
    parameter "Initial Epsilon (ε)" var: epsilon min: 0.0 max: 1.0 category: "RL";
    parameter "Epsilon Decay" var: epsilon_decay min: 0.9 max: 1.0 category: "RL";
    // ...
}

Complete Model

/**
* Name: SmartForager - Step 5: Q-Learning Algorithm
* Author: Killian Trouillet
* Description: This fifth step replaces random movement with the Q-Learning algorithm.
*              The forager now uses epsilon-greedy action selection and updates its Q-Table
*              after each step using the Bellman equation.
* Tags: reinforcement-learning, q-learning, epsilon-greedy, tutorial
*/

model SmartForager

global {
	int grid_size <- 10;
	int food_x <- 9;
	int food_y <- 9;
	list<point> obstacle_positions <- [{2,2}, {3,2}, {2,3}, {6,4}, {7,4}, {7,5}];
	
	int max_steps_per_episode <- 200;
	int step_count <- 0;
	int episode <- 0;
	float episode_reward <- 0.0;
	float last_episode_reward <- 0.0;
	bool food_found <- false;
	
	// === NEW: RL Hyperparameters ===
	float alpha <- 0.1;        // Learning rate
	float gamma_rl <- 0.95;    // Discount factor
	float epsilon <- 1.0;      // Exploration rate (starts high)
	float epsilon_min <- 0.01; // Minimum exploration rate
	float epsilon_decay <- 0.995; // Decay factor per episode
	
	init {
		ask world_cell grid_at {food_x, food_y} {
			is_food <- true;
		}
		loop pos over: obstacle_positions {
			ask world_cell grid_at pos {
				is_obstacle <- true;
			}
		}
		create forager number: 1 {
			my_cell <- world_cell grid_at {0, 0};
			location <- my_cell.location;
		}
	}
	
	reflex manage_episode {
		step_count <- step_count + 1;
		if (food_found or step_count >= max_steps_per_episode) {
			do end_episode;
		}
	}
	
	action end_episode {
		episode <- episode + 1;
		last_episode_reward <- episode_reward;
		write "Ep " + episode + " | Steps: " + step_count 
		      + " | Reward: " + round(episode_reward) 
		      + " | Eps: " + (epsilon with_precision 3)
		      + " | Q-size: " + length(forager[0].q_table);
		
		episode_reward <- 0.0;
		step_count <- 0;
		food_found <- false;
		
		// Decay epsilon
		if (epsilon > epsilon_min) {
			epsilon <- epsilon * epsilon_decay;
		}
		
		ask world_cell grid_at {food_x, food_y} {
			is_food <- true;
		}
		ask forager[0] {
			my_cell <- world_cell grid_at {0, 0};
			location <- my_cell.location;
		}
	}
}

grid world_cell width: 10 height: 10 neighbors: 4 {
	bool is_food <- false;
	bool is_obstacle <- false;
	rgb color <- #white update: is_obstacle ? rgb(60, 60, 60) : #white;
}

species forager {
	world_cell my_cell;
	
	map<string, float> q_table;
	list<string> action_list <- ["up", "right", "down", "left"];
	
	string get_state {
		return string(my_cell.grid_x) + "_" + string(my_cell.grid_y);
	}
	
	float get_q (string s, string a) {
		string key <- s + "::" + a;
		if (q_table contains_key key) {
			return float(q_table[key]);
		}
		return 0.0;
	}
	
	// === NEW: Epsilon-greedy action selection ===
	string choose_action {
		string s <- get_state();
		// With probability epsilon, explore (random action)
		if (flip(epsilon)) {
			return action_list[rnd(3)];
		}
		// Otherwise, exploit (pick best known action)
		string best_action <- action_list[0];
		float best_q <- get_q(s, action_list[0]);
		loop a over: action_list {
			float q_val <- get_q(s, a);
			if (q_val > best_q) {
				best_q <- q_val;
				best_action <- a;
			}
		}
		return best_action;
	}
	
	// === NEW: Q-value update using Bellman equation ===
	action update_q_value (string s, string a, float r, string s_next) {
		float old_q <- get_q(s, a);
		// Find max Q-value for the next state
		float max_next_q <- get_q(s_next, action_list[0]);
		loop act over: action_list {
			float q_val <- get_q(s_next, act);
			if (q_val > max_next_q) {
				max_next_q <- q_val;
			}
		}
		// Bellman equation: Q(s,a) = Q(s,a) + α * [r + γ * max Q(s',a') - Q(s,a)]
		float new_q <- old_q + alpha * (r + gamma_rl * max_next_q - old_q);
		q_table[s + "::" + a] <- new_q;
	}
	
	// === MODIFIED: Use Q-Learning instead of random walk ===
	reflex act {
		string current_state <- get_state();
		string action_taken <- choose_action();
		
		// Translate action string to grid movement
		int new_x <- my_cell.grid_x;
		int new_y <- my_cell.grid_y;
		
		switch action_taken {
			match "up"    { new_y <- new_y - 1; }
			match "right" { new_x <- new_x + 1; }
			match "down"  { new_y <- new_y + 1; }
			match "left"  { new_x <- new_x - 1; }
		}
		
		float step_reward <- -1.0;
		
		if (new_x >= 0 and new_x < grid_size and new_y >= 0 and new_y < grid_size) {
			world_cell target <- world_cell grid_at {new_x, new_y};
			if (not target.is_obstacle) {
				my_cell <- target;
				location <- my_cell.location;
				if (my_cell.is_food) {
					my_cell.is_food <- false;
					step_reward <- 100.0;
					food_found <- true;
				}
			} else {
				step_reward <- -5.0;
			}
		} else {
			step_reward <- -5.0;
		}
		
		// Q-Learning update
		string new_state <- get_state();
		do update_q_value(current_state, action_taken, step_reward, new_state);
		
		episode_reward <- episode_reward + step_reward;
	}
	
	aspect default {
		draw circle(0.8) color: #blue;
	}
}

experiment smart_forager type: gui {
	parameter "Learning Rate (α)" var: alpha min: 0.01 max: 1.0 category: "RL";
	parameter "Discount Factor (γ)" var: gamma_rl min: 0.0 max: 1.0 category: "RL";
	parameter "Initial Epsilon (ε)" var: epsilon min: 0.0 max: 1.0 category: "RL";
	parameter "Epsilon Decay" var: epsilon_decay min: 0.9 max: 1.0 category: "RL";
	parameter "Max steps per episode" var: max_steps_per_episode min: 50 max: 1000 category: "Simulation";
	
	output {
		display "Grid World" {
			grid world_cell border: #lightgray;
			species forager;
			graphics "food" {
			ask (world_cell where (each.is_food)) {
					draw circle(5) color: rgb(50, 180, 50);
				}
			}
		}
		monitor "Episode" value: episode;
		monitor "Step" value: step_count;
		monitor "Current Reward" value: episode_reward;
		monitor "Last Episode Reward" value: last_episode_reward;
		monitor "Epsilon" value: epsilon with_precision 4;
		monitor "Q-Table Size" value: length(forager[0].q_table);
	}
}