# ForagerRL_step10
By Killian Trouillet
Training uses GAMA in headless mode — no GUI, just a WebSocket server. This is much faster than running with the display.
```
# Windows
gama-headless.bat -socket 1001

# Linux / macOS
./gama-headless.sh -socket 1001
```

Wait for the message indicating the server is ready before running the Python script.

**Port choice:** any port except `1000` (reserved for the GUI). Common choices: `1001`, `6868`, `8080`.
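If you want the training script to wait programmatically instead of watching the console, here is a small optional sketch using only the standard library (it is not part of `train_forager.py`); it simply polls the port until GAMA accepts TCP connections:

```python
import socket
import time

def wait_for_gama(host="localhost", port=1001, timeout=60):
    # Poll until the GAMA headless server accepts connections on the given port
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(1)
    raise RuntimeError(f"GAMA server not reachable on {host}:{port}")
```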
PPO (Proximal Policy Optimization) is a policy gradient algorithm. Unlike Q-Learning which learns values for state-action pairs, PPO directly learns a policy — a neural network that maps observations to actions.
| Aspect | Q-Learning (Part 1) | PPO (Part 2) |
|---|---|---|
| What it learns | A table of Q-values | A neural network policy |
| Action selection | Pick max Q-value | Sample from policy distribution |
| Exploration | ε-greedy (random with decay) | Entropy bonus (natural noise) |
| State space | Finite (needs a table) | Infinite (network generalizes) |
| Action space | Discrete only | Discrete or continuous |
| Update rule | Bellman equation | Gradient ascent on policy |
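To make the "action selection" row concrete, here is a toy illustration; the commented `q_table` line stands for the tabular approach from Part 1 (names assumed), the rest shows the sampling used in this part:

```python
import torch
from torch.distributions import Normal

# Q-Learning (Part 1): deterministic lookup of the best action for a state
# action = int(q_table[state].argmax())

# PPO (Part 2): sample a continuous action from the policy's distribution
mean = torch.tensor([0.2, -0.1])     # example policy output for one observation
std = torch.tensor([0.6, 0.6])
action = Normal(mean, std).sample()  # exploration comes from the sampling noise itself
```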
The main PPO hyperparameters:

| Parameter | Value | Meaning |
|---|---|---|
| `lr` | `3e-4` | How fast the network updates |
| `gamma` | `0.99` | Discount factor (same concept as Part 1) |
| `K_epochs` | `10` | Iterate 10 times over the collected data |
| `eps_clip` | `0.2` | PPO's key innovation: limits how much the policy can change per update |
| `ent_coef` | `0.01` | Entropy bonus that encourages exploration |
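For reference, the same values grouped as keyword arguments; this is just one plausible way to pass them, and the actual `PPOAgent` constructor in `train_forager.py` may take them differently:

```python
# One plausible grouping of the hyperparameters above (argument names assumed)
ppo_kwargs = dict(
    lr=3e-4,        # learning rate
    gamma=0.99,     # discount factor
    K_epochs=10,    # optimization passes per collected batch
    eps_clip=0.2,   # PPO clipping range
    ent_coef=0.01,  # entropy bonus weight
)
```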
We build a small neural network in PyTorch that outputs both an action (Actor) and a value estimate (Critic):
```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim=13, action_dim=2, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor_mean = nn.Linear(hidden, action_dim)
        self.actor_log_std = nn.Parameter(torch.full((action_dim,), -0.5))
        self.critic = nn.Linear(hidden, 1)
        # Small init: prevents tanh saturation early in training
        nn.init.orthogonal_(self.actor_mean.weight, gain=0.01)
        nn.init.zeros_(self.actor_mean.bias)

    def forward(self, x):
        h = self.shared(x)
        mean = torch.tanh(self.actor_mean(h))  # actions in [-1, 1]
        std = self.actor_log_std.exp().expand_as(mean)
        return mean, std, self.critic(h)
```

- Shared backbone: Two hidden layers (64 neurons, Tanh) process the observation
- Actor head: Outputs a mean action vector bounded to `[-1, 1]` by `tanh`. Small orthogonal initialization keeps outputs near 0 early in training, preventing gradient saturation
- Critic head: Outputs a single value estimate (how good is this state?)
- Normal distribution: Actions are sampled from `Normal(mean, std)`, giving smooth continuous control
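A quick sanity check of the network above (shapes only; the `Normal` distribution is the same one the agent samples from during training):

```python
import torch
from torch.distributions import Normal

net = ActorCritic()                       # defaults: state_dim=13, action_dim=2
obs = torch.zeros(1, 13)                  # dummy observation batch
mean, std, value = net(obs)
dist = Normal(mean, std)
action = dist.sample()                    # noisy action around the tanh-bounded mean
log_prob = dist.log_prob(action).sum(-1)  # joint log-probability of the 2-D action
print(action.shape, value.shape)          # torch.Size([1, 2]) torch.Size([1, 1])
```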
Each episode, we collect a trajectory (states, actions, rewards), then:
- Compute discounted returns: future reward from each step
- Compute advantages: returns − value estimates (how much better was reality vs prediction)
- Run K gradient epochs: update the network using the PPO clipped objective
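A minimal sketch of the returns and advantages computation, assuming plain Monte-Carlo returns (the actual `train_forager.py` may use a different estimator, such as GAE):

```python
import torch

def compute_returns_and_advantages(rewards, dones, values, gamma=0.99):
    # Walk the trajectory backwards: G_t = r_t + gamma * G_{t+1},
    # resetting the running return at episode boundaries.
    returns, G = [], 0.0
    for r, d in zip(reversed(rewards), reversed(dones)):
        if d:
            G = 0.0
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)
    values = torch.as_tensor([float(v) for v in values])
    # Advantage: how much better reality was than the critic's prediction
    advantages = returns - values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages
```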
The gradient epochs then optimize PPO's clipped surrogate objective:

```python
# Core PPO loss (simplified)
ratio = torch.exp(new_log_prob - old_log_prob)
surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages
loss = -torch.min(surr1, surr2) + vf_coef * value_loss - ent_coef * entropy
```

The clipping prevents the policy from changing too much in one update; that's what makes PPO stable.
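Putting these pieces together, here is a hedged sketch of one full update pass; the real `PPOAgent.update()` may differ in details (mini-batching, gradient clipping, and the value-loss weight `vf_coef`, which is an assumption not listed in the table above):

```python
import torch
from torch.distributions import Normal

def ppo_update(net, optimizer, states, actions, old_logprobs, returns, advantages,
               K_epochs=10, eps_clip=0.2, vf_coef=0.5, ent_coef=0.01):
    for _ in range(K_epochs):
        mean, std, values = net(states)
        dist = Normal(mean, std)
        new_logprobs = dist.log_prob(actions).sum(-1)
        entropy = dist.entropy().sum(-1).mean()

        # Clipped surrogate objective (same expression as the snippet above)
        ratio = torch.exp(new_logprobs - old_logprobs)
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()
        value_loss = (returns - values.squeeze(-1)).pow(2).mean()

        loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```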
We connect to the running GAMA server through `gama-gymnasium`:

```python
import gymnasium as gym
import gama_gymnasium  # registers the environment

env = gym.make(
    "gama_gymnasium_env/GamaEnv-v0",
    gaml_experiment_path="path/to/forager_gym.gaml",
    gaml_experiment_name="gym_env",
    gama_ip_address="localhost",
    gama_port=1001,
)
```

The training loop below collects transitions into the rollout buffer and triggers a PPO update every `UPDATE_EVERY` steps:

```python
# Excerpt from train_forager.py (PPOAgent, RolloutBuffer, NUM_EPISODES are defined in the script)
agent = PPOAgent(state_dim=13, action_dim=2)
buffer = RolloutBuffer()
UPDATE_EVERY = 2048
total_steps = 0

for ep in range(1, NUM_EPISODES + 1):
    obs, _ = env.reset()
    done = False
    step = 0

    while not done and step < 300:
        action, log_prob, value = agent.select_action(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        buffer.states.append(torch.FloatTensor(obs))
        buffer.actions.append(torch.FloatTensor(action))
        buffer.logprobs.append(torch.tensor(log_prob))
        buffer.values.append(torch.tensor(value))
        buffer.rewards.append(reward)
        buffer.dones.append(done)

        obs = next_obs
        step += 1
        total_steps += 1

        # PPO update every UPDATE_EVERY steps (accumulates data across episodes)
        if total_steps >= UPDATE_EVERY:
            agent.update(buffer)
            buffer.clear()
            total_steps = 0

agent.save("saved_models/ppo_forager.pth")
env.close()
```

**Why `asyncio`?** The `gama-gymnasium` library uses asynchronous I/O internally to communicate with GAMA's WebSocket server, so the `train()` function must be `async` and launched with `asyncio.run()`.
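In practice, that means the script's entry point looks roughly like this (the body stands in for the loop shown above):

```python
import asyncio

async def train():
    # build env, agent and buffer, then run the episode loop shown above
    ...

if __name__ == "__main__":
    asyncio.run(train())
```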
Run the training script:

```bash
cd models/gym
python train_forager.py
```

Typical progression:

- Ep 0-100: The forager moves randomly. Most episodes time out (reward ≈ -5).
- Ep 100-300: The forager starts approaching the food. Reward improves gradually.
- Ep 300-500: The forager reliably reaches the food. Reward ≈ 90+.
A reward plot is saved automatically to `saved_models/training_rewards.png`.
See `models/gym/train_forager.py` for the full implementation. Key components:
| Component | Purpose |
|---|---|
| `ActorCritic` | Neural network (shared backbone + actor/critic heads) |
| `RolloutBuffer` | Stores trajectory data (states, actions, rewards, etc.) |
| `PPOAgent` | Wraps the network with action selection + PPO update |
| `plot_training()` | Saves reward curves after training |
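The plotting helper is straightforward; here is a hedged sketch of what `plot_training()` might look like (only the function name and output path come from the tutorial, the rest is an assumption):

```python
import matplotlib.pyplot as plt

def plot_training(episode_rewards, path="saved_models/training_rewards.png"):
    # Raw episode rewards plus a simple moving average to show the trend
    plt.figure()
    plt.plot(episode_rewards, alpha=0.4, label="episode reward")
    window = 20
    if len(episode_rewards) >= window:
        smoothed = [sum(episode_rewards[i - window:i]) / window
                    for i in range(window, len(episode_rewards) + 1)]
        plt.plot(range(window - 1, len(episode_rewards)), smoothed,
                 label=f"{window}-episode average")
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.legend()
    plt.savefig(path)
    plt.close()
```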