
10. Headless Training with PPO

By Killian Trouillet



Starting GAMA Headless

Training uses GAMA in headless mode — no GUI, just a WebSocket server. This is much faster than running with the display.

Windows

```
gama-headless.bat -socket 1001
```

Linux / macOS

```
./gama-headless.sh -socket 1001
```

Wait for the message indicating the server is ready before running the Python script.

Port choice: Any port except 1000 (reserved for GUI). Common choices: 1001, 6868, 8080.
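
If you prefer to script the wait instead of watching the console, a small Python probe can poll the port until the server accepts connections. A minimal sketch using only the standard library (the host, port, timeout, and retry values here are illustrative):

```python
import socket
import time

def wait_for_gama(host="localhost", port=1001, timeout=60):
    """Poll the TCP port until GAMA's WebSocket server accepts connections."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True  # something is listening on the port
        except OSError:
            time.sleep(1)    # server not up yet, retry
    raise TimeoutError(f"GAMA server not reachable on {host}:{port}")

wait_for_gama()
```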


Understanding PPO

PPO (Proximal Policy Optimization) is a policy gradient algorithm. Unlike Q-Learning, which learns values for state-action pairs, PPO directly learns a policy: a neural network that maps observations to actions.

PPO vs Q-Learning

| Aspect | Q-Learning (Part 1) | PPO (Part 2) |
| --- | --- | --- |
| What it learns | A table of Q-values | A neural network policy |
| Action selection | Pick max Q-value | Sample from policy distribution |
| Exploration | ε-greedy (random with decay) | Entropy bonus (natural noise) |
| State space | Finite (needs a table) | Infinite (network generalizes) |
| Action space | Discrete only | Discrete or continuous |
| Update rule | Bellman equation | Gradient ascent on policy |
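
The action-selection difference is easy to see side by side. A schematic sketch with made-up numbers (not code from the tutorial):

```python
import torch
from torch.distributions import Normal

# Q-Learning (Part 1): pick the argmax entry of a table row
q_row = torch.tensor([0.1, 0.7, 0.2])    # Q-values for one state
q_action = int(q_row.argmax())           # deterministic: always 1

# PPO (Part 2): sample from the policy's output distribution
mean = torch.tensor([0.3, -0.1])         # actor head output
std = torch.tensor([0.6, 0.6])           # learned standard deviation
ppo_action = Normal(mean, std).sample()  # stochastic: varies each call
```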

Key PPO Hyperparameters

| Parameter | Value | Meaning |
| --- | --- | --- |
| lr | 3e-4 | How fast the network updates |
| gamma | 0.99 | Discount factor (same concept as Part 1) |
| K_epochs | 10 | Iterate 10 times over the collected data |
| eps_clip | 0.2 | PPO's key innovation: limits how much the policy can change per update |
| ent_coef | 0.01 | Entropy bonus that encourages exploration |
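
In a script, these values typically live in one place. A sketch of how the table could map onto a plain config dict (the actual train_forager.py may organize them differently):

```python
PPO_CONFIG = {
    "lr": 3e-4,        # learning rate for the Adam optimizer
    "gamma": 0.99,     # discount factor for returns
    "K_epochs": 10,    # gradient passes over each batch of collected data
    "eps_clip": 0.2,   # clipping range for the probability ratio
    "ent_coef": 0.01,  # weight of the entropy bonus
}
```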

The ActorCritic Network

We build a small neural network in PyTorch that outputs both an action (Actor) and a value estimate (Critic):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, state_dim=13, action_dim=2, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.actor_mean = nn.Linear(hidden, action_dim)
        self.actor_log_std = nn.Parameter(torch.full((action_dim,), -0.5))
        self.critic = nn.Linear(hidden, 1)

        # Small init — prevents tanh saturation early in training
        nn.init.orthogonal_(self.actor_mean.weight, gain=0.01)
        nn.init.zeros_(self.actor_mean.bias)

    def forward(self, x):
        h = self.shared(x)
        mean = torch.tanh(self.actor_mean(h))   # actions in [-1, 1]
        std = self.actor_log_std.exp().expand_as(mean)
        return mean, std, self.critic(h)
```

  • Shared backbone: Two hidden layers (64 neurons, Tanh) process the observation
  • Actor head: Outputs a mean action vector bounded to [-1, 1] by tanh. Small orthogonal initialization keeps outputs near 0 early in training, preventing gradient saturation
  • Critic head: Outputs a single value estimate (how good is this state?)
  • Normal distribution: Actions are sampled from Normal(mean, std), giving smooth continuous control (see the sketch below)
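
Putting those pieces together, sampling an action might look like the following sketch (a hypothetical select_action helper, not copied from the repository; it returns exactly what the training loop below stores in the buffer):

```python
import torch
from torch.distributions import Normal

def select_action(net, obs):
    """Sample an action from the policy and record what PPO needs later."""
    with torch.no_grad():
        mean, std, value = net(torch.as_tensor(obs, dtype=torch.float32))
        dist = Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)  # joint log-prob over action dims
    return action.numpy(), log_prob.item(), value.item()
```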

The PPO Update

Each episode, we collect a trajectory (states, actions, rewards), then:

  1. Compute discounted returns: future reward from each step
  2. Compute advantages: returns − value estimates (how much better was reality vs prediction)
  3. Run K gradient epochs: update the network using the PPO clipped objective

```python
# Core PPO loss (simplified, PyTorch-style)
ratio = torch.exp(new_log_prob - old_log_prob)
surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * advantages
loss = -torch.min(surr1, surr2).mean() + vf_coef * value_loss - ent_coef * entropy
```

The clipping prevents the policy from changing too much in one update — that's what makes PPO stable.
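
Steps 1 and 2 take only a few lines once a trajectory is collected. A minimal sketch using plain discounted returns (no GAE), assuming rewards, dones, and values are plain Python lists; the advantage normalization at the end is a common stabilizer, not something the text above specifies:

```python
import torch

def compute_returns_and_advantages(rewards, dones, values, gamma=0.99):
    """Discounted returns (step 1) and advantages = returns - values (step 2)."""
    returns, G = [], 0.0
    for r, d in zip(reversed(rewards), reversed(dones)):
        G = r + gamma * G * (1.0 - float(d))  # reset the return at episode ends
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    advantages = returns - torch.as_tensor(values, dtype=torch.float32)
    # Normalizing advantages is a common stabilizer
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return returns, advantages
```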


The Training Script

Connecting to GAMA

```python
import gymnasium as gym
import gama_gymnasium  # registers the environment

env = gym.make(
    "gama_gymnasium_env/GamaEnv-v0",
    gaml_experiment_path="path/to/forager_gym.gaml",
    gaml_experiment_name="gym_env",
    gama_ip_address="localhost",
    gama_port=1001,
)
```
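
Before training, a quick sanity check confirms the spaces line up with the network dimensions (state_dim=13, action_dim=2). Assuming continuous Box spaces, which the continuous-action setup implies:

```python
print(env.observation_space)   # expected: a 13-dimensional Box
print(env.action_space)        # expected: a 2-dimensional Box

obs, info = env.reset()
print(obs.shape)               # expected: (13,)
```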

Training Loop

```python
import torch

NUM_EPISODES = 500     # episode budget; "What to Expect" below describes ~500 episodes
UPDATE_EVERY = 2048    # PPO update batch size, in environment steps

agent = PPOAgent(state_dim=13, action_dim=2)
buffer = RolloutBuffer()

total_steps = 0
for ep in range(1, NUM_EPISODES + 1):
    obs, _ = env.reset()
    done = False
    step = 0

    while not done and step < 300:   # 300-step episode cap
        action, log_prob, value = agent.select_action(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        buffer.states.append(torch.FloatTensor(obs))
        buffer.actions.append(torch.FloatTensor(action))
        buffer.logprobs.append(torch.tensor(log_prob))
        buffer.values.append(torch.tensor(value))
        buffer.rewards.append(reward)
        buffer.dones.append(done)

        obs = next_obs
        step += 1
        total_steps += 1

        # PPO update every UPDATE_EVERY steps (accumulates data across episodes)
        if total_steps >= UPDATE_EVERY:
            agent.update(buffer)
            buffer.clear()
            total_steps = 0

agent.save("saved_models/ppo_forager.pth")
env.close()
```

Why asyncio? The gama-gymnasium library uses asynchronous I/O internally to communicate with GAMA's WebSocket server, so in the full script the training loop above lives inside an async train() function launched with asyncio.run().
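
A skeleton of that structure (the body stands in for the loop shown above):

```python
import asyncio

async def train():
    # ... build env and agent, then run the episode loop shown above ...
    pass

if __name__ == "__main__":
    asyncio.run(train())
```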


Running the Training

```
cd models/gym
python train_forager.py
```

What to Expect

  1. Ep 0-100: The forager moves randomly. Most episodes time out (reward ≈ -5).
  2. Ep 100-300: The forager starts approaching the food. Reward improves gradually.
  3. Ep 300-500: The forager reliably reaches the food. Reward ≈ 90+.

A reward plot is saved automatically to saved_models/training_rewards.png.


Complete Training Script

See models/gym/train_forager.py for the full implementation. Key components:

| Component | Purpose |
| --- | --- |
| ActorCritic | Neural network (shared backbone + actor/critic heads) |
| RolloutBuffer | Stores trajectory data (states, actions, rewards, etc.) |
| PPOAgent | Wraps the network with action selection + PPO update |
| plot_training() | Saves reward curves after training |
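
For reference, plot_training() can be as simple as the following sketch (assuming per-episode rewards were appended to a list during training; the actual implementation in the repository may differ):

```python
import matplotlib.pyplot as plt

def plot_training(episode_rewards, path="saved_models/training_rewards.png"):
    """Save the per-episode reward curve produced during training."""
    plt.figure()
    plt.plot(episode_rewards)
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.title("PPO training on the forager task")
    plt.savefig(path)
    plt.close()
```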