RL with til_environment

This page introduces the basics of using the til_environment package to run your RL models.

Setup

Navigate to the til-25 repo and install the til_environment package with pip install -r requirements-dev.txt.

Usage

After installation, you should be able to import and instantiate the RL environment as follows:

from til_environment import gridworld

env = gridworld.env()

The gridworld.env() method accepts the following arguments:

  • env_wrappers: A list of wrappers for the environment.
    • If None, defaults to a set of example wrappers where the input observation dict is flattened (FlattenDictWrapper) and the past 4 output frames are stacked together (supersuit.frame_stack_v2 with stack_size=4 and stack_dim=-1).
    • If you don't want to pass in any wrappers, pass in an empty list [].
  • render_mode: One of "human", "rgb_array", or None.
    • "human" renders the environment in a pygame window.
    • "rgb_array" returns the environment as a RGB pixel array. Useful for recording videos of your agent's actions for debugging.
    • None disables rendering. Useful for training.
  • debug: Whether to log additional debug information and show a debug panel during rendering.
  • novice: If True, fixes the map layout to that used by Novice teams throughout the competition. Advanced teams should set this to False, because your trained RL agent is expected to generalize to maps not known to you ahead of time.
  • rewards_dict: Mapping of reward names to values. Useful for simple reward shaping.
  • window_size: The size of the pygame render window, in pixels. Defaults to 768.
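
As an example of the "rgb_array" option, the following sketch collects the frame returned by env.render() after each turn and writes the frames out as a GIF for offline inspection. The imageio package used here is an assumption (it is not part of til_environment); any frame writer works.

import imageio  # assumption: not part of til_environment, any frame/video writer will do

from til_environment import gridworld

env = gridworld.env(render_mode="rgb_array")
env.reset(seed=42)

frames = []
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        break
    action = env.action_space(agent).sample()  # replace with your policy
    env.step(action)
    frames.append(env.render())  # RGB pixel array of the current map state

env.close()
imageio.mimsave("episode.gif", frames)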

Running the environment

The environment is built with PettingZoo, a popular multi-agent reinforcement learning (MARL) environment library. Run it as follows:

from til_environment import gridworld

env = gridworld.env(
    env_wrappers=[],  # clear out default env wrappers
    render_mode="human",  # Render the map; not visible on Workbench
    debug=True,  # Enable debug mode
    novice=True,  # Use same map layout every time (for Novice teams only)
)
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()

    if termination or truncation:
        break
    else:
        # Insert your policy here
        action = env.action_space(agent).sample()

    env.step(action)

env.close()
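
Once you start reward shaping, it is often useful to see how much reward each agent accumulates over an episode. A small variation of the loop above does this, following the standard PettingZoo pattern of stepping finished agents with action=None so the episode runs to completion and the final rewards are still collected:

from til_environment import gridworld

env = gridworld.env(env_wrappers=[], novice=True)
env.reset(seed=42)

episode_rewards = {agent: 0.0 for agent in env.agents}

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    episode_rewards[agent] += reward  # reward accumulated since this agent last acted

    if termination or truncation:
        action = None  # finished agents must be stepped with None
    else:
        action = env.action_space(agent).sample()  # insert your policy here

    env.step(action)

env.close()
print(episode_rewards)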

Use during training

Reward shaping

For simple reward shaping, create a new rewards_dict to pass in as a parameter to gridworld.env(rewards_dict=YOUR_REWARDS_DICT).

For complex reward shaping, you may have to write a wrapper (see Writing an environment wrapper below) or your own class/function to provide additional rewards based on behaviours, actions, or state.

The RewardNames enum lists all the possible rewards_dict key names, with the following behaviours (an example of building a rewards_dict follows the list):

  • GUARD_WINS: Shared reward for all Guards if any Guard captures the Scout.
  • GUARD_CAPTURES: Reward for each capturing Guard, since multiple Guards can capture at the same time.
  • SCOUT_CAPTURED: Reward for the Scout when it is captured.
  • SCOUT_RECON: Reward for the Scout when it collects a recon point.
  • SCOUT_MISSION: Reward for the Scout when it collects a mission point.
  • WALL_COLLISION: Reward for an agent that collides with a wall.
  • AGENT_COLLIDER: Reward for an agent that collides into another agent.
  • AGENT_COLLIDEE: Reward for an agent that is collided into.
  • STATIONARY_PENALTY: Reward for an agent that takes Action.STAY.
  • GUARD_TRUNCATION: Reward for Guards if the round ends without a capture.
  • SCOUT_TRUNCATION: Reward for the Scout if the round ends without a capture.
  • GUARD_STEP: Reward for Guards each time the step counter increments.
  • SCOUT_STEP: Reward for the Scout each time the step counter increments.
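
The sketch below shows the simple case: override only the entries you care about and pass the dict to gridworld.env(). The import path of RewardNames, the assumption that unspecified entries keep their defaults, and the specific values are all assumptions for illustration; check the til_environment source for the actual enum location and default rewards.

from til_environment import gridworld
from til_environment.types import RewardNames  # assumption: adjust to wherever RewardNames lives

# Hypothetical shaping values: encourage recon/mission pickups, discourage idling and wall bumps.
my_rewards = {
    RewardNames.SCOUT_RECON: 1.0,
    RewardNames.SCOUT_MISSION: 5.0,
    RewardNames.STATIONARY_PENALTY: -0.2,
    RewardNames.WALL_COLLISION: -1.0,
}

env = gridworld.env(rewards_dict=my_rewards)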

Writing an environment wrapper

Participants may wish to modify the observations received by their agent, or the function used to calculate its reward during training. This can be achieved by passing the default environment into a custom wrapper.

Custom wrappers can be created by inheriting from BaseWrapper as follows:

import functools
from pettingzoo.utils.env import ActionType, AECEnv, AgentID, ObsType
from pettingzoo.utils.wrappers.base import BaseWrapper

class CustomWrapper(BaseWrapper[AgentID, ObsType, ActionType]):
    def __init__(
        self,
        env: AECEnv[AgentID, ObsType, ActionType],
    ):
        super().__init__(env)
        # Initialise any state your wrapper needs here.

    def reset(self, seed=None, options=None):
        super().reset(seed, options)
        # Reset any per-episode state here.

    def step(self, action: ActionType):
        super().step(action)
        # Update any per-step state here.

    def observe(self, agent: AgentID) -> ObsType | None:
        obs = super().observe(agent)
        # Modify the observation before returning it to the agent.
        return obs

    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        space = super().observation_space(agent)
        # Return a space matching whatever observe() now produces.
        return space

Wrap the environment using env = CustomWrapper(env). Alternatively, gridworld.env() accepts a list of wrapper classes through its env_wrappers argument: gridworld.env(env_wrappers=[CustomWrapper]).
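
As a concrete (hypothetical) sketch, the wrapper below layers a small exploration bonus on top of the environment's reward by overriding last(); the bonus value and the crude way observations are keyed are placeholders, not part of the official environment.

from pettingzoo.utils.env import ActionType, AECEnv, AgentID, ObsType
from pettingzoo.utils.wrappers.base import BaseWrapper
from til_environment import gridworld

class ExplorationBonusWrapper(BaseWrapper[AgentID, ObsType, ActionType]):
    """Hypothetical reward shaping: grant a small bonus for reaching novel observations."""

    def __init__(self, env: AECEnv[AgentID, ObsType, ActionType], bonus: float = 0.1):
        super().__init__(env)
        self.bonus = bonus
        self.seen: set[tuple] = set()

    def reset(self, seed=None, options=None):
        super().reset(seed, options)
        self.seen.clear()

    def last(self, observe: bool = True):
        obs, reward, termination, truncation, info = super().last(observe)
        # Key observations crudely by their string form; good enough for a sketch.
        key = (self.agent_selection, str(obs))
        if key not in self.seen:
            self.seen.add(key)
            reward += self.bonus
        return obs, reward, termination, truncation, info

env = gridworld.env(env_wrappers=[ExplorationBonusWrapper])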

For a collection of other PettingZoo environment wrappers, see PettingZoo's own Wrappers, as well as those provided by the SuperSuit package.

Observation shaping and persistence

Either write a wrapper and put your logic there, or write your own class/function that shapes the observations provided by the environment. When deploying your agent in a Docker container for submission, remember to replicate your observation shaping/persistence logic in rl/src/rl_manager.py.

For example, after your CustomWrapper logic, you will likely want to flatten the agent observation into a 1D array to make training easier. You can therefore pass both your CustomWrapper and the provided FlattenDictWrapper to the environment while training:

from pettingzoo.utils.env import ActionType, AgentID, ObsType
from pettingzoo.utils.wrappers.base import BaseWrapper
from til_environment import gridworld
from til_environment.flatten_dict import FlattenDictWrapper

class CustomWrapper(BaseWrapper[AgentID, ObsType, ActionType]):
    ...

env = gridworld.env(env_wrappers=[CustomWrapper, FlattenDictWrapper])

Then, in your rl_manager.py, you would have something like this to replicate the FlattenDictWrapper logic:

from gymnasium.spaces import flatten

class RLManager:
    def __init__(self):
        self.space = ...
        self.model = ...

    def rl(self, observation: dict[str, int | list[int]]) -> int:
        ...

        obs = flatten(self.space, observation)
        return self.model.predict(obs)
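
Note that self.space above must be the unflattened observation space that FlattenDictWrapper saw during training. One possible way to obtain it (a sketch, assuming til_environment is importable at inference time and that CustomWrapper is the same class used during training) is to rebuild the environment with every wrapper except the flattening one and read off a single agent's observation space:

from til_environment import gridworld

def get_unflattened_space():
    # CustomWrapper as defined above; apply the same wrappers used in training,
    # minus FlattenDictWrapper, so the space matches what was actually flattened.
    env = gridworld.env(env_wrappers=[CustomWrapper])
    env.reset()
    space = env.observation_space(env.possible_agents[0])
    env.close()
    return space

If til_environment is not available inside your submission container, an alternative is to pickle the space during training and load it in RLManager.__init__ instead.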