OpenAI Gym Environment

Setting up an OpenAI Gym Environment

OpenAI Gym answers the need for a unified environment interface in which reinforcement learning policies can be trained.

OpenAI Gym does not, however, have built-in functionality to adjust those policies on its own; the context in which policies are trained varies too much from case to case for a single consistent method. As such, we will need to look further into how exactly to apply reinforcement learning within this environment.

It does have standard rules about which variables the environment returns during training (from env.step()). These are (cited from the OpenAI Gym Docs page), with a small sketch of this contract shown after the list:

  • observation (object): an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.
  • reward (float): amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.
  • done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)
  • info (dict): diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.
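
For setting up our own Gym environment, this contract is what a custom environment has to implement. The sketch below is a minimal, hypothetical example (the class name, the spaces, and the reward/termination logic are placeholders, not the actual RobotWars design): a gym.Env subclass whose step() returns exactly these four values, using the classic Gym API.

import gym
import numpy as np

class DummyRobotEnv(gym.Env):
    """Hypothetical minimal environment illustrating the step() contract."""

    def __init__(self):
        # the agent can pick one of two discrete actions
        self.action_space = gym.spaces.Discrete(2)
        # observations are four continuous values (placeholder bounds)
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self._steps = 0

    def reset(self):
        self._steps = 0
        # return the initial observation for a new episode
        return self.observation_space.sample()

    def step(self, action):
        self._steps += 1
        observation = self.observation_space.sample()  # environment-specific observation
        reward = 1.0                                    # placeholder reward for the previous action
        done = self._steps >= 10                        # placeholder: end the episode after 10 steps
        info = {"steps": self._steps}                   # diagnostic info, not to be used for learning
        return observation, reward, done, info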

OpenAI Gym Example: CartPole-v0

To demonstrate how OpenAI Gym works, we can set up a relatively simple random-policy agent from the OpenAI Gym Docs page.

First off, install OpenAI Gym by running the following command in your terminal:

pip install gym

Once that is done, you can run the following code:

import gym

# note: this example uses the classic Gym API (gym < 0.26), where step()
# returns four values and reset() returns only the observation
env = gym.make('CartPole-v0')
for i_episode in range(100):
    observation = env.reset()          # start a new episode
    for t in range(100):
        env.render()                   # draw the current state of the simulation
        print(observation)
        action = env.action_space.sample()   # random policy: sample an arbitrary valid action
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

This should start the simulation, in which a random-policy agent (an agent that unconditionally takes a random action) tries to keep the pole upright for as long as possible. An episode terminates when the pole tips too far from vertical (about 12 degrees in CartPole-v0) or the cart leaves the track; the inner loop additionally caps each episode at 100 timesteps. During each episode, env.step() takes the action the agent should perform and returns the four variables described above.
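
To make the reward variable concrete, the loop above can be extended to accumulate the per-step rewards into an episode return. This is a minimal sketch under the same classic Gym API assumption; the episode count of 10 is arbitrary.

import gym

env = gym.make('CartPole-v0')
for i_episode in range(10):
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()              # still a random policy
        observation, reward, done, info = env.step(action)
        total_reward += reward                          # accumulate the episode return
    print("Episode {} return: {}".format(i_episode, total_reward))
env.close()

With a random policy the episode returns stay low; a learned policy would be judged by how much it increases this total reward.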

Sources

  • OpenAI Gym Documentation: https://gym.openai.com/docs/

Related issues

Issues: #38