Tutorial_Env - sankhaMukherjee/RLalgos GitHub Wiki

This is a .md version of this notebook.

cd ../src
/home/sankha/Documents/programs/ML/RLalgos/src

1. envGym

envGym is a wrapper around the OpenAI Gym environment. It provides several methods that make working with the environment easier. In this notebook, we shall explore this wrapper.

from lib.envs import envGym
from time import sleep
import numpy as np
import torch

1.1. The Env() context manager

The Env class acts as a context manager that creates an OpenAI Gym environment when it is entered. On entry, the current state of the environment is stored in self.state. The state can be reset with the self.reset() method.

name = 'Breakout-ramNoFrameskip-v4'
with envGym.Env(name, showEnv=False) as env:
    print(f'Initial environment state:\n {env.state}')
    env.reset()
    print(f'Initial environment state after reset:\n {env.state}')
Initial environment state:
 [ 63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
 255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0 255   0
   0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70  88   6
 146   0   8   0   0   0   0   0   0 241   0 242   0 242  25 241   5 242
   0   0 255   0 228   0   0   0   0   0   0   0   0   0   0   0   0   0
   8   0 255 255 255 255 255 255 255   0   0   5   0   0 186 214 117 246
 219 242]
Initial environment state after reset:
 [ 63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
 255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0 255   0
   0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70  88   6
 146   0   8   0   0   0   0   0   0 241   0 242   0 242  25 241   5 242
   0   0 255   0 228   0   0   0   0   0   0   0   0   0   0   0   0   0
   8   0 255 255 255 255 255 255 255   0   0   5   0   0 186 214 117 246
 219 242]
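
For instance, here is a minimal sketch, assuming only the behaviour shown above, of resetting the environment at the start of each episode in a simple loop:

from lib.envs import envGym

name = 'Breakout-ramNoFrameskip-v4'
with envGym.Env(name, showEnv=False) as env:
    for episodeNo in range(3):
        env.reset()  # start each episode from a fresh state
        print(f'episode {episodeNo} starts from a state with {len(env.state)} RAM bytes')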

1.2. The self.env property

The self.env property points to the underlying OpenAI Gym environment. Use it to access any of the OpenAI Gym methods. However, it is best not to use the internal environment directly; it is preferable to extend this class with a new method that wraps the functionality you need. A sketch of that approach follows the example below.

name = 'Breakout-ramNoFrameskip-v4'
with envGym.Env(name, showEnv=False) as env:
    print(f'Check environments action space: {env.env.action_space}')
    env.env.render()
    sleep(2)
    
env.env.close()
Check environments action space: Discrete(4)
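
As a sketch of the approach preferred above, one could add a dedicated method that wraps the Gym call rather than reaching into env.env from user code. The class RenderableEnv and its render() method below are hypothetical and not part of the library; this assumes Env can be subclassed and that self.env exposes the standard Gym render() method.

from time import sleep
from lib.envs import envGym

class RenderableEnv(envGym.Env):
    '''Hypothetical subclass that wraps rendering in a dedicated method.'''

    def render(self):
        # delegate to the underlying OpenAI Gym environment
        return self.env.render()

name = 'Breakout-ramNoFrameskip-v4'
with RenderableEnv(name, showEnv=False) as env:
    env.render()
    sleep(2)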

1.3. Stepping and playing

The Env class provides two main methods for interacting with the environment: self.step(policy) and self.episode(policy, maxSteps). To keep this environment compatible with the envUnity environment, which can simulate more than one actor at a time, the policy is a function that returns one action per actor. For this reason, the result of a policy should always be a list of actions; for the envGym environment, this means a list containing a single action. A sketch of such a policy is shown below.
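
As an illustration of this contract, here is a hedged sketch of a non-random policy. The names greedyPolicy and qNet are hypothetical stand-ins (not part of the library), and the sketch assumes the policy is called with the per-actor state(s); the random policies used below ignore their argument entirely, so this is an assumption.

import torch

# hypothetical Q-network mapping the 128-byte RAM state to 4 action values
qNet = torch.nn.Linear(128, 4)

def greedyPolicy(states):
    '''Return a list of actions, one per actor (a single actor for envGym).'''
    # assumption: the argument is either the state itself or a list of per-actor states
    state = states[0] if isinstance(states, (list, tuple)) else states
    x = torch.as_tensor(state, dtype=torch.float32)
    return [qNet(x).argmax()]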

1.3.1. Let us take a couple of steps

Note that Breakout has a Discrete(4) action space, and that here the action is supplied as a PyTorch tensor. Note also that this returns a result per actor (only one in this case). We shall specify a random action ...

name = 'Breakout-ramNoFrameskip-v4'

policy = lambda m: [torch.randint(0, 4, (1,))]
with envGym.Env(name, showEnv=False) as env:
    
    print('Taking the first step')
    result = env.step(policy)
    print(result)
    
    print('\n\nTaking the second step')
    result = env.step(policy)[0]
    state, action, reward, nextState, done = result
    print(f'''
    state     : \n{state}
    nextState : \n{nextState}
    action    : {action}
    done      : {done}
    ''')
    
Taking the first step
[(array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,
         0, 255,   0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8), 1, 0.0, array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   5, 146,   0,   7,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   1,
         0, 255,   0, 227,  71,   0,   0,   0, 127,   0, 113,   0,   1,
         0,   1,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8), False)]


Taking the second step

    state     : 
[ 63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
 255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0 255   0
   0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70  88   5
 146   0   7   0   0   0   0   0   0 241   0 242   0 242  25 241   5 242
   1   0 255   0 227  71   0   0   0 127   0 113   0   1   0   1   0   0
   8   0 255 255 255 255 255 255 255   0   0   5   0   0 186 214 117 246
 219 242]
    nextState : 
[ 63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
 255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0 255   0
   0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70  94   4
 146   0   6   0   0   0   0   0   0 241   0 242   0 242  25 241   5 242
   2   0 255   0 226  70   0   0   0 126   0 114   0   1   0   1   0   0
   8   0 255 255 255 255 255 255 255   0   0   5   0   0 186 214 117 246
 219 242]
    action    : 3
    done      : False

1.3.2. Let us play an entire episode

Note that Breakout has a Discrete(4) action space, and that here the action is supplied as a PyTorch tensor. Note also that this returns a result per actor (only one in this case). We shall specify a random action ...

name = 'Breakout-ramNoFrameskip-v4'

policy = lambda m: [torch.randint(0, 4, (1,))]
with envGym.Env(name, showEnv=False) as env:
    
    result = env.episode(policy, 10)[0]
    for r in result:
        state, action, reward, nextState, done = r
        print(f'reward = {reward}, action = {action}, done = {done}')
    
    print(f'final state:\n{state}')
reward = 0.0, action = 0, done = False
reward = 0.0, action = 0, done = False
reward = 0.0, action = 1, done = False
reward = 0.0, action = 2, done = False
reward = 0.0, action = 3, done = False
reward = 0.0, action = 3, done = False
reward = 0.0, action = 0, done = False
reward = 0.0, action = 1, done = False
reward = 0.0, action = 2, done = False
reward = 0.0, action = 2, done = False
final state:
[ 63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
 255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0 255   0
   0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70  88   3
 141   0   6   2   0   0   0   0   0 241   0 242   0 242  25 241   5 242
   9   0 255   0 219  65   0   0   0 185   0 119   0   1   0   1   0   0
   8   0 255 255 255 255 255 255 255   0   0   4   0   0 186 214 117 246
 219 242]
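
The per-step tuples returned by self.episode() can be consumed directly. As a small sketch, using only the tuple structure shown above, the rewards of an episode can be summed into an (undiscounted) return:

name = 'Breakout-ramNoFrameskip-v4'

policy = lambda m: [torch.randint(0, 4, (1,))]
with envGym.Env(name, showEnv=False) as env:
    result = env.episode(policy, 10)[0]
    # accumulate the undiscounted return of the episode
    totalReward = sum(reward for _, _, reward, _, _ in result)
    print(f'total reward over {len(result)} steps = {totalReward}')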