Tutorial_Env1D - sankhaMukherjee/RLalgos GitHub Wiki

This is a .md version of this Notebook

cd ../src
/home/sankha/Documents/programs/ML/RLalgos/src

1. envGym

envGym is a wrapper around the OpenAI Gym environment. It provides several methods that make working with the environment easier. In this notebook, we shall explore this wrapper.

from lib.envs import envGym
from time import sleep
import numpy as np
import torch

1.1. The Env1D() context manager

The Env1D class exposes a context manager that creates an OpenAI Gym environment within it. The first time the context manager is entered, it creates a current state, self.state, which can be reset with the self.reset() method. Env1D introduces a new input parameter N that lets you specify how many earlier observations you want to combine into a single state.

name = 'Breakout-ramNoFrameskip-v4'
with envGym.Env1D(name, showEnv=False, N=2) as env:
    print(f'Initial environment state:\n {env.state}')
    env.reset()
    print(f'Initial environment state after reset:\n {env.state}')
Initial environment state:
 deque([array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,
         0, 255,   0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8), array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,
         0, 255,   0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8), array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,
         0, 255,   0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8)], maxlen=3)
Initial environment state after reset:
 deque([array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,
         0, 255,   0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8), array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,
         0, 255,   0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8), array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,
         0, 255,   0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8)], maxlen=3)

1.2. The self.env property

The self.env property points to the underlying OpenAI Gym environment, so any of the OpenAI Gym methods can be called through it. However, it is best not to use the internal environment directly; it is much more preferable to extend this class with a new method for whatever you need.

name = 'Breakout-ramNoFrameskip-v4'
with envGym.Env1D(name, showEnv=False) as env:
    print(f"Check environment's action space: {env.env.action_space}")
    env.env.render()
    sleep(2)
    
env.env.close()
Check environment's action space: Discrete(4)
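
Anything else that the OpenAI Gym interface exposes is also reachable this way. The short sketch below is not part of the original notebook; it simply peeks at the observation space and samples a random action through the wrapped environment, using the same constructor arguments as above.

name = 'Breakout-ramNoFrameskip-v4'
with envGym.Env1D(name, showEnv=False) as env:
    # standard Gym attributes are available through the wrapped environment
    print(f'Observation space : {env.env.observation_space}')
    print(f'Sampled action    : {env.env.action_space.sample()}')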

1.3. Stepping and playing

The Env1D class provides two main methods for interacting with the environment: self.step(policy) and self.episode(policy, maxSteps). To keep this environment compatible with the envUnity environment, which can simulate more than one actor at a time, the policy is a function that should return a set of actions, one for each actor. For this reason, the result of a policy should always be a list of actions. For the envGym environment, this means a list containing a single action.
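
For illustration, a policy does not have to be random; it can wrap a network. The sketch below is not part of the notebook: the network qNet, the 256-unit input (two stacked 128-byte RAM observations when N=2), and the assumption that the policy receives one state per actor are all illustrative. The only point being made is that the policy returns a list of actions.

import torch.nn as nn

# hypothetical value network: 256 stacked RAM bytes in, 4 action values out
qNet = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 4))

def greedyPolicy(states):
    actions = []
    for s in states:                          # one state per actor (assumed)
        qVals = qNet(torch.as_tensor(s, dtype=torch.float32))
        actions.append(torch.argmax(qVals).reshape(1))
    return actions                            # always a list of actions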

1.3.1. Let us take a couple of steps

Note that Breakout takes an action from a Discrete(4) space, supplied here as a PyTorch tensor. Note also that a result is returned per actor (only one in this case). We shall specify a random action. It is also worth noting that, with N=2, the state is twice the size of the one obtained from the plain Env class, because Env1D concatenates the last N observations.

name = 'Breakout-ramNoFrameskip-v4'

policy = lambda m: [torch.randint(0, 4, (1,))]
with envGym.Env1D(name, N=2, showEnv=False) as env:
    
    print('Taking the first step')
    result = env.step(policy)
    print(result)
    
    print('\n\nTaking the second step')
    result = env.step(policy)[0]
    state, action, reward, nextState, done = result
    print(f'''
    state     : \n{state}
    nextState : \n{nextState}
    action    : {action}
    done      : {done}
    ''')
    
Taking the first step
[(array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,
         0, 255,   0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242,  63,  63,
        63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0, 255,   0,
         0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134, 198,  22,
        38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,   0,   0,
         0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,   0, 255,
         0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,   0,   0,
         5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8), 1, 0.0, array([ 63,  63,  63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0,
       255,   0,   0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134,
       198,  22,  38,  54,  70,  88,   6, 146,   0,   8,   0,   0,   0,
         0,   0,   0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   0,
         0, 255,   0, 228,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,
         0,   0,   5,   0,   0, 186, 214, 117, 246, 219, 242,  63,  63,
        63,  63,  63,  63, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 240,   0,   0, 255,   0,
         0, 240,   0,   5,   0,   0,   6,   0,  70, 182, 134, 198,  22,
        38,  54,  70,  88,   5, 146,   0,   7,   0,   0,   0,   0,   0,
         0, 241,   0, 242,   0, 242,  25, 241,   5, 242,   1,   0, 255,
         0, 227,  71,   0,   0,   0, 127,   0, 113,   0,   1,   0,   1,
         0,   0,   8,   0, 255, 255, 255, 255, 255, 255, 255,   0,   0,
         5,   0,   0, 186, 214, 117, 246, 219, 242], dtype=uint8), False)]
Taking the second step

        state     : 
    [ 63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255 255 255
     255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
     255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0 255   0
       0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70  88   6
     146   0   8   0   0   0   0   0   0 241   0 242   0 242  25 241   5 242
       0   0 255   0 228   0   0   0   0   0   0   0   0   0   0   0   0   0
       8   0 255 255 255 255 255 255 255   0   0   5   0   0 186 214 117 246
     219 242  63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255
     255 255 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192
     192 192 255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0
     255   0   0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70
      88   5 146   0   7   0   0   0   0   0   0 241   0 242   0 242  25 241
       5 242   1   0 255   0 227  71   0   0   0 127   0 113   0   1   0   1
       0   0   8   0 255 255 255 255 255 255 255   0   0   5   0   0 186 214
     117 246 219 242]
        nextState : 
    [ 63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255 255 255
     255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
     255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0 255   0
       0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70  88   5
     146   0   7   0   0   0   0   0   0 241   0 242   0 242  25 241   5 242
       1   0 255   0 227  71   0   0   0 127   0 113   0   1   0   1   0   0
       8   0 255 255 255 255 255 255 255   0   0   5   0   0 186 214 117 246
     219 242  63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255
     255 255 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192
     192 192 255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0
     255   0   0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70
      82   4 146   0   6   0   0   0   0   0   0 241   0 242   0 242  25 241
       5 242   2   0 255   0 226  70   0   0   0 126   0 114   0   1   0   1
       0   0   8   0 255 255 255 255 255 255 255   0   0   5   0   0 186 214
     117 246 219 242]
        action    : 2
        done      : False
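
As a quick sanity check (not in the original notebook), the stacked state really is N times the size of a single 128-byte Atari RAM observation, so with N=2 it has 256 entries.

name = 'Breakout-ramNoFrameskip-v4'
policy = lambda m: [torch.randint(0, 4, (1,))]

with envGym.Env1D(name, N=2, showEnv=False) as env:
    state, action, reward, nextState, done = env.step(policy)[0]
    # the Atari RAM observation is 128 bytes, so N=2 stacks 256 values
    print(len(state), 2 * 128)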

1.3.2. Let us play an entire episode

As before, Breakout takes an action from a Discrete(4) space supplied as a PyTorch tensor, and a result is returned per actor (only one in this case). We shall again specify a random action, this time playing an episode of at most 10 steps.

name = 'Breakout-ramNoFrameskip-v4'

policy = lambda m: [torch.randint(0, 4, (1,))]
with envGym.Env1D(name, N=2, showEnv=False) as env:
    
    result = env.episode(policy, 10)[0]
    for r in result:
        state, action, reward, nextState, done = r
        print(f'reward = {reward}, action = {action}, done = {done}')
        print((nextState - state).sum())
    
    print(f'final state:\n{state}')
    print(f'final nextState-state:\n{nextState-state}')
reward = 0.0, action = 3, done = False
772
reward = 0.0, action = 1, done = False
2360
reward = 0.0, action = 2, done = False
3116
reward = 0.0, action = 3, done = False
2559
reward = 0.0, action = 2, done = False
2558
reward = 0.0, action = 2, done = False
2807
reward = 0.0, action = 1, done = False
2057
reward = 0.0, action = 1, done = False
1799
reward = 0.0, action = 2, done = False
2046
reward = 0.0, action = 2, done = False
2045
final state:
[ 63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
 255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0 255   0
   0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70  82   4
 150   0   2   1   0   0   0   0   0 241   0 242   0 242  25 241   5 242
   8   0 255   0 220  65   0   0   0 135   0 119   0   1   0 255   0   0
   8   0 255 255 255 255 255 255 255   0   0   4   0   0 186 214 117 246
 219 242  63  63  63  63  63  63 255 255 255 255 255 255 255 255 255 255
 255 255 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192
 192 192 255 255 255 255 255 255 255 255 255 255 255 255 255 240   0   0
 255   0   0 240   0   5   0   0   6   0  70 182 134 198  22  38  54  70
  76   3 151   0   6   2   0   0   0   0   0 241   0 242   0 242  25 241
   5 242   9   0 255   0 219  64   0   0   0 136   0 120   0   1   0 255
   0   0   8   0 255 255 255 255 255 255 255   0   0   4   0   0 186 214
 117 246 219 242]
final nextState-state:
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 250 255
   1   0   4   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   1   0   0   0 255 255   0   0   0   1   0   1   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
 252 255   4   0 252   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   1   0   0   0 255   0   0   0   0   1   0   1   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0]
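
The per-step tuples returned by episode() can be turned into training data directly. The sketch below is not part of the original notebook: it sums the rewards of the (single) actor and stacks the stored states into a tensor, for example to feed a replay buffer or a network; the names and usage are only suggestions.

name = 'Breakout-ramNoFrameskip-v4'
policy = lambda m: [torch.randint(0, 4, (1,))]

with envGym.Env1D(name, N=2, showEnv=False) as env:
    result = env.episode(policy, 10)[0]

    # total reward collected by the single actor over the episode
    totalReward = sum(r for _, _, r, _, _ in result)

    # stack the stored states into a float tensor, e.g. for training a network
    states = torch.as_tensor(np.array([s for s, a, r, ns, d in result]), dtype=torch.float32)

    print(f'total reward = {totalReward}, states tensor shape = {states.shape}')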