Tutorial_Env1D - sankhaMukherjee/RLalgos GitHub Wiki
This is a .md version of this Notebook.
cd ../src
/home/sankha/Documents/programs/ML/RLalgos/src
1. envGym

envGym is a wrapper around the OpenAI Gym environment. It has several methods that are useful for working with the environment. In this notebook, we shall explore this environment.
from lib.envs import envGym
from time import sleep
import numpy as np
import torch
1.1. The Env1D() context manager

The Env class exposes a context manager that allows this environment to generate an OpenAI environment within it. The first time this context manager is entered, it creates a current state, self.state, which can be reset with the self.reset() method. Env1D introduces a new input parameter, N, that allows you to specify how many earlier states you want to consider when forming a single state.
name = 'Breakout-ramNoFrameskip-v4'
with envGym.Env1D(name, showEnv=False, N=2) as env:
    print(f'Initial environment state:\n {env.state}')
    env.reset()
    print(f'Initial environment state after reset:\n {env.state}')
Initial environment state:
deque([array([ 63, 63, 63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0,
255, 0, 0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134,
198, 22, 38, 54, 70, 88, 6, 146, 0, 8, 0, 0, 0,
0, 0, 0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 0,
0, 255, 0, 228, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255,
0, 0, 5, 0, 0, 186, 214, 117, 246, 219, 242], dtype=uint8), array([ 63, 63, 63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0,
255, 0, 0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134,
198, 22, 38, 54, 70, 88, 6, 146, 0, 8, 0, 0, 0,
0, 0, 0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 0,
0, 255, 0, 228, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255,
0, 0, 5, 0, 0, 186, 214, 117, 246, 219, 242], dtype=uint8), array([ 63, 63, 63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0,
255, 0, 0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134,
198, 22, 38, 54, 70, 88, 6, 146, 0, 8, 0, 0, 0,
0, 0, 0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 0,
0, 255, 0, 228, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255,
0, 0, 5, 0, 0, 186, 214, 117, 246, 219, 242], dtype=uint8)], maxlen=3)
Initial environment state after reset:
deque([array([ 63, 63, 63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0,
255, 0, 0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134,
198, 22, 38, 54, 70, 88, 6, 146, 0, 8, 0, 0, 0,
0, 0, 0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 0,
0, 255, 0, 228, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255,
0, 0, 5, 0, 0, 186, 214, 117, 246, 219, 242], dtype=uint8), array([ 63, 63, 63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0,
255, 0, 0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134,
198, 22, 38, 54, 70, 88, 6, 146, 0, 8, 0, 0, 0,
0, 0, 0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 0,
0, 255, 0, 228, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255,
0, 0, 5, 0, 0, 186, 214, 117, 246, 219, 242], dtype=uint8), array([ 63, 63, 63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0,
255, 0, 0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134,
198, 22, 38, 54, 70, 88, 6, 146, 0, 8, 0, 0, 0,
0, 0, 0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 0,
0, 255, 0, 228, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255,
0, 0, 5, 0, 0, 186, 214, 117, 246, 219, 242], dtype=uint8)], maxlen=3)
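The deque above holds the last few raw RAM observations. Before a state is handed to an agent it is useful to have it as a single flat vector; the concatenated states shown later in this notebook are exactly such vectors. Below is a small standalone sketch (plain numpy, illustrating the idea only, not the Env1D internals):

from collections import deque
import numpy as np

N = 2
obsSize = 128                         # the Breakout RAM observation is 128 bytes

frames = deque(maxlen=N)
for _ in range(N):
    # placeholder observations standing in for environment observations
    frames.append(np.zeros(obsSize, dtype=np.uint8))

stacked = np.hstack(list(frames))     # shape (N * obsSize,) -> (256,)
print(stacked.shape)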
1.2. The self.env property

The self.env property points to the underlying OpenAI Gym environment. Use this for any of the OpenAI Gym methods. However, it is best not to use the internal environment directly; it is much more preferable to update this environment class and add a new method for the specific functionality you need.
name = 'Breakout-ramNoFrameskip-v4'
with envGym.Env1D(name, showEnv=False) as env:
    print(f'Check environments action space: {env.env.action_space}')
    env.env.render()
    sleep(2)
    env.env.close()
Check environments action space: Discrete(4)
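As a hedged illustration of the recommendation above, the rendering call could live in a method of the wrapper rather than being reached through env.env from user code. The subclass below is hypothetical and not part of the RLalgos library; it only shows the shape of such an extension:

class Env1DRenderable(envGym.Env1D):
    '''Hypothetical subclass that wraps a direct Gym call in a method.'''

    def render(self):
        # delegate to the wrapped OpenAI Gym environment
        return self.env.render()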
1.3. Stepping and playing

The Env class provides two main methods to interact with the environment: self.step(policy) and self.episode(policy, maxSteps). To keep this environment compatible with the envUnity environment, which can simulate more than a single actor at a time, the policy is a function that should return a number of actions, one for each actor. For this reason, the result of a policy should always be a list of actions. For the envGym environment, this means a list with a single action.
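For example, a learned policy would follow the same convention. The sketch below is hypothetical: the small Q-network is untrained, and the exact form of the state passed to the policy is an assumption. It only illustrates that the policy must return a list of actions:

import numpy as np
import torch
import torch.nn as nn

# a tiny, untrained Q-network: 256 stacked RAM bytes in, 4 action values out
qNet = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 4))

def greedyPolicy(state):
    # assumption: the policy receives the flattened, stacked observation
    x = torch.as_tensor(np.asarray(state, dtype=np.float32)).flatten()
    with torch.no_grad():
        action = qNet(x).argmax().reshape(1)
    return [action]        # one action per actor -> a list with a single action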
1.3.1. Let us take a couple of steps

Note that Breakout takes a Discrete(4) action, supplied as a PyTorch tensor. Note also that this returns one result per actor (only one in this case). We shall specify a random action. It is worth noting that the size of the state is twice the size that would be obtained from the Env class, because Env1D is used here with N=2.
name = 'Breakout-ramNoFrameskip-v4'
policy = lambda m: [torch.randint(0, 4, (1,))]
with envGym.Env1D(name, N=2, showEnv=False) as env:
    print('Taking the first step')
    result = env.step(policy)
    print(result)

    print('\n\nTaking the second step')
    result = env.step(policy)[0]
    state, action, reward, nextState, done = result
    print(f'''
state : \n{state}
nextState : \n{nextState}
action : {action}
done : {done}
''')
Taking the first step
[(array([ 63, 63, 63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0,
255, 0, 0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134,
198, 22, 38, 54, 70, 88, 6, 146, 0, 8, 0, 0, 0,
0, 0, 0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 0,
0, 255, 0, 228, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255,
0, 0, 5, 0, 0, 186, 214, 117, 246, 219, 242, 63, 63,
63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0, 255, 0,
0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134, 198, 22,
38, 54, 70, 88, 6, 146, 0, 8, 0, 0, 0, 0, 0,
0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 0, 0, 255,
0, 228, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255, 0, 0,
5, 0, 0, 186, 214, 117, 246, 219, 242], dtype=uint8), 1, 0.0, array([ 63, 63, 63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0,
255, 0, 0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134,
198, 22, 38, 54, 70, 88, 6, 146, 0, 8, 0, 0, 0,
0, 0, 0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 0,
0, 255, 0, 228, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255,
0, 0, 5, 0, 0, 186, 214, 117, 246, 219, 242, 63, 63,
63, 63, 63, 63, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 192, 192, 192, 192, 192, 192, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255, 240, 0, 0, 255, 0,
0, 240, 0, 5, 0, 0, 6, 0, 70, 182, 134, 198, 22,
38, 54, 70, 88, 5, 146, 0, 7, 0, 0, 0, 0, 0,
0, 241, 0, 242, 0, 242, 25, 241, 5, 242, 1, 0, 255,
0, 227, 71, 0, 0, 0, 127, 0, 113, 0, 1, 0, 1,
0, 0, 8, 0, 255, 255, 255, 255, 255, 255, 255, 0, 0,
5, 0, 0, 186, 214, 117, 246, 219, 242], dtype=uint8), False)]
Taking the second step
state :
[ 63 63 63 63 63 63 255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
255 255 255 255 255 255 255 255 255 255 255 255 255 240 0 0 255 0
0 240 0 5 0 0 6 0 70 182 134 198 22 38 54 70 88 6
146 0 8 0 0 0 0 0 0 241 0 242 0 242 25 241 5 242
0 0 255 0 228 0 0 0 0 0 0 0 0 0 0 0 0 0
8 0 255 255 255 255 255 255 255 0 0 5 0 0 186 214 117 246
219 242 63 63 63 63 63 63 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192
192 192 255 255 255 255 255 255 255 255 255 255 255 255 255 240 0 0
255 0 0 240 0 5 0 0 6 0 70 182 134 198 22 38 54 70
88 5 146 0 7 0 0 0 0 0 0 241 0 242 0 242 25 241
5 242 1 0 255 0 227 71 0 0 0 127 0 113 0 1 0 1
0 0 8 0 255 255 255 255 255 255 255 0 0 5 0 0 186 214
117 246 219 242]
nextState :
[ 63 63 63 63 63 63 255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
255 255 255 255 255 255 255 255 255 255 255 255 255 240 0 0 255 0
0 240 0 5 0 0 6 0 70 182 134 198 22 38 54 70 88 5
146 0 7 0 0 0 0 0 0 241 0 242 0 242 25 241 5 242
1 0 255 0 227 71 0 0 0 127 0 113 0 1 0 1 0 0
8 0 255 255 255 255 255 255 255 0 0 5 0 0 186 214 117 246
219 242 63 63 63 63 63 63 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192
192 192 255 255 255 255 255 255 255 255 255 255 255 255 255 240 0 0
255 0 0 240 0 5 0 0 6 0 70 182 134 198 22 38 54 70
82 4 146 0 6 0 0 0 0 0 0 241 0 242 0 242 25 241
5 242 2 0 255 0 226 70 0 0 0 126 0 114 0 1 0 1
0 0 8 0 255 255 255 255 255 255 255 0 0 5 0 0 186 214
117 246 219 242]
action : 2
done : False
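The tuples returned by self.step(policy) map directly onto the transitions usually stored in a replay buffer. Below is a hedged sketch (the buffer and the number of warm-up steps are illustrative, not part of the library), assuming step() may be called repeatedly inside the context manager:

import random
from collections import deque

name = 'Breakout-ramNoFrameskip-v4'
policy = lambda m: [torch.randint(0, 4, (1,))]
buffer = deque(maxlen=10000)

with envGym.Env1D(name, N=2, showEnv=False) as env:
    for _ in range(5):
        for tup in env.step(policy):   # one (state, action, reward, nextState, done) per actor
            buffer.append(tup)

batch = random.sample(buffer, min(len(buffer), 4))
print(f'buffer size = {len(buffer)}, sampled {len(batch)} transitions')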
1.3.2. Let us play an entire episode

Note that Breakout takes a Discrete(4) action, supplied as a PyTorch tensor. Note also that this returns a result per actor (only one in this case). We shall specify a random action ...
name = 'Breakout-ramNoFrameskip-v4'
policy = lambda m: [torch.randint(0, 4, (1,))]
with envGym.Env1D(name, N=2, showEnv=False) as env:
    result = env.episode(policy, 10)[0]
    for r in result:
        state, action, reward, nextState, done = r
        print(f'reward = {reward}, action = {action}, done = {done}')
        print((nextState - state).sum())

    print(f'final state:\n{state}')
    print(f'final nextState-state:\n{nextState-state}')
reward = 0.0, action = 3, done = False
772
reward = 0.0, action = 1, done = False
2360
reward = 0.0, action = 2, done = False
3116
reward = 0.0, action = 3, done = False
2559
reward = 0.0, action = 2, done = False
2558
reward = 0.0, action = 2, done = False
2807
reward = 0.0, action = 1, done = False
2057
reward = 0.0, action = 1, done = False
1799
reward = 0.0, action = 2, done = False
2046
reward = 0.0, action = 2, done = False
2045
final state:
[ 63 63 63 63 63 63 255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192 192 192
255 255 255 255 255 255 255 255 255 255 255 255 255 240 0 0 255 0
0 240 0 5 0 0 6 0 70 182 134 198 22 38 54 70 82 4
150 0 2 1 0 0 0 0 0 241 0 242 0 242 25 241 5 242
8 0 255 0 220 65 0 0 0 135 0 119 0 1 0 255 0 0
8 0 255 255 255 255 255 255 255 0 0 4 0 0 186 214 117 246
219 242 63 63 63 63 63 63 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255 255 255 192 192 192 192
192 192 255 255 255 255 255 255 255 255 255 255 255 255 255 240 0 0
255 0 0 240 0 5 0 0 6 0 70 182 134 198 22 38 54 70
76 3 151 0 6 2 0 0 0 0 0 241 0 242 0 242 25 241
5 242 9 0 255 0 219 64 0 0 0 136 0 120 0 1 0 255
0 0 8 0 255 255 255 255 255 255 255 0 0 4 0 0 186 214
117 246 219 242]
final nextState-state:
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 250 255
1 0 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 255 255 0 0 0 1 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
252 255 4 0 252 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 255 0 0 0 0 1 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]
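As a closing sketch, the per-actor list returned by self.episode() can be reduced to a step count and a total reward. The maxSteps value below is arbitrary:

name = 'Breakout-ramNoFrameskip-v4'
policy = lambda m: [torch.randint(0, 4, (1,))]

with envGym.Env1D(name, N=2, showEnv=False) as env:
    result = env.episode(policy, 50)[0]

totalReward = sum(r[2] for r in result)   # reward is the third element of each tuple
print(f'steps = {len(result)}, total reward = {totalReward}')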