Experience to Policy Framework
To facilitate the development of policies from the experience database, a proper pipeline, supported by a framework, must be realised.
In a literature search, we found a paper [1] that goes in depth on a possible reinforcement learning pipeline. We took several elements from this research, in particular the prioritised replay buffer, for use in our own version of the pipeline. The result has been processed into the diagram below.
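To make the borrowed idea concrete, below is a minimal sketch of a prioritised replay buffer: transitions with a larger priority (typically their last loss) are sampled more often. This is an illustrative stand-alone example, not the project's actual implementation; proportional sampling via `random.choices` is just one simple realisation.

```python
import random

class PrioritisedReplayBuffer:
    """Toy prioritised replay buffer with proportional sampling."""

    def __init__(self, seed=None):
        self.rows = []         # stored transitions
        self.priorities = []   # one priority per transition
        self.rng = random.Random(seed)

    def add(self, row, priority=1.0):
        self.rows.append(row)
        self.priorities.append(priority)

    def sample(self, k):
        # draw k rows with probability proportional to their priority
        return self.rng.choices(self.rows, weights=self.priorities, k=k)

    def update_priority(self, index, priority):
        # e.g. set to the new loss of that row after a training step
        self.priorities[index] = priority

# usage: the high-priority row dominates the sampled batch
buf = PrioritisedReplayBuffer(seed=0)
buf.add({"state": 0}, priority=0.1)
buf.add({"state": 1}, priority=10.0)
batch = buf.sample(5)
```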
VERSION 1:
https://github.com/HU-ICT-LAB/RobotWars/blob/feature/91-workflow/wiki_pictures/workflow_diagram.png
Later on during the project, we opted to use Stable Baselines 3 in order to obtain useful classes (namely for the chosen algorithm, SAC) and other functionality that allowed us to initiate the pipeline more easily. The diagram has been updated accordingly and is visible below.
VERSION 2:
With the pipeline design realised, a set of conceptual pseudocode was written to outline the framework that will power the pipeline.
Pseudocode Training Pipeline
```
# AI-Hub functions
def data_selection(samplesize: int, seednum=None):
    "Obtain a set of randomly chosen rows from the experience dataset"
    connect with AWS MySQL
    if seed was set:
        set the random generator in AWS to the seed
    request X rows from AWS MySQL, ordered randomly, where X is samplesize, from the designated experience table
    apply selections  # data filtering, static and in-function (for now)
    return rows as a list
```
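A possible realisation of `data_selection` is sketched below, using an in-memory SQLite database as a stand-in for the AWS MySQL experience table; the table and column names are illustrative assumptions. Seeding a `random.Random` instance locally keeps the draw reproducible without relying on server-side seeding.

```python
import random
import sqlite3

def data_selection(conn, samplesize, seednum=None):
    """Obtain a set of randomly chosen rows from the experience table."""
    rng = random.Random(seednum)  # seeded generator for reproducible draws
    ids = [row[0] for row in conn.execute("SELECT id FROM experience")]
    chosen = rng.sample(ids, min(samplesize, len(ids)))
    placeholders = ",".join("?" * len(chosen))
    cur = conn.execute(
        f"SELECT * FROM experience WHERE id IN ({placeholders})", chosen
    )
    # apply selections: static data filtering could happen here,
    # e.g. dropping rows with missing state data
    return cur.fetchall()

# usage against a toy experience table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experience (id INTEGER PRIMARY KEY, state REAL)")
conn.executemany("INSERT INTO experience VALUES (?, ?)",
                 [(i, float(i)) for i in range(20)])
rows = data_selection(conn, 5, seednum=42)
```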
```
def data_evaluation(rewardfunc: callable, data: list):
    "Assigns a reward to each row obtained from data_selection, using a reward function on the row's state"
    copy data into a new variable, data2
    map rewardfunc over each row in data2
    return data2
```
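The evaluation step amounts to copying the selection and mapping a reward function over each row's state, as in this minimal sketch. The row layout (dicts with a `"state"` key) and the toy reward function are assumptions for illustration.

```python
import copy

def data_evaluation(rewardfunc, data):
    """Assign a reward to each row using the reward function on its state."""
    data2 = copy.deepcopy(data)  # leave the original selection intact
    for row in data2:
        row["reward"] = rewardfunc(row["state"])
    return data2

# usage: a toy reward that prefers states close to zero
rows = [{"state": 0.5}, {"state": 2.0}]
scored = data_evaluation(lambda s: -abs(s), rows)
```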
```
def training_phase_one(evaldata: list, policy: object, lossfunction: callable):
    "Training phase which runs one episode to establish the initial prioritised replay buffer and model weights"
    for each row in evaldata:
        get action output from policy using the current row's state
        step in the environment using state and action, get new state and reward
        get loss by calling lossfunction
        set loss in the current row to the loss obtained from lossfunction
        apply backpropagation on policy with loss
    return policy and the now-prioritised evaldata
```
```
def training_phase_two(prioritisedevaldata: list, policy: object, lossfunction: callable, episodes: int):
    "Training phase which keeps iterating for multiple episodes"
    for each episode:
        for each row in prioritisedevaldata:
            get action output from policy using the current row's state
            step in the environment using state and action, get new state and reward
            get loss by calling lossfunction
            set loss in the current row to the loss obtained from lossfunction
            apply backpropagation on policy with loss
    return policy and prioritisedevaldata
```
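The two phases above can be sketched end to end on a toy 1-D linear "policy" (a single weight) with a hand-written squared-error loss, so the loss/priority bookkeeping and the weight update are visible without any RL library. Everything here (the target of 1.0, the learning rate, the row layout) is an illustrative assumption.

```python
def loss_fn(action, target=1.0):
    """Squared error between the policy's action and a fixed target."""
    return (action - target) ** 2

def train_phase(rows, weight, lr=0.1, episodes=1):
    """One or more passes over the rows, updating weight and per-row loss."""
    for _ in range(episodes):
        for row in rows:
            action = weight * row["state"]            # policy output
            loss = loss_fn(action)
            row["loss"] = loss                        # priority for the buffer
            grad = 2 * (action - 1.0) * row["state"]  # d(loss)/d(weight)
            weight -= lr * grad                       # "backpropagation" step
    return weight, rows

rows = [{"state": 1.0}, {"state": 0.8}]
w, buf = train_phase(rows, weight=0.0)     # phase one: single pass
w, buf = train_phase(buf, w, episodes=20)  # phase two: keep iterating
```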
```
def input_policy(policy: object):
    "Writes a given policy to the AWS MySQL data storage"
    connect to AWS MySQL
    write policy into the designated table while incrementing the version number
```
```
def send_policy():
    "Returns the newest policy, i.e. the one with the highest version number"
    connect to AWS MySQL
    request the policy with the highest version number
    return policy
```
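The versioned policy store that `input_policy` and `send_policy` describe can be sketched as follows, again with an in-memory SQLite database standing in for AWS MySQL. Storing the policy as a pickled blob and the table/column names are illustrative assumptions.

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policies (version INTEGER PRIMARY KEY, blob BLOB)")

def input_policy(policy):
    """Write a policy under the next (incremented) version number."""
    row = conn.execute("SELECT MAX(version) FROM policies").fetchone()
    version = (row[0] or 0) + 1
    conn.execute("INSERT INTO policies VALUES (?, ?)",
                 (version, pickle.dumps(policy)))
    return version

def send_policy():
    """Return the policy with the highest version number, or None."""
    row = conn.execute(
        "SELECT blob FROM policies ORDER BY version DESC LIMIT 1"
    ).fetchone()
    return pickle.loads(row[0]) if row else None

# usage: the second write becomes the newest policy
input_policy({"weights": [0.1]})
input_policy({"weights": [0.9]})
latest = send_policy()
```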
The implementation of the pseudocode may change in future user stories.
Sources
[1]: Gur, Y., Yang, D., Stalschus, F., & Reinwald, B. (2021). Adaptive Multi-Model Reinforcement Learning for Online Database Tuning. In EDBT (pp. 439-444).
Related Issues
Issues: #91