Experience to Policy Pipeline

Experience to Policy Framework

To facilitate the development of policies from the experience database, a proper pipeline, supported by a framework, must be realised.

In a literature search, we found a paper[1] that describes a possible reinforcement learning pipeline in depth. We took several elements from this research, in particular the prioritised replay buffer, for use in our own version of the pipeline. The result has been worked out into the diagram visible below.

VERSION 1:

https://github.com/HU-ICT-LAB/RobotWars/blob/feature/91-workflow/wiki_pictures/workflow_diagram.png
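The prioritised replay buffer is the main element we took from [1]: transitions are sampled with a probability proportional to their last observed loss, and the priorities are updated after every training step. Below is a minimal sketch of that idea; the class, the field names and the priority exponent alpha are illustrative assumptions, not part of the final implementation.

import random
from dataclasses import dataclass, field

@dataclass
class PrioritisedReplayBuffer:
	"Minimal proportional prioritised replay buffer (sketch only)."
	alpha: float = 0.6  # how strongly priorities influence sampling
	transitions: list = field(default_factory=list)
	priorities: list = field(default_factory=list)

	def add(self, transition, priority=1.0):
		# New transitions start with a high priority so they are sampled at least once.
		self.transitions.append(transition)
		self.priorities.append(priority)

	def sample(self, batch_size):
		# Sampling probability is proportional to priority ** alpha.
		weights = [p ** self.alpha for p in self.priorities]
		indices = random.choices(range(len(self.transitions)), weights=weights, k=batch_size)
		return indices, [self.transitions[i] for i in indices]

	def update_priorities(self, indices, losses):
		# After a training step, the loss of each sampled transition becomes its new priority.
		for i, loss in zip(indices, losses):
			self.priorities[i] = abs(loss) + 1e-6  # small constant keeps every priority above zero

The indices returned by sample are used to write the new losses back with update_priorities, which corresponds to the "set loss in current row" step in the pseudocode further down.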

Later on during the project, we opted to use Stable Baselines 3, which provides ready-made classes for the chosen algorithm (SAC) and other functionality that allowed us to set up the pipeline more easily. The diagram has been updated accordingly and is visible below.

VERSION 2:

https://github.com/HU-ICT-LAB/RobotWars/blob/feature/91-workflow/wiki_pictures/pipeline_diagram_v2.png
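As an illustration of what Stable Baselines 3 provides, the snippet below trains an SAC policy on a placeholder Gym environment. The environment id, the number of timesteps and the file name are placeholders and do not reflect the actual RobotWars setup.

import gym
from stable_baselines3 import SAC

# Placeholder environment; in the real pipeline this is the robot (simulation) environment.
env = gym.make("Pendulum-v1")

# Stable Baselines 3 supplies the SAC implementation, including its own replay buffer.
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

# The trained policy can then be saved and handed to input_policy() in the pseudocode below.
model.save("sac_policy_v1")

Stable Baselines 3's off-policy algorithms also accept a replay_buffer_class argument, which is the natural place to plug in a prioritised buffer such as the one sketched above.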

With the pipeline realised, conceptual pseudocode was written for the framework that will power it.

Pseudocode Training Pipeline

# AI-Hub functions

def data_selection(samplesize: int, seednum=None):
	"Obtain a set of randomly chosen rows from the experience dataset"
	connect with AWS MySQL
	if seed was set:
		set the random generator in AWS MySQL to the seed.
	request samplesize rows from the designated experience table in AWS MySQL, ordered randomly.
	apply selections  # data filtering, static and inside this function (for now)
	return rows in list

def data_evaluation(rewardfunc: callable, data: list):
	"Assigns a reward to each row obtained from data selection using a reward function for the state the row has"
	copy data into new var, data2
	map rewardfunc for each row in data2
	return data2

def training_phase_one(evaldata: list, policy: object, lossfunction: callable):
	"Training phase which does one episode to establish initial prioritised replay buffer and model weights"
	for each row in prioritisedevaldata:
		get actionoutput from policy using current row state
		step in the environment using state and action, get new state and reward
		get loss by calling lossfunction
		set loss in current row with loss obtained by lossfunction
		apply backpropagation on policy with loss
	return policy and prioritisedevaldata
	
def training_phase_two(prioritisedevaldata: list, policy: object, lossfunction: callable, episodes: int):
	"Training phase which keeps iterating multiple times."
	for each episode:
		for each row in prioritisedevaldata:
			get actionoutput from policy using current row state
			step in the environment using state and action, get new state and reward
			get loss by calling lossfunction
			set loss in current row with loss obtained by lossfunction
			apply backpropagation on policy with loss
	return policy and prioritisedevaldata
	
def input_policy(policy: object):
	"Writes a given policy to the AWS MySQL data storage"
	connect to AWS MySQL
	write policy into designated table while incrementing the version number
	
def send_policy():
	"Returns the newest policy with the highest version number"
	connect to AWS MySQL
	request policy with highest version number
	return policy

The implementation of the pseudocode may change in future user stories.
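
As a rough illustration of how the database-facing functions might look, the sketch below implements data_selection, input_policy and send_policy against a MySQL instance using the mysql-connector-python package. The connection details, the table names (experiences, policies) and the column names are assumptions and do not reflect the actual schema.

import pickle
import mysql.connector

def connect():
	# Connection details are placeholders for the AWS MySQL instance.
	return mysql.connector.connect(
		host="aws-host", user="user", password="password", database="robotwars")

def data_selection(samplesize: int, seednum: int = None) -> list:
	"Obtain a set of randomly chosen rows from the (assumed) experience table."
	connection = connect()
	cursor = connection.cursor(dictionary=True)
	if seednum is not None:
		# Seeding RAND() makes the random ordering reproducible.
		cursor.execute("SELECT * FROM experiences ORDER BY RAND(%s) LIMIT %s", (seednum, samplesize))
	else:
		cursor.execute("SELECT * FROM experiences ORDER BY RAND() LIMIT %s", (samplesize,))
	rows = cursor.fetchall()
	connection.close()
	return rows

def input_policy(policy: object) -> None:
	"Write a pickled policy to the policy table under the next version number."
	connection = connect()
	cursor = connection.cursor()
	cursor.execute("SELECT COALESCE(MAX(version), 0) + 1 FROM policies")
	version = cursor.fetchone()[0]
	cursor.execute(
		"INSERT INTO policies (version, data) VALUES (%s, %s)",
		(version, pickle.dumps(policy)))
	connection.commit()
	connection.close()

def send_policy() -> object:
	"Return the policy with the highest version number."
	connection = connect()
	cursor = connection.cursor()
	cursor.execute("SELECT data FROM policies ORDER BY version DESC LIMIT 1")
	policy = pickle.loads(cursor.fetchone()[0])
	connection.close()
	return policy

In practice, a policy trained with Stable Baselines 3 would be serialised with model.save() and the resulting file (or its bytes) stored, rather than a raw pickle.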

Sources

[1]: Gur, Y., Yang, D., Stalschus, F., & Reinwald, B. (2021). Adaptive Multi-Model Reinforcement Learning for Online Database Tuning. In EDBT (pp. 439-444).

Related Issues

Issues: #91