Program - mateuslevisf/xrl-pucrio GitHub Wiki

Project execution

The flow of the project can be described as such:

  1. After the user executes the program (either through an input file or the CLI), input arguments are parsed to generate the program configuration to be run.
  2. If any parameters are missing, they are filled in with default values during parsing.
  3. Once arguments are parsed, the chosen environment and agent are instantiated, in that order, since agents may need the observation and action space shapes to be configured properly. The final program configuration (determined after parsing) is printed to the terminal window running the program.
  4. The basic RL training loop is executed. Training is periodically paused for evaluation periods: after a fixed number of steps, the agent runs a batch of evaluation episodes and the results are saved in order to gauge model/agent performance. This repeats until all training episodes are complete.
  5. If the technique being run is H-Values/Belief Map, it is executed during agent training.
  6. After the end of training, the model is saved.
  7. If the technique being run is VIPER, the program starts training the Decision Tree Agent using an imitation learning loop. An image is generated for the resulting decision tree.
  8. The program saves ending evaluation plots along with other results and ends execution.
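Steps 1–3 can be sketched with `argparse`; the flag names (`--env`, `--episodes`, `--eval-interval`) and the `parse_config` helper are illustrative, not necessarily the project's actual interface:

```python
# Hypothetical sketch of argument parsing: missing parameters fall back to
# defaults, and the final configuration is echoed to the terminal.
import argparse

def parse_config(argv=None):
    parser = argparse.ArgumentParser(description="RL explainability runner")
    parser.add_argument("--env", default="Blackjack-v1")
    parser.add_argument("--episodes", type=int, default=10000)
    parser.add_argument("--eval-interval", type=int, default=500)
    config = vars(parser.parse_args(argv))  # absent flags take their defaults
    print(config)                           # final configuration shown to the user
    return config
```

With this shape, the environment would be built from `config["env"]` first, and the agent constructed afterwards from the environment's observation and action spaces.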

Results

Every program execution generates an "evaluation results" plot:

*(Figure: `Blackjack-v1_evaluation_results` plot)*

The above plot shows the mean performance of the agent throughout training. Each value is generated by pausing training to run 50 episodes in which the agent uses the policy learned so far to select the actions it considers optimal; the mean reward obtained over that evaluation period is then recorded. The evaluation interval can be configured by the user. Ideally, the values should increase as training progresses, which can be interpreted as the agent learning what is optimal for that environment.
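One evaluation period can be sketched as below; `evaluate` and `policy` are illustrative names, not the project's actual functions. The environment is assumed to follow a simplified step interface returning `(observation, reward, done)`:

```python
# Hypothetical sketch of one evaluation period: training is paused and the
# current greedy policy is rolled out for a fixed number of episodes.
def evaluate(env, policy, n_episodes=50):
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))  # greedy action choice
            total += reward
        returns.append(total)
    # the mean over the period becomes one point on the evaluation plot
    return sum(returns) / n_episodes
```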

Running the H-Values technique generates Q-Values and H-Values table plots:

The above Q-Values plot was generated for the Blackjack environment. It represents the expected discounted reward values learned by the agent after training, for each state-action pair. In the Blackjack environment, the state is defined by the value of the player's hand and the value of the dealer's hand (that is, the card the dealer is showing). Each subplot corresponds to one action and to whether a usable ace is available. For more information about the Blackjack environment, see the Gymnasium documentation linked in the "Description" section.
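A sketch of how a tabular agent's Q-values can be arranged into the grids such a plot displays, one grid per (action, usable ace) combination, with player hand value as rows and dealer card as columns. The `q_grids` helper and the Q-table layout are assumptions for illustration; in Gymnasium's `Blackjack-v1`, action 0 is "stick" and action 1 is "hit":

```python
# Hypothetical sketch: reshape a flat Q-table keyed by ((player, dealer,
# usable_ace), action) into one 2D grid per (action, usable_ace) subplot.
def q_grids(q_table, actions=(0, 1)):  # 0 = stick, 1 = hit in Blackjack-v1
    grids = {}
    for action in actions:
        for usable_ace in (False, True):
            # rows: player hand value 4..21, cols: dealer showing card 1..10
            grids[(action, usable_ace)] = [
                [q_table[((player, dealer, usable_ace), action)]
                 for dealer in range(1, 11)]
                for player in range(4, 22)
            ]
    return grids
```

Each grid can then be rendered as a heatmap (e.g. with `matplotlib.pyplot.imshow`) to reproduce the subplot layout described above.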

The H-Values plot is specific to the Blackjack environment; it represents what the agent expects to happen after a given state and action. A high h-value for a state means the agent expects to encounter that state in later runs or after taking an action. For more information, see the Belief/H-Values paper.
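One simple way to track "expected future occurrence" of states is a discounted visitation estimate accumulated along each trajectory. The sketch below is a generic illustration of that idea only, NOT the exact update from the Belief/H-Values paper; `update_h_values` is a hypothetical name:

```python
# Illustrative only: states visited earlier and more often accumulate higher
# belief values; gamma discounts states reached later in the episode.
def update_h_values(h, trajectory, gamma=0.9):
    """h maps state -> accumulated discounted visitation estimate."""
    for t, state in enumerate(trajectory):
        h[state] += gamma ** t
    return h
```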

Running the VIPER technique generates an image of the resulting Decision Tree as a .png file. The Decision Tree generated by the VIPER technique is learned by applying imitation learning techniques on a trained RL agent; the DT generated by that process aims to have a similar policy to the original agent while being more easily interpretable. See the VIPER paper for more information.
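The imitation learning loop can be sketched in a DAgger-like form with scikit-learn: roll out a policy, label the visited states with the trained RL agent's (the "expert's") actions, and fit a decision tree on the aggregated dataset. All names here (`train_tree_policy`, `rollout_fn`) are illustrative, and the full VIPER algorithm additionally resamples states weighted by Q-value differences, which this sketch omits:

```python
# Hedged sketch of a VIPER-style imitation loop (simplified: no Q-value
# weighted resampling). The expert is the trained RL agent's policy.
from sklearn.tree import DecisionTreeClassifier

def train_tree_policy(rollout_fn, expert_policy, iterations=5, max_depth=4):
    states, actions = [], []
    tree = None
    for _ in range(iterations):
        # collect states by following the current policy (expert on iteration 0)
        policy = expert_policy if tree is None else (lambda s: tree.predict([s])[0])
        for s in rollout_fn(policy):
            states.append(s)
            actions.append(expert_policy(s))   # always label with the expert
        # refit the tree on the aggregated dataset
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(states, actions)
    return tree
```

The resulting tree can be rendered as an image with scikit-learn's `plot_tree` or `export_graphviz`, which would match the .png output described above.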

DQN-based agents are saved as .pyt (PyTorch) files.