Reinforcement Learning Pipeline Reward Function

The problem

Within reinforcement learning, the environment an agent is trained in uses a reward function to provide feedback on the agent's actions. In our context, however, we only ever have robot and event data for a given time step. While processing this data into a suitable training format (e.g. state, action, nextstate, done), we need a method to rate the resulting transition so that a reward can be attached to it.
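
As a rough illustration, such a processed timestep could be stored as a small record to which the reward is attached once it has been computed. The class and field names below are illustrative assumptions, not taken from the actual pipeline code:

from dataclasses import dataclass
from typing import Any

@dataclass
class Transition:
    """One processed timestep; field names are illustrative only."""
    state: Any        # robot and event data at time t
    action: Any       # action the robot took at time t
    next_state: Any   # robot and event data at time t + 1
    done: bool        # whether the episode ended on this step
    reward: float = 0.0  # filled in afterwards by the reward function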

The solution

Several sources point out that there is no strict requirement for a reward function[1][2], other than that it rates whether an agent's action is beneficial or not. However, it is generally advised to fine-tune the rewards of a reward function based on how the agent behaves in the environment. As such, the reward function and events described here may change in future iterations.

Version 1

We currently have the following data:

  • Event data [Robot being hit, Robot shooting another robot, general game events]
  • Robot data [Orientation, shooting, camera feed, etc.]

One proposition for the reward function is to use the data from the current timestep (state) and the next timestep (nextstate) to generate a reward for the transition. As there are multiple environment influences (e.g. one robot attacking another), we also need to be careful when assigning a reward to each event.
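
A minimal sketch of how the two timesteps could be compared to derive per-event flags; the counter names (hits_taken, shots_fired, hits_dealt) are assumptions about what the robot and event data contain, not the pipeline's actual fields:

def derive_events(state: dict, next_state: dict) -> dict:
    # Compare two consecutive observations to decide which events
    # occurred between them; the counter names are assumed, not actual.
    return {
        "self_hit": next_state["hits_taken"] > state["hits_taken"],
        "hit_enemy": next_state["hits_dealt"] > state["hits_dealt"],
        "missed_shot": (next_state["shots_fired"] > state["shots_fired"]
                        and next_state["hits_dealt"] == state["hits_dealt"]),
    }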

The events that can currently be used are:

  • Robot getting hit [determined by event and robot data]
  • Robot successfully hitting another robot [determined by event and robot data]
  • Robot firing a shot that misses [determined by event and robot data]

Each event is rewarded as follows (in the same order as above), with a short rationale for each; the sketch after this list collects the values in one place:

  • -10, the robot must be taught to evade gunfire from enemy robots.
  • +10, successfully hitting another robot should be rewarded, as it is one of the goals of having an autonomous robot.
  • -1, we want the robot to shoot as little as possible, so firing a shot that hits nothing receives a small penalty.
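
Since the advice above is to fine-tune these values as the agent's behaviour is observed, one option is to keep them together in a single mapping; the constant name and keys below are illustrative assumptions:

# Reward values for version 1, kept in one place so they are easy to tune.
REWARDS_V1 = {
    "self_hit": -10,    # robot was hit by enemy fire
    "hit_enemy": +10,   # robot successfully hit another robot
    "missed_shot": -1,  # robot fired a shot that hit nothing
}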

Pseudocode

Using the data and goals outlined above, a function realising this end goal is detailed below. A minor abstraction is kept in the pseudocode to allow some freedom in how the state tuple is organised; in an actual implementation, the example indices should match the actual layout of the state tuple, and the rules should use proper boolean checks on those indices.

# Example positions of the relevant flags inside the state tuple; in the
# actual pipeline these must match how the state tuple is organised.
self_hit, firing, event_robot_hit = 0, 1, 2

def rewardfunction(state):
    if state[self_hit] and state[event_robot_hit]:
        return -10  # our robot was hit by enemy fire
    if state[firing] and state[event_robot_hit]:
        return +10  # our robot successfully hit another robot
    if state[firing] and not state[event_robot_hit]:
        return -1   # our robot fired and missed
    return 0        # no relevant event in this timestep
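
As a quick check of the behaviour, a few hand-made state tuples can be fed through the function; the tuple layout matches the example indices above and is an assumption, not the pipeline's actual state format:

# Layout assumed here: (self_hit, firing, event_robot_hit)
print(rewardfunction((True, False, True)))    # -10: our robot was hit
print(rewardfunction((False, True, True)))    # +10: our shot connected
print(rewardfunction((False, True, False)))   #  -1: we fired and missed
print(rewardfunction((False, False, False)))  #   0: nothing happened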

Sources

[1] How to make a reward function in reinforcement learning? (2016, January 3). Cross Validated. Retrieved January 24, 2022, from https://stats.stackexchange.com/questions/189067/how-to-make-a-reward-function-in-reinforcement-learning

[2] Samples of Reward Functions for AWS DeepRacer. (2022, January 13). LinkedIn. Retrieved January 24, 2022, from https://www.linkedin.com/pulse/samples-reward-functions-aws-deepracer-bahman-javadi