Reinforcement Learning approach - archetag/smart-water-quality-monitoring GitHub Wiki

Wiki: Reinforcement Learning approach

All paths calculated with previous algorithms offer solutions, but not necessarily the optimal ones. Further approaches such as Q-Learning and Deep Q-Learning algorithms could provide more interesting long-term strategies for the boat.

Table of contents

1. Q-Learning principles
2. Reinforcement learning applied to an autonomous sailboat
3. Improvement of the algorithm

Material needed

  • Python 3.6
  • Modules used: matplotlib, scipy, numpy, math

You can use Python 2.7, but some characters (Greek letters) may not be recognized by this version.

1. Q-Learning principles

The choice of Reinforcement Learning

Main Machine Learning algorithms can be divided into three major groups, depending on the task to perform:

  • Supervised Learning (task driven): for classification into groups and prediction of future values.
  • Unsupervised Learning (data driven): for indirect identification of clusters or patterns.
  • Reinforcement Learning (learn from mistakes and reward): to understand how to act in a given environment.

In path planning problems, a route can naturally be scored (by its length or by obstacle collisions), the optimal path is not known in advance, and deriving general rules for building a good path in an unknown environment, with only directions at one's disposal, would be complicated to implement. These reasons naturally lead to Reinforcement Learning.

For convenience, algorithms based on Q-Learning and Deep Q-Learning will be used.

Q-Learning philosophy

Reinforcement learning involves an agent, a set S of states, and a set A of actions per state. By performing an action a ∈ A, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). Q refers to the function that returns this reward and can be said to stand for the quality of an action taken in a given state.

For a finite Markov decision process (here, a finite set of possible positions), Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over all successive steps, starting from the current state.
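Written with the usual notation (a standard reminder rather than a formula taken from the paper; γ is the discount factor introduced in the next subsection), the quantity being maximized is:

```latex
% Expected discounted return maximized by the optimal policy,
% starting from state s and action a at time t (requires amsmath/amssymb)
Q^{*}(s, a) = \max_{\pi} \, \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s,\ a_t = a \right]
```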

Q-Learning algorithm

When Q-learning is performed, we create what is called a Q-table, a matrix of shape [state, action], and we initialize its values arbitrarily. The Q-values are then updated and stored after each episode. This Q-table becomes a reference table for our agent to select the best action based on the Q-value.
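For illustration, a Q-table of this shape can be created as follows (a minimal sketch; the map size, and therefore n_states, is an assumption, not a value from the project scripts):

```python
import numpy as np

n_states = 36   # e.g. a 6x6 discretized map, one state per cell (assumption)
n_actions = 8   # the 8 possible moves on the grid

# Q-table of shape [state, action], here initialized to zeros
# (other initializations are discussed in Part 3)
Q = np.zeros((n_states, n_actions))
```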

The training also relies on two parameters:

  • ε, the random explorer (tuned either to avoid loops and encourage exploration, or to encourage exploitation of correct behaviours)
  • e, the horizon (total number of training episodes)

At each step, the Q-Table is updated according to the following Bellman equation:

  Q(s, a) ← Q(s, a) + α · [ R(s, a) + γ · max_a' Q(s', a') − Q(s, a) ]

where:

  • α is the learning rate (how important the new reward is compared to the previous estimate)
  • γ is the discount factor (to balance immediate and future rewards)

The better these hyperparameters (ε, e, α, γ) are calibrated, the more efficient the algorithm.
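To make the update concrete, here is a minimal Q-learning loop (a sketch only, not the project's QLearning.py: the `step` environment function returning `(next_state, reward, done)` is a hypothetical helper, and the default hyperparameter values simply reuse those quoted later on this page):

```python
import numpy as np

def train(Q, step, start_state, alpha=0.8, gamma=0.7, epsilon=0.3, n_episodes=10000):
    """Minimal Q-learning training loop over the Q-table Q (shape [state, action])."""
    n_actions = Q.shape[1]
    for _ in range(n_episodes):
        state, done = start_state, False
        while not done:
            # epsilon-greedy choice: explore with probability epsilon
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = step(state, action)
            # Bellman update of the Q-table
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
    return Q
```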

2. Reinforcement learning applied to an autonomous sailboat

This section is inspired by the strategy proposed by A. Junior, D. Henrique dos Santos, A. Fernandes de Negreiros, J. Boas de Souza Silva and L. Gonçalves in High-Level Path Planning for an Autonomous Sailboat Robot Using Q-Learning.

Definition of main actors

  • The Q-Learning agent corresponds exactly to the same sailboat model as depicted in the Modelling for the Travelling Salesman Problem wiki, used for static algorithms. However, here, only general dynamics will be considered: three degrees of freedom in a 2D plane, waves, currents, and gravitational forces neglected, and the presence of a no-go zone.
  • The authors have chosen to represent the zone of measurement as a discretized environment, instead of a graph. This directly allows a simple strategy for avoiding the obstacles and borders (by discretizing them too). Each cell of the new environment is numbered to be associated with a unique state.
  • Representing the map as a grid restricts the number of possible actions to 8 directions. Some of these actions must be removed to acknowledge the no-go zone due to the wind.
  • Finally, the authors have imagined a reward matrix based on the inverse distance between cells of the environment. All distances are calculated for each state from the starting point (a), then inverted and shifted by the maximum distance found to obtain a score (b). In this way, the starting point has the highest reward of the grid. Cells with obstacles have a score of -1000 (a sketch of this construction is given after this list).
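Below is a minimal sketch of how such a distance-based reward matrix could be built from this description (the function name and the exact handling of the reference cell are assumptions, not the authors' code):

```python
import numpy as np

def gradient_reward_matrix(grid, ref_cell):
    """Sketch of the distance-based (gradient) reward described above.

    grid     -- 2D array, 0 for free cells, -1 for obstacles and borders
    ref_cell -- (row, col) of the reference cell (here, the starting point)
    """
    grid = np.asarray(grid)
    rows, cols = grid.shape
    # (a) Euclidean distance of every cell to the reference cell
    r_idx, c_idx = np.indices((rows, cols))
    dist = np.hypot(r_idx - ref_cell[0], c_idx - ref_cell[1])
    d_max = dist.max()
    # (b) inverse distance shifted by the maximum distance found
    with np.errstate(divide='ignore'):
        reward = np.where(dist > 0, 1.0 / dist + d_max, 0.0)
    reward[ref_cell] = reward.max() + 1.0   # the reference cell gets the top score
    # obstacles and borders keep a strongly negative reward
    reward[grid == -1] = -1000.0
    return reward
```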

With these definitions, the current objective is to build two-point paths. Other strategies should be envisaged for more points, as in the Traveling Salesman Problem.

Python Script

To implement the previous method and test configurations:

  1. Open the QLearning.py and QLearning_plot.py python scripts in the PathPlanning folder of the GitHub project.
  2. Go to the customizable area section to adjust the parameters as you wish (an illustrative example is given after this list):
    • Configure the wind parameters
    • Choose the starting and ending point (be sure they are not obstacles or borders)
    • Input the discretized map (0 for normal cells, -1 for obstacles and borders)
    • Specify the hyperparameters for the algorithm (ε, e, α, γ)
  3. Enable graphical tools for analysis.
    • Set the boolean plotEachStep to true to inspect the general behavior of the boat (e.g. whether it explores enough, or too much, during the first steps).
      Warning: the algorithm will take much more time to reach the end
    • Set the boolean plotEachStep to false to obtain graphical results only once the training is done.
    • Choose the maximum qmax and minimum qmin values to be shown on the Q-Table (to better distinguish the emerging trends). Find below an example of two Q-Tables set up with different limit values. The second one offers many more shades of values, so is better for further analysis.
  4. Choose the reward system by uncommenting either the distance-based reward system or the classic reward system.
  5. Interpret the results with the graphical tools.
    • Find below: the graphical tool at step 0 (left) and when training is done (right).
    • The map has a color code for each element: blue for the sea, gray for obstacles and borders, green for starting and ending point, and orange for the path found by the boat (available when each step is plotted). Each cell is labeled with its associated state. If an optimal solution has been found, the final path will appear on the map.
    • The Q-Table has all its values initialized to certain values (see the Different initializations section in Part 3). The abscissa represents the possible actions for the sailboat, the ordinate all the states corresponding to the different cells of the map. You can identify actions forbidden by the wind since their columns do not evolve (see above: North and North-East moves cannot be performed). Grey cells of the Q-Table generally correspond to obstacles.
    • The accumulated reward graph allows you to estimate how many episodes have been performed (abscissa) and to observe the convergence towards the solution.
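For instance, the customizable area might look like the following (only plotEachStep, qmax and qmin are named above as belonging to the scripts; all other variable names and values are illustrative assumptions):

```python
# --- customizable area (illustrative values only) ---
wind_direction = 'N'        # hypothetical name: wind blowing from the North
start, goal = 7, 28         # states of the starting and ending cells (not obstacles/borders)

# discretized map: 0 for normal cells, -1 for obstacles and borders
grid = [[-1, -1, -1, -1, -1, -1],
        [-1,  0,  0,  0,  0, -1],
        [-1,  0, -1,  0,  0, -1],
        [-1,  0,  0,  0,  0, -1],
        [-1,  0,  0,  0,  0, -1],
        [-1, -1, -1, -1, -1, -1]]

# hyperparameters
epsilon, episodes, alpha, gamma = 0.3, 10000, 0.8, 0.7

# graphical tools
plotEachStep = False        # True to inspect the boat's behaviour step by step (much slower)
qmax, qmin = 50, -50        # value range displayed on the Q-Table plot
```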

Influence of hyperparameters

The aim of the tables below is to give a general idea of the behaviors noticed. Larger versions of all images are stored in the folder pictures/machine_learning of the GitHub project.

  • Learning rate (α)

    Here are some different learning rate values for fixed values of the discount factor (0.7), random explorer (0.3) and number of episodes (10000). For convenience, the ordinate range of the reward graph remains the same across tests:

(Figure: Q-Tables and accumulated reward graphs obtained for α = 0, 0.2, 0.4, 0.6, 0.8 and 1)

From observation, the learning rate value has a direct impact on the behavior of the convergence. α must be strictly positive to observe changes. In this case, the higher the value, the quicker the convergence and the smaller the cumulative score at step 10000. Surprisingly, only small differences between the Q-Tables are observable (e.g. some obstacle cells are not always exploited), but a general trend persists and the maximum values of the Q-Table stay the same. Consequently, knowing that the cumulative score diminishes for higher values of α while the maximum values found in the Q-Table stay the same, we may suppose that when α is close to one, the distribution of rewards is more concentrated on the extremes.

  • Discount factor (γ)

Here are some different discount factor values for fixed values of the learning rate (0.8), random explorer (0.3) and number of episodes (10000):

(Figure: Q-Tables and accumulated reward graphs obtained for γ = 0, 0.2, 0.4, 0.6, 0.8 and 0.95)

From observation, the discount factor has an impact on the values stored in the Q-Table, but not on the behavior of the convergence (except when it equals 1, which leads to a divergence). The higher the discount factor, the higher the maximal values stored in the Q-Table and the higher the mean of the Q-Table values. In other words, if the discount factor is close to one, a positive reward influences a larger group of cells in the Q-Table (long-term reward strategy). Otherwise, a positive reward influences only the cell corresponding to the previous state (instant reward strategy).

3. Improvement of the algorithm

Test of the reward systems for two points

Instead of simply being a combination of neutral, positive and negative rewards, the reward matrix used in High-Level Path Planning for an Autonomous Sailboat Robot Using Q-Learning relies on a varied set of reward values, determined by a gradient-climbing-like method. Is this method more relevant than classic ones? Is it convenient for problems with more than two cities?

  • Different reward matrices

The reward matrix presented in the publication (left) seems to encourage an instant reward strategy (discount factor close to 0), whereas the classical one (right) is designed to find the shortest path by penalizing all normal cells and to encourage long-term strategies thanks to its unique positive cell (discount factor close to 1).
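For comparison, here is a minimal sketch of such a classic reward matrix (the -1 penalty and +100 bonus are arbitrary illustrative values; only the -1000 score for obstacles comes from the description above):

```python
import numpy as np

def classic_reward_matrix(grid, goal_cell):
    """Sketch of the classic reward system: penalize every normal cell,
    strongly penalize obstacles, and reward only the goal cell."""
    grid = np.asarray(grid)
    reward = np.full(grid.shape, -1.0)    # small penalty everywhere: shortest-path incentive
    reward[grid == -1] = -1000.0          # obstacles and borders
    reward[goal_cell] = 100.0             # unique positive cell at the goal
    return reward
```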

  • Different initializations

Initializing all Q-Table values to zero allows obtaining a solution with the classic reward matrix, whatever the discount factor. However, in order to use the gradient reward matrix, the Q-Table must be set up differently: all the values in the row corresponding to the goal's state must be higher than a certain value (which depends on the map and on the distance from the starting point). This can be explained by the Bellman equation that updates the Q-Table:

  Q(s, a) ← Q(s, a) + α · [ R(s, a) + γ · max_a' Q(s', a') − Q(s, a) ]

For the gradient reward matrix, just before reaching the goal, the action leading to the goal must end up with the highest Q-value of its row, i.e. R(s, a_goal) + γ · max_a' Q(s_goal, a') must outweigh the alternatives. If all Q-Table values initially equal 0, this condition reduces to a comparison of immediate rewards only, because the row Q(s_goal, ·) is never updated: the episode stops as soon as the goal is reached. Since the gradient rewards decrease towards the goal, the condition is therefore not always verified over time. This may be inconvenient for future experiments.
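A minimal sketch of such an initialization (the goal_boost value of 500 is an arbitrary placeholder; as noted above, the actual threshold depends on the map and on the distance from the starting point):

```python
import numpy as np

def init_q_table(n_states, n_actions, goal_state=None, goal_boost=0.0):
    """Zero initialization by default; optionally raise the goal's row so that
    the gradient reward matrix can still attract the agent towards the goal."""
    Q = np.zeros((n_states, n_actions))
    if goal_state is not None:
        Q[goal_state, :] = goal_boost
    return Q

# classic reward matrix: plain zeros are enough
Q_classic = init_q_table(36, 8)
# gradient reward matrix: the goal's row must exceed a map-dependent threshold
Q_gradient = init_q_table(36, 8, goal_state=28, goal_boost=500.0)
```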

  • Convergences

The two results use the following hyperparameters: ε = 0.3, episodes = 10000, α = 0.8, γ = 0.7.

Using the classic reward matrix (right) tends to make the algorithm converge more quickly. Both result in the same optimal solution for this map and configuration. However, with different hyperparameters, the gradient reward matrix (left) does not always propose a solution (because of infinite loops due to similar values in the Q-Table). By analyzing the final appearance of the Q-Tables, the values created by the "gradient" rewarding appear quite gradual and continuous, while those of the "classic" rewarding look more heterogeneous. This may reflect greater confidence in the path found compared with the other possibilities.

Further comparison between the reward systems for two points

In the High-Level Path Planning for an Autonomous Sailboat Robot Using Q-Learning paper, the authors have organized a series of different scenarios to test the robustness of their model. These scenarios can be reused to compare the two reward systems further, step by step. The same hyperparameters as described in the paper are used for each scenario. Results from the gradient reward system are shown on the left, those from the classic reward system on the right.

  • Scenarios 1a/1b (area without obstacles with fixed wind direction, for different starting and ending points)

For scenarios 1a and 1b, the two reward systems converge towards the same solutions, though the classic reward system is quicker.

  • Scenarios 2a/2b (area with obstacles with fixed wind direction, for different starting and ending points)

For scenarios 2a and 2b, with additional obstacles in the zone, the two reward systems nearly converge towards the same solutions. As described in the paper, the convergence takes longer to reach than in the scenarios without obstacles.

  • Scenarios 3a/3b/3c (area without obstacles with different wind directions, for same starting and ending points)

Concerning the last three scenarios, the two reward systems do not converge towards the same solution, but the score of each final path is the same in each case. Here, the classic reward system (right) seems to suggest paths with fewer zigzags, which could be performed more easily by the boat. However, this cannot be generalized, since this system could just as well have found the same paths as the other one (their scores being identical).

This further comparison has finally led to the same observations as the previous one (Q-Table final appearance, convergence speed). For nearly identical results, the classic reward system could replace the gradient one if convergence speed is a crucial criterion.

Test of the reward systems for three points and more

Taking into consideration more than two points on the map forces an update of the two previous reward systems. Ideally, the optimal shortest path should go through every measurement point exactly once. If the rewards are simply calculated according to each point of interest's position, this condition is not verified: the boat can return to the same point again and again to maximize its score while ignoring the other points to discover. One possible solution is to update the reward matrix after a point has been visited.

  • A dynamic reward matrix

While the Q-Learning algorithm is running, the reward matrix can be updated as follows:

0. Initialization
   * R0: all points are considered equally, so that none of them influences the solution.
1. Update during one episode
   * R = R0
   * While all points have not been visited:
     * If a point P is reached:
       * R = R(points - P)  # the reward matrix is recalculated with the remaining points
       * points = points - P  # update the set of points that remain to visit
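A possible translation of this update scheme into Python (a sketch, not the code of QLearningCities.py; `reward_matrix` and `step` are hypothetical helpers, and the default hyperparameters reuse the values reported for the gradient system below):

```python
import numpy as np

def run_episode(Q, grid, start_state, points, reward_matrix, step,
                alpha=0.7, gamma=0.3, epsilon=0.3):
    """One training episode with a dynamic reward matrix.

    reward_matrix(grid, remaining) rebuilds R from the remaining points;
    step(state, action, R) applies a move and returns (next_state, reward).
    """
    remaining = set(points)                    # points still to visit
    R = reward_matrix(grid, remaining)         # R0: all points equally considered
    state = start_state
    while remaining:                           # stop once every point has been visited
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = np.random.randint(Q.shape[1])
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward = step(state, action, R)
        # Bellman update of the Q-table
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        if next_state in remaining:            # a point of interest is reached
            remaining.discard(next_state)
            R = reward_matrix(grid, remaining) # recalculate R with the remaining points
        state = next_state
    return Q
```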

This updating method can be applied to both previous reward systems. An example of initialization with points in green (the starting position is not considered in the reward matrices):

  • Implementation
  1. Open the QLearningCities.py and the QLearning_plotCities.py python scripts in the PathPlanning folder of the GitHub project.
  2. In the `if __name__ == '__main__'` section, as for the two-point configuration, configure the parameters as you wish. This time, you can add several points in the goal variable.
  3. Choose and uncomment the reward system.
    Warning: The hyperparameters chosen for the two points configuration may not work here.
  • Analysis of the method

Whereas the static reward system systematically offered a solution for a wide range of hyperparameters, the updating strategy seems more sensitive to specific values. The classic reward system needs a low learning rate (α = 0.3) and a high discount factor (γ = 0.7), but does not regularly provide a solution. On the contrary, the gradient reward system proposed by the authors needs a higher learning rate (α = 0.7) and a lower discount factor (γ = 0.3), and frequently provides a solution for a large number of episodes. This time, the two systems do not converge exactly towards the same solution. Here are the two results (gradient on the left, classic on the right):

From observation, the classic reward system could converge less frequently because of the appearance of random noise over time. This perturbation may come from updating the reward matrix and can confuse the agent. Another source of error affects both algorithms: it is impossible to force the order in which each city is discovered. One time the path will begin at point A and finish at point B, another time it will be the opposite. With the wind taken into account in addition, the algorithms generally tend to find a "blurred path" close to the solution, with some points of interest missing from the optimal path found.

As such, the gradient reward system imagined in the paper seems more reliable for this problem. Cases with more points have been tested, but the results are less convincing and a single training takes even more time (more than 5 minutes for 10000 steps).