Deep Q Network

This project was developed based on the original DQN paper. We have implemented both experience replay and the iterative target update, but the agent does not learn from raw pixels as the paper's agent does; instead, the state is the vector of values returned by the environment (a 37-dimensional vector), which allowed us to train an MLP (Multi-Layer Perceptron) on a CPU in a reasonable amount of time.

Our DQN is an MLP with two hidden layers: the first receives the state vector and has 64 neurons, the second also has 64 neurons, and the output layer produces one value per available action. The network plays the role of a Q-Table for this task: given a state, it estimates the value of every action, and the best action is the one with the highest value. A neural network is typically used when the states are not discrete (or cannot be discretised without producing a huge domain). We also use an epsilon-greedy algorithm to choose actions: with probability epsilon we pick a random action, otherwise we take the best action predicted by the DQN. This forces the algorithm to explore the domain and can therefore lead to better results in the long term.
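As a rough illustration, the network and the epsilon-greedy action selection described above could be sketched as follows in PyTorch. Only the 37-dimensional state and the two 64-neuron hidden layers come from the description; the `QNetwork` and `act` names, the ReLU activations, and the default of 4 actions are assumptions for this sketch, not code taken from the repository.

```python
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """MLP with two 64-unit hidden layers mapping a state to one value per action."""
    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.action_size = action_size  # 4 actions is an assumption for this sketch
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size),
        )

    def forward(self, state):
        return self.net(state)

def act(network, state, eps):
    """Epsilon-greedy: random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.randrange(network.action_size)
    with torch.no_grad():
        q_values = network(torch.from_numpy(state).float().unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```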

The training method uses the Unity Environment to obtain the current state and the next state (after taking an action), and it uses the reward and next state produced by the chosen action to learn which actions should be performed. To backpropagate through the action values, we need two separate networks: a baseline (target) network, against which we compare our improvements, and the network that is actually trained. We use an iterative update that adjusts the action values towards target values that are only periodically updated (that is, we only copy the trained weights into the target network periodically).
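A minimal sketch of one such iterative update, assuming PyTorch and hypothetical names (`q_local` for the trainable network, `q_target` for the periodically updated copy, `gamma` for the discount factor), might look like this:

```python
import torch
import torch.nn.functional as F

def td_learn_step(q_local, q_target, optimizer, batch, gamma=0.99):
    """One update: move Q_local(s, a) towards r + gamma * max_a' Q_target(s', a')."""
    states, actions, rewards, next_states, dones = batch

    # Target action values come from the (frozen) target network.
    with torch.no_grad():
        q_next = q_target(next_states).max(dim=1, keepdim=True)[0]
        targets = rewards + gamma * q_next * (1 - dones)

    # Current estimates for the actions that were actually taken.
    expected = q_local(states).gather(1, actions)

    loss = F.mse_loss(expected, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```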

To avoid correlations in the observation sequence, we also use experience replay, which randomizes over the data (across episodes), thereby removing those correlations and smoothing over changes in the data distribution.
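A minimal sketch of such a buffer, assuming the tuples are kept in a Python deque and sampled uniformly at random (the class mirrors the ReplayBuffer mentioned below, but the exact fields, defaults, and the `done` flag are our assumptions):

```python
import random
from collections import deque, namedtuple

import numpy as np
import torch

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly at random."""
    def __init__(self, buffer_size=int(1e5), batch_size=64):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        batch = random.sample(self.memory, k=self.batch_size)
        states = torch.tensor(np.vstack([e.state for e in batch]), dtype=torch.float32)
        actions = torch.tensor(np.vstack([e.action for e in batch]), dtype=torch.int64)
        rewards = torch.tensor(np.vstack([e.reward for e in batch]), dtype=torch.float32)
        next_states = torch.tensor(np.vstack([e.next_state for e in batch]), dtype=torch.float32)
        dones = torch.tensor(np.vstack([e.done for e in batch]).astype(np.float32))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)
```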

Both of the methods mentioned above are fully described in the DQN paper.

Deep Q-Learning

The Deep Q-Learning algorithm is based on Q-Learning, but instead of a static Q-Table it uses a neural network (in our case a Multi-Layer Perceptron). The Q-Table is represented by two structurally identical MLPs: one used as the target (holding the current best action values for a given state) and a second one, named local, which is the MLP that is actually trained (the one whose weights are updated). After a few episodes (the number of episodes is a hyperparameter), the target network receives the local network's weights.

The training happens as follows:

  1. First, we generate many different (state, action, reward, next_state) tuples and store them in a class called ReplayBuffer. This class is responsible for handling experience replay (as sketched above) and accumulates all of these tuples in a single buffer.

  2. We sample (state, action, reward, next_state) tuples from the ReplayBuffer and feed them to the MLP. The training process uses an epsilon-greedy exploration technique to choose either the action predicted by the MLP or a randomly chosen action. This epsilon-greedy policy is responsible for exploration, and its epsilon value varies: it is highest at the beginning (when a random action is more likely than the action chosen by the MLP) and decreases as the model trains for more episodes (and becomes more reliable).

  3. The local network is trained on the output of the Unity Environment, with the goal of maximizing the expected future reward. Its weights are updated through backpropagation, using the expected action value as the target (cf. the learning-step sketch above).

  4. Every few iterations (and weight updates) the target network receives the weights from the local network. In this work, this "copy" is actually a soft update, meaning the new target weights are a weighted sum of the local and target networks' weights, as sketched below.
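The soft update mentioned in step 4 can be sketched as an interpolation between the two parameter sets; the function name and the default interpolation factor `tau` below are assumptions for this sketch:

```python
def soft_update(q_local, q_target, tau=1e-3):
    """Blend the target network towards the local one: theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    for target_param, local_param in zip(q_target.parameters(), q_local.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```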

We repeat these steps over a large number of episodes to fully train the network!