Reinforcement Learning
Reinforcement learning is a large topic that covers many different disciplines and fields, and it is just one small part of artificial intelligence as a whole. That goes to show how large the topic of artificial intelligence is. If you are interested in reinforcement learning, or become interested after going through some examples, I would encourage you to read some or all of the book Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto, which is available online. It covers a very wide range of topics related to RL (reinforcement learning).
Ok, so what is Reinforcement Learning??
Reinforcement learning algorithms generally have an actor that takes actions within some environment. This can be a robot navigating a room, a computer playing in the game world of World of Warcraft, or even a program working within the environment of a chess game. At every step of the way, the actor performs an action and is given some reward signal to indicate whether it is doing well or not. In a video game, this may be your current score. In a game like chess, the reward signal may be given at the end of the game based on a win or a loss. The environment can be generalized as some sort of state. In chess, this state may be the current layout of pieces on the board. In a video game, it could be represented as all the possible placements of the various elements in the game world. As the actor takes actions, the world state changes.
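To make this concrete, here is a minimal sketch of that actor/environment loop using OpenAI Gym (the library we install in the Code section below). The "actor" here just picks random actions; the point is only to show the cycle of state, action, and reward.

```python
import gym

env = gym.make("CartPole-v0")       # the environment
state = env.reset()                 # its initial state
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()             # the actor chooses an action (randomly here)
    state, reward, done, info = env.step(action)   # the environment moves to a new state...
    total_reward += reward                         # ...and hands back a reward signal

print("Total reward for this episode:", total_reward)
```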
In other demonstrations, we've generally had a supervised learning algorithm. In other words, when our network made a prediction, we knew whether it was correct, or how badly it did, and we used that information to update our network accordingly. There has been some expert supervision that can attest to whether our program has produced correct values. With the MNIST example, this was encoded in the dataset itself. Every image was labeled with exactly which digit it contained, so after our network guessed which digit it saw, we could train it to do better. However, in many problems we don't know whether the output of our network is good or bad until much later. Video games are a classic example for RL. An AI opponent in a game like Starcraft may be making hundreds of actions per minute, yet its performance is only judged by whether or not it wins the game. If it wins, it must have done something right. If it loses, it must have made some bad moves. The trouble is, we don't know which moves were good or bad. If it was playing Go or chess, it may have made a brilliant opener but failed to follow through later in the game. This is a problem (often called the credit assignment problem) that newer algorithms such as DeepMind's AlphaGo attempt to address.
What makes RL different from many other disciplines in AI research is that it is very goal-oriented. Many valuable problems can be phrased as having an **actor** in some **environment** attempting to reach some **goal**, which it is **rewarded** for. Note the emphasis on key phrases; these elements more or less make up RL.
How does it work?
There are countless reinforcement learning algorithms that work in a variety of ways. Some of them use neural networks; many of them don't. We will look at two approaches: Q-learning and policy gradients.
Q-Learning
The goal of a Q-Learning algorithm is to maximize future rewards. We introduce the Q function, which has two inputs:
Q: S × A → R
where:
- S is a state
- A is an action
- R is the expected reward for taking action A in state S.
In other words, the Q function determines how much reward the agent can expect to receive (call this R) if, when in a particular state (call this S), it takes a particular action (call this A).
The goal of Q-learning is to learn this function. In simple Q-learning, this is done by creating a table of all possible states and actions. After taking an action, the actor receives its reward and can update values in the table. The update is usually done according to this messy math formula:

Q(s, a) ← Q(s, a) + α · (r + γ · max over a' of Q(s', a') − Q(s, a))

where α is a learning rate and γ discounts reward expected further in the future. This is a fancy way of saying: take the old value in the table, and adjust it based on the newly observed reward plus an estimate of what the actor can expect to receive later.
As you can imagine, having a table of all states and actions doesn't work for even a simple game like Pong, where the total number of possible game states (all possible positions of the paddles, all possible positions of the ball, and all possible velocities of the ball) becomes very large and unmanageable. However, for simpler grid-based games, like tic-tac-toe, this approach works well, and has been proven to converge on the optimal solution (aka the best possible actions), given enough time, of course.
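As a rough, self-contained sketch of tabular Q-learning, here is what the table and the update formula above might look like in Python, assuming a Gym environment with small, discrete state and action spaces (FrozenLake-v0 is used here purely as an example, and nothing is tuned for performance):

```python
import random
from collections import defaultdict
import gym

env = gym.make("FrozenLake-v0")
Q = defaultdict(float)                  # the Q "table": (state, action) -> expected reward
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy: usually take the best known action, occasionally explore
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = max(range(env.action_space.n), key=lambda a: Q[(state, a)])

        next_state, reward, done, _ = env.step(action)

        # the Q-learning update formula from above
        best_next = max(Q[(next_state, a)] for a in range(env.action_space.n))
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
```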
Deep Q-Learning
With the recent resurgence in neural network research, researchers at DeepMind created the Deep Q-Learning algorithm. Rather than attempting to learn the exact representation of the Q-function, we use a neural network to approximate the Q-function, and use the reward signal as our error signal. This turns out to be a much more tractable problem. Researchers at DeepMind applied this algorithm (with several small tweaks to improve performance) to 49 Atari games and achieved better-than-human performance in about half of them, using raw screen images as their actor's input, just as a human player would. Their actors used the learned Q-function to select the actions with the highest expected reward.
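DeepMind's actual system is much more elaborate, but the core trick can be sketched with a tiny hand-rolled network in numpy (everything below, including the dqn_update name, is my own illustration, not their code): instead of looking values up in a table, we compute Q(s, a) with a network and nudge its weights toward the TD target r + γ · max Q(s', a').

```python
import numpy as np

n_state_features, n_actions = 4, 2      # e.g. a CartPole-sized problem
gamma, learning_rate = 0.99, 0.01

rng = np.random.RandomState(0)
W1 = rng.normal(scale=0.1, size=(n_state_features, 16))   # one small hidden layer
W2 = rng.normal(scale=0.1, size=(16, n_actions))

def q_values(state):
    hidden = np.tanh(state @ W1)
    return hidden, hidden @ W2          # a Q-value for every action in this state

def dqn_update(state, action, reward, next_state, done):
    """One Deep Q-Learning step: nudge Q(state, action) toward the TD target."""
    global W1, W2
    hidden, q = q_values(state)
    _, q_next = q_values(next_state)
    target = reward if done else reward + gamma * np.max(q_next)
    td_error = target - q[action]       # the reward signal becomes our error signal

    # gradient step on the squared TD error, for the taken action only
    grad_W1 = np.outer(state, (1.0 - hidden ** 2) * W2[:, action])
    W2[:, action] += learning_rate * td_error * hidden
    W1 += learning_rate * td_error * grad_W1
```

In practice, DeepMind's algorithm adds refinements such as experience replay and a separate target network on top of this basic idea to keep training stable.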
Policy Gradient
Another approach to RL is the idea of a policy function. If you think of the Q-function as a value function, where the output is the value of a given action, the policy function instead directly addresses the problem of what action to take next. There are many similarities between a policy function and a value function; however, where a value function can be used to choose good actions, a policy function is trained directly to take good actions.
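To make "policy function" concrete, here is a rough, self-contained illustration (the names below are made up for this example): a minimal policy is just one linear layer followed by a softmax. The state goes in, a probability for each action comes out, and the actor samples its next action from that distribution.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

n_state_features, n_actions = 4, 2
theta = np.zeros((n_state_features, n_actions))   # the policy's parameters

def policy(state):
    return softmax(state @ theta)                 # probability of each action in this state

state = np.array([0.1, -0.2, 0.05, 0.3])
probs = policy(state)
action = np.random.choice(n_actions, p=probs)     # act by sampling from the policy
```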
We can use neural networks to represent our policy functions. It would be nice if we could compute the gradient of the reward with respect to our policy's parameters; however, we don't know exactly what the reward function is, since it is part of the environment, which our actor knows nothing about. If we could calculate that gradient, we could use our standard SGD algorithm to train the policy.
Therefore, the first step in the policy gradient scheme is to find some way to approximate that gradient. There are various methods to do that, such as vanilla policy gradients and natural policy gradients (more info here). We'll look at a simple method, called the finite-difference approach. The idea is simple: try out several policies close to the original policy, and calculate the average rate of change of the reward with respect to each parameter (aka the gradient). Note, however, that we need to do several rollouts before we get an idea of the gradient. A rollout (sometimes called an experiment) is a complete run through our environment. This might be a single game of chess, for instance.
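Below is a minimal sketch of the finite-difference idea. The run_rollout function is a toy stand-in (not part of any library) that plays one complete "episode" with the given policy parameters and returns the total reward, and estimate_gradient nudges each parameter a little to see how the reward changes.

```python
import numpy as np

def run_rollout(params):
    """Toy stand-in for one complete episode: returns the total reward earned by a
    policy controlled by `params`. Here the 'best' policy is simply the one whose
    parameters are closest to [1.0, -2.0]."""
    goal = np.array([1.0, -2.0])
    return -np.sum((params - goal) ** 2)

def estimate_gradient(params, epsilon=0.01):
    """Finite differences: perturb each parameter, run a rollout, and measure
    how much the total reward changed per unit of perturbation."""
    base_reward = run_rollout(params)
    grad = np.zeros_like(params)
    for i in range(len(params)):
        perturbed = params.copy()
        perturbed[i] += epsilon
        grad[i] = (run_rollout(perturbed) - base_reward) / epsilon
    return grad

params = np.zeros(2)
for step in range(200):
    params += 0.05 * estimate_gradient(params)    # gradient *ascent*: follow increasing reward

print(params)    # should end up close to the point of maximum reward
```

In a real environment each rollout is noisy, which is exactly why several rollouts are needed before the gradient estimate is trustworthy.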
Once we have an idea of the gradient, we can adjust our policy to do better over time. As we train on more and more rollouts, our actor improves. Just as with the other examples we've seen, as we train our network, it slowly gets better. In this case, our network becomes better at predicting "good" moves for the actor to make.
Code
We will be using OpenAI Gym, a collection of reinforcement learning environments. We are also going to use another Python library, RLFlow, to implement the policy gradient algorithm. You will need to install these packages on csx:
```
pip3 install --user rlflow
pip3 install --user gym
```
We're going to be training on the CartPole example environment.
```python
import gym
import tensorflow as tf
import tflearn

tf.global_variables_initializer = tf.initialize_all_variables  # hack around csx's old tensorflow version

from rlflow.core import tf_utils
from rlflow.policies.f_approx import Network
from rlflow.algos.grad import PolicyGradient


if __name__ == "__main__":
    env = gym.make("CartPole-v0")

    # we can't render on csx, so redefine the 'render' function to do nothing instead of crashing
    def donothing(*args, **kwargs):
        pass
    env.render = donothing

    with tf.Session() as sess:  # we need a tensorflow session... don't worry about this line
        # build the neural network (a very small network, with 1 hidden layer of 4 neurons)
        input_tensor = tflearn.input_data(shape=tf_utils.get_input_tensor_shape(env))
        net = tflearn.fully_connected(input_tensor, 4, activation='sigmoid')
        net = tflearn.fully_connected(net, env.action_space.n, activation='softmax')

        # tell RLFlow that our model is an ANN
        policy = Network(net,
                         sess,
                         Network.TYPE_PG)

        # set up the policy gradient algorithm in RLFlow
        pg = PolicyGradient(env,
                            policy,
                            session=sess,
                            episode_len=1000,
                            discount=True,
                            optimizer='adam')

        # and finally, train for a bit
        pg.train(max_episodes=50000)

        rewards = pg.test(episodes=10)
        print("Average:", float(sum(rewards)) / len(rewards))
```
adapted from here
References
- Great lecture by David Silver on Reinforcement Learning
- Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto