DP and RL
- "The theoretical question of “whether model-free algorithms can be made sample efficient” is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions." - Bubeck-Jordan
- "A major drawback of dynamic programming is that it requires the so-called transition probabilities of the system. These probabilities are often hard to obtain for complex and large-scale systems. Hence, a computational challenge created by real-world problems is to evaluate the value function without having to compute the transition functions." - Gosavi
- "The main bottleneck in policy iteration is the computation of the value function for a given policy. Using the Bellman equation, only two classes of problems have been solved exactly: tabular discrete state and action problems as well as LQRs. The exact computation of the value function remains an open problem for most systems with continuous state spaces." - Kroemer-Peters
RL algorithms have to deal with two problems:
1. Unknown Model: The most useful form of a model is the transition probability model, but this is almost never known. If the domain is such that we can get a lot of samples, then this transition model can be estimated accurately a priori. If that is not possible, then we can do the same if we have access to a cheap simulator and the problem space S x A is not too large (see the sketch after this list). If neither is possible, then we need to think of what Gosavi calls model-building algorithms that incrementally improve their model (transition or otherwise) while computing optimal policies.
2. Unknown Policy: If the transition model is accurately known, then we can simply use Value or Policy Iteration. This is usually not the case, and depending on the choices made to solve 1, an appropriate algorithm is chosen to compute the policy.
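As an illustration of estimating a transition model from samples, here is a minimal sketch in Python. It assumes a tabular MDP and a simulator that can be queried at arbitrary (s, a) pairs; the `simulator_step(s, a) -> (s_next, reward)` interface is hypothetical and stands in for whatever simulator is actually available.

```python
import numpy as np

def estimate_transition_model(simulator_step, n_states, n_actions, samples_per_pair=100):
    """Estimate P(s' | s, a) by counting sampled transitions for every (s, a) pair."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(samples_per_pair):
                s_next, _reward = simulator_step(s, a)   # hypothetical simulator interface
                counts[s, a, s_next] += 1
    # Normalize counts into probabilities (fall back to uniform if a pair were never sampled).
    totals = counts.sum(axis=2, keepdims=True)
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
```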
- Model-based: I need the transition-probability model before I can efficiently compute the value function using DP. If I have a simulator, I will first build a model and then learn the policy. But computing transition probabilities for a complex system is hard :( . Eg: Policy Iteration, Value Iteration.
- Model-building: I want to use the transition probabilities too, but I won't wait until they are fully computed. I have access to a simulator which I will use to build a transition probability model at run-time. And as I do that, I will also update my value function/policy using these partially built models. Eg: R-MAX, MBIE, RTDP.
- Model-free: Who cares about a model? I have an efficient simulator which I will use to directly estimate the policy/value function. But as you would guess, these estimates can be unstable :( . Eg: Q-learning, Policy-gradient algorithms.
Researchers often spend a lot of effort building simplified simulators that let them get more samples, e.g. physics simulators. However, note that this still does not give us the transition model, which is what we want. It only makes it easier to estimate the transition model or allows the use of model-free algorithms. The term model-free is hence misleading: on most hard problems, these algorithms still need access to a simplified and cheaper simulator or model.
What should we do if we don't have a simulator, or if the simulator is expensive? Irrespective of whether we use model-based or model-free algorithms, there is value in learning/constructing a simulator.
| Model-based | Model-free |
|---|---|
| RTDP | Q-learning |
- Policy Iteration: Start with some policy and then loop until convergence (a minimal sketch follows this item):
  - Policy Evaluation (compute the policy's value function): This is a linear operation, since we don't take a min/max while following a fixed policy. We can therefore solve a system of linear equations for the exact value function (if the state space is small) or run iterative policy evaluation (if it is big), etc.
  - Policy Improvement: Improve the policy by acting greedily with respect to a one-step look-ahead of the value function.

  In practice, this is much faster than Value Iteration.
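A minimal sketch of tabular policy iteration, assuming the transition model `P` (shape S x A x S) and expected rewards `R` (shape S x A) are known and stored as NumPy arrays; the policy-evaluation step solves the linear Bellman system exactly.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """P: (S, A, S) transition probabilities, R: (S, A) expected rewards."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(n_states), policy]   # (S, S) transitions under the current policy
        R_pi = R[np.arange(n_states), policy]   # (S,) rewards under the current policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: one-step look-ahead, then act greedily.
        Q = R + gamma * P @ V                   # (S, A) action values
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```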
- REPS: Relative Entropy Policy Search
- Q-learning: Model-free version of RTDP. Updates the Q-value by computing the error between the predicted Q-value and the observed (target) Q-value. The target is `r + γ max_a' Q(s', a')`, where the max is over all actions that can be taken from `s'`. Q-learning is off-policy and learns the optimal policy directly (see the sketch after the SARSA item).
- SARSA: A more conservative, on-policy variant of Q-learning. Updates the Q-value by computing the error between the predicted Q-value and the observed (target) Q-value. The target is `r + γ Q(s', a')`, where `a'` is the action actually taken in `s'`. SARSA evaluates the policy it follows (including exploration), so it learns a sub-optimal policy unless the random exploratory actions are decayed over time.
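A minimal sketch contrasting the Q-learning and SARSA updates, assuming a tabular environment with integer states and a gymnasium-style `reset`/`step` interface; the environment API and the epsilon-greedy helper are assumptions, not something prescribed by these notes.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    s, _ = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, epsilon, rng)
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Q-learning target (off-policy): bootstrap from the greedy action in s'.
        target = r if terminated else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    s, _ = env.reset()
    a = epsilon_greedy(Q, s, epsilon, rng)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = epsilon_greedy(Q, s_next, epsilon, rng)
        # SARSA target (on-policy): bootstrap from the action actually taken in s'.
        target = r if terminated else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next
```

With `Q = np.zeros((env.observation_space.n, env.action_space.n))`, calling these per-episode functions repeatedly performs tabular Q-learning / SARSA on a discrete environment such as FrozenLake.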
- Fitted VI: Represent the value function with a function approximator and train it using supervised learning (a minimal sketch follows). Check out Sergey's slides (vi-slides) and Chi Jin's results on linear function approximation.
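A minimal sketch of fitted value iteration with a linear function approximator. For simplicity it forms the Bellman backup targets from a known tabular model `P`, `R`; in practice the targets would come from sampled transitions, and the feature matrix is an assumed input. Note that fitted VI with function approximation is not guaranteed to converge.

```python
import numpy as np

def fitted_value_iteration(P, R, features, gamma=0.95, n_iters=50):
    """features: (S, d) state-feature matrix; V(s) is approximated as features[s] @ w."""
    w = np.zeros(features.shape[1])
    for _ in range(n_iters):
        V = features @ w
        # Bellman backup at every state: reward plus discounted next value, max over actions.
        targets = (R + gamma * P @ V).max(axis=1)        # (S,)
        # Supervised regression step: least-squares fit of w to the backup targets.
        w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return w
```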
- QT-Opt: Self-supervised, vision-based RL algorithm that learns a neural-network Q-function for real-world grasping.
- Motivated-RL (Peter Dayan)
- Intrinsic-Motivation-And-RL (Barto)
- garage: Implementations of many DRL algorithms.
- stable-baselines: RL baselines, but better.
- baselines: Efficient implementations of DRL algorithms.
- acme: DeepMind's library of reinforcement learning components and agents.
- coach: Intel's RL framework.
- pymdptoolbox: Solvers for discrete MDPs.
- spinningup: Tutorial + code. Good intermediate-level resource that focuses on the core RL algorithms.