Papers

Foundational Papers

  1. DAGGER - Deals with the problem of sequential prediction in an imitation learning setting. The observations made by an agent depend on the actions taken by its policy at previous time-steps, which violates the i.i.d. assumption made by most statistical learning algorithms. Classical approaches either ignore this dependence (leading to poor theoretical and empirical performance), train a non-stationary policy (a different policy for each time-step), or train a stochastic policy. DAGGER instead trains a single stationary, deterministic policy on the aggregate of all states visited over all iterations, which yields better guarantees and performance (a sketch of the training loop is below). It analyzes the algorithm as a reduction of imitation learning to no-regret online learning, where each mini-batch of trajectories collected under the current policy is treated as a single online-learning example. (13th June, 19)
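
A minimal Python sketch of the DAGGER loop, assuming hypothetical `env`, `expert`, and `policy` objects (these interfaces are placeholders, not from the paper); the expert/learner mixing (the beta schedule from the paper) is omitted for brevity.

```python
# Rough sketch of the DAGGER loop; env, expert and policy interfaces are
# hypothetical placeholders, and the beta mixing schedule is omitted.

def dagger(env, expert, policy, n_iters=10, horizon=100):
    dataset = []                                        # aggregated (state, expert action) pairs
    for _ in range(n_iters):
        state = env.reset()
        for _ in range(horizon):
            action = policy.act(state)                  # the learner picks the action to roll out
            dataset.append((state, expert.act(state)))  # the expert labels the visited state
            state, done = env.step(action)
            if done:
                break
        policy.fit(dataset)                             # supervised learning on all data so far
    return policy
```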

  2. Depth First Iterative Deepening - First presents the Depth-First Iterative Deepening (DFID) algorithm, which is asymptotically optimal among uninformed tree-search algorithms in terms of the number of nodes expanded. It then combines iterative deepening with A* to obtain Iterative Deepening A* (IDA*), an informed, admissible algorithm that asymptotically expands the same number of nodes as A* without needing to store an OPEN or CLOSED list (a sketch is below). One major drawback of depth-first methods is that, for a given depth bound, they must explore all possible paths up to that depth. In a lattice this number is usually very large, so IDA* does not seem well suited to lattices. (18th June, 19)
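
A minimal Python sketch of IDA*, assuming hypothetical problem-specific callbacks `is_goal`, `neighbors`, `cost`, and `heuristic` (not from the paper); it returns only the solution cost and omits path reconstruction.

```python
# Repeated depth-first searches with an increasing bound on f = g + h;
# no OPEN or CLOSED list is stored. Callback names are placeholders.

def ida_star(start, is_goal, neighbors, cost, heuristic):
    bound = heuristic(start)

    def search(node, g, bound):
        # Returns ("found", solution cost) or ("bound", smallest f that exceeded the bound).
        f = g + heuristic(node)
        if f > bound:
            return ("bound", f)
        if is_goal(node):
            return ("found", g)
        smallest = float("inf")
        for nxt in neighbors(node):
            status, value = search(nxt, g + cost(node, nxt), bound)
            if status == "found":
                return ("found", value)
            smallest = min(smallest, value)
        return ("bound", smallest)

    while True:
        status, value = search(start, 0.0, bound)
        if status == "found":
            return value                  # cost of the solution found
        if value == float("inf"):
            return None                   # search space exhausted, no solution
        bound = value                     # raise the bound to the smallest overflow f
```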

  3. Inverse Reinforcement Learning from Failure - Standard IRL approaches learn a reward function (parameterized by some feature functions) such that the optimal policy induced by that reward best approximates the policy exhibited by a sample of optimal trajectories. IRLF additionally makes use of negative examples: given an extra set of failed trajectories (negative data), the reward function is optimized so that the induced policy explicitly avoids producing trajectories similar to the negative data set. A rough sketch of how such an objective could look is below.
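
An illustrative formulation (not necessarily the paper's exact objective) of how negative data could enter a feature-matching IRL objective; here $\phi$ are the feature functions, $\mu$ denotes expected discounted feature counts, and $\lambda$ trades off avoidance of the failure data.

```latex
% Reward assumed linear in features: R(s,a) = w^\top \phi(s,a).
% Standard IRL pulls the learner's feature expectations toward the successful
% demonstrations; an IRLF-style term additionally pushes them away from the
% failed (negative) demonstrations. Illustrative objective only.
\mu_\pi = \mathbb{E}_\pi\!\Big[\sum_t \gamma^t \phi(s_t, a_t)\Big], \qquad
\min_{w}\; \big\lVert \mu_{\pi_w} - \mu_{\mathrm{success}} \big\rVert^2
          \;-\; \lambda\, \big\lVert \mu_{\pi_w} - \mu_{\mathrm{failure}} \big\rVert^2
```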

RL

  1. Search on the Replay Buffer: Bridging Planning and Reinforcement Learning (Ben, Ruslan and Sergey) - This is better described as a paper on combining graph search with RL policies. The motivating domain is manipulation with object interaction from images. One could treat it as a pure planning problem, but the state space (image space) would be huge. Instead, the paper proposes using states stored in the replay buffer (the buffer of previously visited states collected during off-policy RL training) as graph nodes, and uses a learned goal-conditioned value function, interpreted as a distance between states, as edge costs. Now you can use the good old Dijkstra's algorithm (a sketch is below). The fundamental idea is to learn a map from the image space to a tractable set of nodes. Note that if you were not restricted to images, you wouldn't have this problem. Related: PRM-RL
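
A minimal Python sketch of the graph-search side, assuming a hypothetical `distance_fn` that stands in for the learned, value-function-derived distance between two observations; `buffer_states`, `max_dist`, and the other names are placeholders, not the paper's.

```python
# Build a graph over replay-buffer states using a learned distance, then plan with Dijkstra.
import heapq

def build_graph(buffer_states, distance_fn, max_dist=5.0):
    """Connect every ordered pair of buffer states whose learned distance is small."""
    graph = {i: [] for i in range(len(buffer_states))}
    for i, s in enumerate(buffer_states):
        for j, t in enumerate(buffer_states):
            if i != j:
                d = distance_fn(s, t)
                if d < max_dist:              # prune unreliable long edges
                    graph[i].append((j, d))
    return graph

def dijkstra(graph, start, goal):
    """Plain Dijkstra over the replay-buffer graph; returns a list of node ids or None."""
    dist, prev = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if goal != start and goal not in prev:
        return None                           # goal unreachable in the pruned graph
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return [start] + path[::-1]
```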

DL

  1. Population Based Training of Neural Networks (DeepMind):

  • Problem: Hyperparameter tuning is resource intensive. Prior works use multi-armed bandits (MAB), random search, grid search, etc. The main drawback is that they are either sequential or, when parallelized, share no information between runs.
  • Questions: The learning behaviour of neural networks is often highly non-linear. How do you predict their future behaviour from their past?
  • This paper formulates this problem as a mix of genetic algorithms and random search.
  • Instead of spending all the budget on a fixed hyper-parameter setting, they optimize for a schedule of hyperparameter settings to maximize performance.
  • Different models are run in parallel. Once a model is deemed ready, its weights are replaced with those of a better-performing member (the genetic/exploit step) and its hyperparameters are slightly perturbed (the random-search/explore step).
  • Results: The approach shows improved performance on a wide variety of DL and RL benchmarks compared to grid and random search.
  • Answers: They sidestep prediction: each member simply trains for a fixed number of steps (or until enough gradient-based learning has happened) before it exploits/explores. A sketch of the loop is below.
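
A minimal, serial Python sketch of the PBT loop, assuming a hypothetical member object exposing `train`, `eval`, `weights`, and `hyperparams`; the interfaces and perturbation factors are placeholders, not DeepMind's implementation.

```python
# Each round: train every member, then let the worst members copy (exploit) the
# weights and hyperparameters of a top member and perturb (explore) the copy.
import copy
import random

def pbt(population, n_rounds=20, steps_per_round=1000, bottom_frac=0.2):
    for _ in range(n_rounds):
        for member in population:
            member.train(steps_per_round)                 # ordinary gradient-based training
        population.sort(key=lambda m: m.eval(), reverse=True)
        cutoff = max(1, int(bottom_frac * len(population)))
        for loser in population[-cutoff:]:
            winner = random.choice(population[:cutoff])
            loser.weights = copy.deepcopy(winner.weights)  # exploit: copy the better model
            loser.hyperparams = {
                k: v * random.choice([0.8, 1.2])           # explore: perturb its hyperparameters
                for k, v in winner.hyperparams.items()
            }
    return max(population, key=lambda m: m.eval())
```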

Optimization

  1. A Direct Method for Trajectory Optimization of Rigid Bodies Through Contact (Posa, Tedrake, 2013) -