# A2C
The vast majority of reinforcement learning methods fall into one of two categories: actor-only methods and critic-only methods. The former work with a parameterized family of policies: the gradient of the performance with respect to the actor parameters is estimated directly by simulation, and the parameters are updated in a direction of improvement. The latter rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which then hopefully prescribes a near-optimal policy.
For actor-only methods, a possible drawback is that the gradient estimators may have a large variance. Also, as the policy changes, a new gradient is estimated independently of past estimates, so no "learning" occurs in the sense of accumulating and consolidating older information.
Critic-only methods are indirect in that they do not optimize directly over a policy space. Thus such methods may succeed in constructing a good approximation of the value function, yet they lack reliable guarantees on the near-optimality of the resulting policy.
Actor-critic methods[1] aim at combining the strong points of actor-only and critic-only methods. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement.
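To make this interplay concrete, here is a minimal tabular actor-critic sketch (softmax actor, TD(0) critic). It is an illustration only: the environment interface (`env.reset()` / `env.step(a)` returning `(next_state, reward, done)`), the hyperparameters, and the function names are assumptions, not taken from [1] or from the MAgent code.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def tabular_actor_critic(env, n_states, n_actions, episodes=500,
                         gamma=0.99, alpha_actor=0.05, alpha_critic=0.1):
    """Minimal tabular actor-critic: a TD(0) critic V and a softmax actor with
    preferences theta[s, a]. The env interface (reset()/step(a) returning
    (next_state, reward, done)) and the hyperparameters are assumed here.
    Rewards are maximized, as is conventional in code."""
    theta = np.zeros((n_states, n_actions))  # actor parameters (policy preferences)
    V = np.zeros(n_states)                   # critic's value-function estimate

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            pi = softmax(theta[s])
            a = np.random.choice(n_actions, p=pi)
            s_next, r, done = env.step(a)

            # Critic: one-step TD error; it also serves as the actor's signal.
            td_target = r + (0.0 if done else gamma * V[s_next])
            td_error = td_target - V[s]
            V[s] += alpha_critic * td_error

            # Actor: step in a direction of performance improvement.
            # For a softmax policy, grad log pi(a|s) = one_hot(a) - pi.
            grad_log_pi = -pi
            grad_log_pi[a] += 1.0
            theta[s] += alpha_actor * td_error * grad_log_pi

            s = s_next
    return theta, V
```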
Let us consider a Markov decision process (MDP) with finite state space S and finite action space A. We denote the given one-stage cost function by $$ g(x,u) $$, for state $$ x \in S $$ and action $$ u \in A $$.
A randomized stationary policy (RSP) is a mapping $$ \mu $$ that assigns to each state x a probability distribution $$ \mu(x,\cdot) $$ over the action space A.
Consider a parameterized family of RSPs $$ \{\mu_\theta;\ \theta \in R^n\} $$.
We make the following assumptions:
- For every $$ x \in S $$ and $$ u \in A $$, the map $$ \theta \rightarrow \mu_\theta(x,u) $$ is twice differentiable, with bounded first and second derivatives.
- There exists a function $$ \Phi_\theta(x,u) \in R^n $$ such that $$ \nabla\mu_\theta(x,u) = \mu_\theta(x,u)\Phi_\theta(x,u) $$ (a concrete example follows this list).
- For each $$ \theta \in R^n $$, the Markov chains $$ \{X_n\} $$ and $$ \{(X_n, U_n)\} $$ are irreducible and aperiodic.
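As a concrete example (standard, though not spelled out in the references): a Boltzmann (softmax) parameterization with an assumed feature vector $$ \varphi(x,u) \in R^n $$ satisfies the second assumption,

$$ \mu_\theta(x,u) = \frac{\exp\{\theta^\top \varphi(x,u)\}}{\sum_{u' \in A} \exp\{\theta^\top \varphi(x,u')\}}, \qquad \nabla_\theta \mu_\theta(x,u) = \mu_\theta(x,u)\Big(\varphi(x,u) - \sum_{u' \in A} \mu_\theta(x,u')\,\varphi(x,u')\Big), $$

so that $$ \Phi_\theta(x,u) = \nabla_\theta \ln \mu_\theta(x,u) $$ exists and is bounded.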
Under these assumptions, for each $$ \theta $$ the chains $$ \{X_n\} $$ and $$ \{(X_n, U_n)\} $$ have unique stationary distributions, which we denote by $$ \pi_\theta(x) $$ and $$ \eta_\theta(x,u) = \pi_\theta(x)\mu_\theta(x,u) $$.

Denote the average cost of the policy $$ \mu_\theta $$ by

$$ \lambda(\theta) = \sum_{x \in S} \sum_{u \in A} g(x,u)\, \eta_\theta(x,u). $$

Thus, we are interested in minimizing $$ \lambda(\theta) $$ over all $$ \theta $$.

Let $$ V_\theta(x) $$ and $$ q_\theta(x,u) $$ denote the differential state and state-action cost functions of $$ \mu_\theta $$, i.e. the solutions of the associated Poisson equation.
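With this notation, the key fact underlying actor-critic methods is the gradient formula for the average cost (stated here, under the assumptions above, in the form given in [2]):

$$ \nabla \lambda(\theta) = \sum_{x \in S} \sum_{u \in A} \eta_\theta(x,u)\, q_\theta(x,u)\, \Phi_\theta(x,u). $$

The actor follows an estimate of this gradient, while the critic estimates $$ q_\theta $$.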
Sutton et al. proved that actor-critic methods of this type converge.[4]
Volodymyr Mnih et al. proposed an algorithm called A3C (asynchronous advantage actor-critic).[5]
As illustrated in the previous section, at each time step t the agent receives a state $$ s_t $$ and selects an action $$ a_t $$ from the set of possible actions according to its policy $$ \pi $$, after which it observes a scalar reward $$ r_t $$ and the next state $$ s_{t+1} $$.
In value-based model-free RL methods, the action value function is represented using a function approximator, e.g. a neural network. In one-step Q-learning, the parameters $$ \theta $$ of the action value function $$ Q(s,a;\theta) $$ are learned by iteratively minimizing a sequence of loss functions, the i-th of which has the form

$$ L_i(\theta_i) = E\big[(r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i))^2\big]. $$
This method updates the action value Q(s,a) toward the one-step return $$ r + \gamma \max_{a'} Q(s',a';\theta) $$. However, obtaining a reward r only directly affects the value of the state-action pair $$ (s,a) $$ that led to the reward; the values of other pairs are affected only indirectly, through the updated value Q(s,a), which can make learning slow since many updates are required to propagate a reward to the relevant preceding states and actions. Using n-step returns instead propagates a reward to n preceding state-action pairs in a single update.
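A minimal sketch of this loss in PyTorch, assuming a small fully connected Q-network and a separate, periodically synced target network in the role of $$ \theta_{i-1} $$ (the architecture, sizes, and learning rate are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Illustrative sizes: 4-dimensional observations, 2 actions.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def one_step_q_loss(s, a, r, s_next, done):
    """(r + gamma * max_a' Q(s', a'; theta_old) - Q(s, a; theta))^2, with the
    slowly changing target network playing the role of theta_{i-1}.
    s, s_next: (batch, 4) float; a: (batch,) long; r, done: (batch,) float."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values      # max_a' Q(s', a'; theta_{i-1})
        target = r + gamma * (1.0 - done) * q_next         # one-step return
    return ((target - q_sa) ** 2).mean()

# One gradient step on a batch of transitions (s, a, r, s_next, done):
# loss = one_step_q_loss(s, a, r, s_next, done)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```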
The asynchronous RL framework consists of asynchronous actor-learners, run as multiple CPU threads on a single machine. Moreover, each actor-learner can use a different exploration policy to maximize diversity. Because the parallel actor-learners already decorrelate the updates, no replay memory is needed to stabilize learning, in contrast with DQN.
In the asynchronous one-step Q-learning scenario, each thread interacts with its own copy of the environment and computes the Q-learning loss. A shared and slowly changing target network is used, and gradients are accumulated over multiple timesteps in order to reduce the chances of actor-learners overwriting each other's updates.
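A simplified, single-process sketch of one such actor-learner is below; the real algorithm runs many of these loops in parallel threads against shared parameters. The environment interface, `n_actions`, the epsilon value, and the interval constants `I_async` / `I_target` are illustrative assumptions.

```python
import torch

def actor_learner(shared_q, shared_target, shared_opt, env, n_actions,
                  gamma=0.99, epsilon=0.1, I_async=5, I_target=10_000, T_max=1_000_000):
    """One asynchronous one-step Q-learning actor-learner (simplified sketch).
    Gradients accumulate in the shared parameters and are applied only every
    I_async steps; the target network is synced every I_target steps."""
    shared_opt.zero_grad()
    s = torch.as_tensor(env.reset(), dtype=torch.float32)
    for t in range(1, T_max + 1):
        # Epsilon-greedy exploration; each thread may use a different epsilon.
        if torch.rand(()).item() < epsilon:
            a = int(torch.randint(n_actions, (1,)).item())
        else:
            a = int(shared_q(s).argmax().item())
        s_next, r, done = env.step(a)
        s_next = torch.as_tensor(s_next, dtype=torch.float32)

        # One-step target from the slowly changing target network.
        with torch.no_grad():
            y = r if done else r + gamma * shared_target(s_next).max().item()
        loss = (y - shared_q(s)[a]) ** 2
        loss.backward()                      # accumulate gradient, do not apply yet
        s = torch.as_tensor(env.reset(), dtype=torch.float32) if done else s_next

        if t % I_target == 0:                # sync the target network
            shared_target.load_state_dict(shared_q.state_dict())
        if t % I_async == 0 or done:         # apply accumulated gradients
            shared_opt.step()
            shared_opt.zero_grad()
```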
The asynchronous n-step Q-learning scenario operates in the forward view, as opposed to the more common backward view used by techniques such as eligibility traces. The algorithm first selects actions using its exploration policy for up to $$ t_{max} $$ steps or until a terminal state is reached, then computes gradients for n-step Q-learning updates for each state-action pair visited since the last update, with each update using the longest possible n-step return.
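The forward-view bookkeeping amounts to computing, for every step of the rollout, the longest n-step return available from the rewards gathered since the last update. A small helper illustrating this (the function name and interface are mine, not from the paper):

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Forward-view n-step returns for a rollout of up to t_max steps.
    rewards: [r_t, ..., r_{t+k-1}] collected since the last update;
    bootstrap_value: value estimate at the state that follows the rollout
    (0.0 if the rollout ended in a terminal state). Entry i of the result is
    the longest available n-step return for the i-th visited state."""
    returns, R = [], bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return returns

# Example: n_step_returns([1.0, 0.0, 1.0], 2.0) gives the 3-, 2- and 1-step
# returns [1 + 0.99*(0 + 0.99*(1 + 0.99*2.0)), 0 + 0.99*(1 + 0.99*2.0), 1 + 0.99*2.0].
```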
A3C was developed to train deep neural network policies reliably and without large resource requirements. Unlike Q-learning, which is an off-policy value-based method, actor-critic is an on-policy policy gradient method.
A3C maintains a policy $$ \pi(a_t|s_t;\theta) $$ and an estimate of the value function $$ V(s_t;\theta_v) $$. Like the n-step Q-learning variant, it operates in the forward view and uses n-step returns to update both the policy and the value function.
Its update can be seen as

$$ \nabla_{\theta'} \log \pi(a_t|s_t;\theta')\, A(s_t,a_t;\theta,\theta_v), $$

where $$ A(s_t,a_t;\theta,\theta_v) $$ is an estimate of the advantage function, given by

$$ A(s_t,a_t;\theta,\theta_v) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k};\theta_v) - V(s_t;\theta_v), $$

where k can vary from state to state and is upper-bounded by $$ t_{max} $$.
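The synchronous variant of this update (A2C) is typically implemented as a single loss over a rollout of length $$ k \le t_{max} $$: a policy gradient term weighted by the advantage, a value regression term, and, following the paper's suggestion, an entropy bonus for exploration. The sketch below is one common way to write it in PyTorch; the coefficients and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def a2c_loss(policy_logits, values, actions, rewards, bootstrap_value,
             gamma=0.99, value_coef=0.5, entropy_coef=0.01):
    """A2C-style loss over one rollout of length k <= t_max.
    policy_logits: (k, n_actions); values: (k,) = V(s_i; theta_v);
    actions: (k,) long; rewards: list of k floats;
    bootstrap_value: float, detached V(s_{t+k}) or 0.0 if terminal.
    The coefficients are common defaults, not values taken from the paper."""
    # n-step returns R_i = sum_j gamma^j r_{i+j} + gamma^(k-i) V(s_{t+k})
    returns, R = [], bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)))

    advantages = returns - values                                 # A(s_i, a_i; theta, theta_v)
    log_probs = F.log_softmax(policy_logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1) # log pi(a_i | s_i)

    policy_loss = -(chosen * advantages.detach()).mean()          # actor term
    value_loss = F.mse_loss(values, returns)                      # critic term
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()   # exploration bonus
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```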
Yuhuai Wu et al. proposed a scalable trust-region method for DRL using Kronecker-factored approximation (ACKTR). This is an improvement, on average, over first-order gradient methods such as A2C, which suggests that extending Kronecker-factored natural gradient approximations to other algorithms in RL is quite promising.[6]
@article{konda1999actor,
title={Actor-Critic--Type Learning Algorithms for Markov Decision Processes},
author={Konda, Vijaymohan R and Borkar, Vivek S},
journal={SIAM Journal on Control and Optimization},
volume={38},
number={1},
pages={94--123},
year={1999},
publisher={SIAM}
}
@inproceedings{konda2000actor,
title={Actor-critic algorithms},
author={Konda, Vijay R and Tsitsiklis, John N},
booktitle={Advances in neural information processing systems},
pages={1008--1014},
year={2000}
}
@techreport{tsitsiklis1996analysis,
title={An analysis of temporal-difference learning with function approximation},
author={Tsitsiklis, JN and Van Roy, B},
year={1996},
number={LIDS-P-2322},
institution={Laboratory for Information and Decision Systems, Massachusetts Institute of Technology}
}
@inproceedings{sutton2000policy,
title={Policy gradient methods for reinforcement learning with function approximation},
author={Sutton, Richard S and McAllester, David A and Singh, Satinder P and Mansour, Yishay},
booktitle={Advances in neural information processing systems},
pages={1057--1063},
year={2000}
}
@inproceedings{mnih2016asynchronous,
title={Asynchronous methods for deep reinforcement learning},
author={Mnih, Volodymyr and Badia, Adria Puigdomenech and Mirza, Mehdi and Graves, Alex and Lillicrap, Timothy and Harley, Tim and Silver, David and Kavukcuoglu, Koray},
booktitle={International conference on machine learning},
pages={1928--1937},
year={2016}
}
@inproceedings{wu2017scalable,
title={Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation},
author={Wu, Yuhuai and Mansimov, Elman and Grosse, Roger B and Liao, Shun and Ba, Jimmy},
booktitle={Advances in neural information processing systems},
pages={5279--5288},
year={2017}
}