
Notes on the Actor-Critic Algorithm

The vast majority of reinforcement learning methods fall into one of two categories: actor-only methods and critic-only methods. The former work with a parameterized family of policies: the gradient of the performance with respect to the actor parameters is estimated directly by simulation, and the parameters are updated in a direction of improvement. The latter, on the other hand, rely exclusively on value function approximation and aim at learning an approximate solution to the Bellman equation, which then hopefully prescribes a near-optimal policy.

For actor-only methods, a possible drawback is that the gradient estimators may have large variance. Also, as the policy changes, a new gradient is estimated independently of past estimates, so no "learning" in the sense of accumulating older information occurs.

Critic-only methods are indirect in that they do not try to optimize directly over a policy space. They may succeed in constructing a good approximation of the value function, yet they lack reliable guarantees on the near-optimality of the resulting policy.

Actor-critic methods[1] aim at combining the strong points of actor-only and critic-only methods. The critic uses an approximation architecture and simulation to learn a value function, which is then used to update the actor's policy parameters in a direction of performance improvement.

Actor-Critic Algorithm

Let us consider a Markov decision process (MDP) with finite state space S and finite action space A. We denote the given cost function by $$g: S\times A\rightarrow R$$

A randomized stationary policy (RSP) is a mapping $$\mu$$ that assigns to each state x a probability distribution over the action space A.

Consider a set of RSPs $$P=\{ \mu_\theta ; \theta \in R^n\}$$ parameterized by the vector $$\theta$$. For each $$(x,u)\in S\times A$$, $$\mu_\theta (x,u)$$ denotes the probability of taking action u when state x is encountered, and $$p_{xy}(u)$$ denotes the probability that the next state is y given that the current state is x and the current action is u.

We make the following assumptions:

  1. the map $$\theta \rightarrow \mu_\theta(x,u)$$ is twice differentiable with bounded first and second derivatives
  2. there exists a function $$\Phi_\theta(x,u)\in R^n$$ such that $$\nabla\mu_\theta(x,u)=\mu_\theta(x,u)\Phi_\theta(x,u)$$
  3. for each $$\theta \in R^n$$, the Markov chains $$\{X_n\}$$ and $$\{X_n, U_n\}$$ are irreducible and aperiodic

Thus we have

$$\Phi_\theta(x,u) = \frac{\nabla\mu_\theta(x,u)}{\mu_\theta(x,u)}=\nabla \ln{\mu_\theta(x,u)}$$
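
As a concrete illustration, assumption 2 holds for a softmax policy with linear scores. The NumPy sketch below (the feature map and parameter values are made up for illustration, not taken from the cited papers) checks numerically that $$\nabla\mu_\theta(x,u)=\mu_\theta(x,u)\Phi_\theta(x,u)$$ for such a parameterization.

```python
import numpy as np

# Softmax RSP with linear scores: mu_theta(x, u) ∝ exp(theta · phi(x, u)).
# The feature map phi and the parameter vector theta are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n_states, n_actions, n_params = 4, 3, 5
phi = rng.normal(size=(n_states, n_actions, n_params))   # phi(x, u) in R^n
theta = rng.normal(size=n_params)

def mu(theta, x):
    """Action probabilities mu_theta(x, ·) under the softmax parameterization."""
    scores = phi[x] @ theta
    scores -= scores.max()                 # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def grad_log_mu(theta, x, u):
    """Phi_theta(x, u) = grad of ln mu_theta(x, u); for softmax it is phi(x,u) - E_mu[phi(x,·)]."""
    p = mu(theta, x)
    return phi[x, u] - p @ phi[x]

# Check grad mu = mu * grad ln mu against central finite differences.
x, u, eps = 1, 2, 1e-6
numeric = np.array([(mu(theta + eps * np.eye(n_params)[i], x)[u]
                     - mu(theta - eps * np.eye(n_params)[i], x)[u]) / (2 * eps)
                    for i in range(n_params)])
analytic = mu(theta, x)[u] * grad_log_mu(theta, x, u)
print(np.allclose(numeric, analytic, atol=1e-6))   # True
```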

Denote $$\eta_\theta(x,u)=\pi_\theta(x)\mu_\theta(x,u)$$, where $$\pi_\theta$$ is the stationary distribution of the Markov chain $$\{X_n\}$$ under $$\mu_\theta$$, and consider the average cost function $$\lambda$$

$$\lambda(\theta) = \sum_{x\in S, u\in A}{g(x,u)\eta_\theta(x,u)} $$

Thus, we are interested in minimizing $$\lambda(\theta)$$ over all $$\theta$$. We view actor-critic algorithms as stochastic gradient algorithms on the parameter space of the actor. [2]
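
To make the objective concrete, the following NumPy sketch evaluates $$\lambda(\theta)$$ for a small made-up MDP and a fixed RSP: it forms the state transition matrix under $$\mu_\theta$$, computes the stationary distribution $$\pi_\theta$$, and sums $$g\,\eta_\theta$$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 3

# Made-up MDP ingredients: transition kernel p_xy(u), cost g(x, u), and a fixed RSP mu(x, u).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # p_xy(u), each row sums to 1
g = rng.random((n_states, n_actions))      # cost g(x, u)
mu = rng.random((n_states, n_actions))
mu /= mu.sum(axis=1, keepdims=True)        # mu_theta(x, u)

# State transition matrix of {X_n} under mu: P_mu[x, y] = sum_u mu(x, u) p_xy(u).
P_mu = np.einsum('xu,xuy->xy', mu, P)

# Stationary distribution pi_theta: left eigenvector of P_mu for eigenvalue 1.
w, v = np.linalg.eig(P_mu.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

eta = pi[:, None] * mu                     # eta_theta(x, u) = pi_theta(x) mu_theta(x, u)
lam = np.sum(g * eta)                      # average cost lambda(theta)
print(lam)
```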

Let $$\theta$$ be the actor parameter vector. The critic's goal is to compute an approximation of the projection $$\Pi_\theta q_\theta$$ of the q-value function $$q_\theta$$ onto the subspace spanned by $$\Phi_\theta$$. The actor uses this approximation to update its policy in an approximate gradient direction. This is similar to what temporal difference (TD) algorithms try to do. [3]
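
To make the projection concrete: if the features $$\Phi_\theta(x,u)$$ are stacked as rows of a matrix and the stationary probabilities $$\eta_\theta(x,u)$$ are used as weights, then $$\Pi_\theta q_\theta$$ is the weighted least-squares fit of $$q_\theta$$ in the span of the features. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sa, n_feat = 12, 5                       # |S||A| state-action pairs, n feature dimensions

Phi = rng.normal(size=(n_sa, n_feat))      # rows: Phi_theta(x, u)
eta = rng.random(n_sa); eta /= eta.sum()   # stationary probabilities eta_theta(x, u)
q = rng.normal(size=n_sa)                  # made-up q-values q_theta(x, u)

# Projection of q onto span(Phi) in the eta-weighted inner product:
# r* solves (Phi^T D Phi) r = Phi^T D q with D = diag(eta).
D = np.diag(eta)
r_star = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ q)
q_proj = Phi @ r_star                      # the critic's target, Pi_theta q_theta

# The residual is eta-orthogonal to every feature direction.
print(np.allclose(Phi.T @ D @ (q - q_proj), 0, atol=1e-10))   # True
```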

Sutton et al. proved that this type of actor-critic method converges.[4]

Asynchronous Advantage Actor-Critic (A3C)[5]

Volodymyr Mnih et al. proposed an algorithm called A3C.

As illustrated in the previous section, at each time step t the agent observes a state $s_t$ and selects an action $a_t$ from some set of possible actions $A$ according to its policy $\pi$, where $\pi$ is a mapping from states $s_t$ to actions $a_t$. In return, the agent receives the next state $s_{t+1}$ and a scalar reward $r_t$.

In value-based model-free RL methods, the action value function is represented using a function approximator, e.g. a neural network. In one-step Q-learning, the parameters $\theta$ of the action value function $Q(s,a;\theta)$ are learned by iteratively minimizing a sequence of loss functions

$$L_i(\theta_i)=E\left[\left(r+\gamma \max_{a'}Q(s',a';\theta_{i-1})-Q(s,a;\theta_i)\right)^2\right]$$
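
A hedged PyTorch sketch of this loss for one batch of transitions; the network architecture, the randomly generated batch, and the hyperparameters are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 8, 4, 0.99

# Illustrative Q-network; any function approximator for Q(s, ·; theta) would do.
q_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())     # theta_{i-1}: frozen copy of the parameters

# A made-up batch of transitions (s, a, r, s', done).
batch = 32
s = torch.randn(batch, n_states)
a = torch.randint(n_actions, (batch,))
r = torch.randn(batch)
s_next = torch.randn(batch, n_states)
done = torch.zeros(batch)

with torch.no_grad():                               # targets use theta_{i-1}, no gradient flows through them
    target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta_i)
loss = ((target - q_sa) ** 2).mean()                    # L_i(theta_i)
loss.backward()                                         # a gradient step on theta_i would follow
```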

This method updates the action value Q(s,a) toward the one-step return. However, obtaining a reward only directly affects the value of the state-action pair $(s,a)$ that led to it. Using multiple parallel actor-learners also yields a reduction in training time that is roughly linear in the number of parallel actor-learners.

The asynchronous RL framework runs multiple actor-learners as CPU threads on a single machine. Moreover, each actor-learner can use a different exploration policy to maximize diversity. As a result, and in contrast to DQN, no replay memory is needed to stabilize learning.

In the asynchronous one-step Q-learning setting, each thread interacts with its own copy of the environment and computes the loss above. A shared and slowly changing target network is used, and gradients are accumulated over multiple timesteps before being applied, which reduces the chance of actor-learners overwriting each other's updates.
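
A minimal sketch of this pattern, using a tabular Q so that the accumulated "gradients" reduce to TD updates; the toy chain environment, step sizes, and sync intervals below are made up for illustration and are not the paper's hyperparameters.

```python
import threading
import numpy as np

# Toy chain environment: states 0..N-1, actions {0: left, 1: right}, reward 1 on reaching state N-1.
N, GAMMA, EPS, ALPHA = 6, 0.99, 0.3, 0.1
I_ASYNC, I_TARGET, STEPS = 5, 100, 5000

Q = np.zeros((N, 2))             # shared parameters ("theta")
Q_target = Q.copy()              # shared, slowly changing target network ("theta^-")
lock = threading.Lock()
global_t = [0]                   # shared global step counter

def step(s, a):
    s2 = min(N - 1, s + 1) if a == 1 else max(0, s - 1)
    return s2, float(s2 == N - 1), s2 == N - 1   # next state, reward, done

def worker(seed):
    rng = np.random.default_rng(seed)
    acc = np.zeros_like(Q)       # updates accumulated since the last application
    s, t_local = rng.integers(N - 1), 0
    for _ in range(STEPS):
        a = rng.integers(2) if rng.random() < EPS else int(Q[s].argmax())
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * Q_target[s2].max()
        acc[s, a] += ALPHA * (target - Q[s, a])  # accumulate the TD update
        s = rng.integers(N - 1) if done else s2  # restart episodes at a random state
        t_local += 1
        if t_local % I_ASYNC == 0:               # apply accumulated updates to the shared Q
            with lock:
                Q += acc
                global_t[0] += I_ASYNC
                if global_t[0] % I_TARGET == 0:
                    Q_target[:] = Q              # periodically sync the target network
            acc[:] = 0.0

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(Q.argmax(axis=1))          # greedy action per state; 1 ("right") for non-terminal states
```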

The asynchronous n-step Q-learning variant operates in the forward view, as opposed to the more common backward view used by techniques such as eligibility traces. The algorithm first selects actions using its exploration policy for up to $t_{max}$ steps or until a terminal state is reached, so the agent receives up to $t_{max}$ rewards since its last update. It then computes gradients for n-step Q-learning updates for each of the state-action pairs encountered since the last update.
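
The forward-view targets can be formed with a single backward pass over the rollout. The short sketch below uses a made-up rollout (the rewards and bootstrap value are arbitrary) purely to show how each state-action pair receives the longest n-step return available.

```python
import numpy as np

gamma = 0.99

# A made-up rollout of up to t_max steps: rewards r_t, ..., r_{t+k-1} and the bootstrap
# value max_a Q(s_{t+k}, a; theta^-), taken to be 0 if the rollout ended in a terminal state.
rewards = [0.0, 0.0, 1.0, 0.0, 0.5]
bootstrap = 2.3

# Forward view: the pair (s_{t+i}, a_{t+i}) gets the (k - i)-step target
# r_{t+i} + gamma r_{t+i+1} + ... + gamma^{k-i} * bootstrap.
targets = []
R = bootstrap
for r in reversed(rewards):
    R = r + gamma * R
    targets.append(R)
targets = targets[::-1]          # targets[i] is the n-step target for (s_{t+i}, a_{t+i})
print(np.round(targets, 3))
```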

A3C was developed to train deep neural network policies reliably and without large resource requirements. Unlike Q-learning, which is an off-policy value-based method, A3C is an on-policy actor-critic method.

A3C maintains a policy $\pi(a_t|s_t;\theta)$ and an estimate of the value function $V(s_t;\theta_v)$. It operates in the forward view and uses n-step returns to update both the policy and the value function.

Its policy update can be seen as

$$\nabla_{\theta'}\log{\pi(a_t|s_t;\theta')}\,A(s_t, a_t;\theta,\theta_v)$$

where the advantage is estimated by

$$A(s_t, a_t;\theta,\theta_v)=\sum_{i=0}^{k-1}\gamma^i r_{t+i} + \gamma^k V(s_{t+k};\theta_v)-V(s_t;\theta_v)$$

where $k$ can vary from state to state and is upper-bounded by $t_{max}$.
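
A hedged PyTorch sketch of how one actor-learner could form this update from a collected rollout; the two-headed network, the randomly generated rollout, and the 0.5 value-loss weight are illustrative assumptions rather than the paper's exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn

gamma, obs_dim, n_actions, t_max = 0.99, 8, 4, 5

# Small two-headed network: shared trunk, policy logits pi(·|s; theta), and value V(s; theta_v).
class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.policy = nn.Linear(64, n_actions)
        self.value = nn.Linear(64, 1)
    def forward(self, s):
        h = self.trunk(s)
        return self.policy(h), self.value(h).squeeze(-1)

net = ActorCritic()

# A made-up rollout of t_max steps (states, actions, rewards) ending in a non-terminal state.
states = torch.randn(t_max + 1, obs_dim)     # s_t, ..., s_{t+k}
actions = torch.randint(n_actions, (t_max,))
rewards = torch.randn(t_max)

logits, values = net(states)
log_probs = torch.log_softmax(logits[:-1], dim=1).gather(1, actions.unsqueeze(1)).squeeze(1)

# n-step returns R_i = r_i + gamma r_{i+1} + ... + gamma^{k-i} V(s_{t+k}); advantage A_i = R_i - V(s_i).
R = values[-1].detach()                      # bootstrap from V(s_{t+k}; theta_v)
returns = torch.zeros(t_max)
for i in reversed(range(t_max)):
    R = rewards[i] + gamma * R
    returns[i] = R
advantages = returns - values[:-1]

policy_loss = -(log_probs * advantages.detach()).sum()   # gradient: -sum_i grad log pi(a_i|s_i) A_i
value_loss = advantages.pow(2).sum()                     # critic regression toward the n-step return
(policy_loss + 0.5 * value_loss).backward()              # gradients would then be pushed to the shared model
```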

Future

Yuhuai Wu et al. proposed a scalable trust-region method for deep RL using Kronecker-factored approximation (ACKTR). It improves on average over a first-order gradient method (A2C), which suggests that extending Kronecker-factored natural gradient approximations to other RL algorithms is quite promising. [6]

References

@article{konda1999actor,
  title={Actor-Critic--Type Learning Algorithms for Markov Decision Processes},
  author={Konda, Vijaymohan R and Borkar, Vivek S},
  journal={SIAM Journal on Control and Optimization},
  volume={38},
  number={1},
  pages={94--123},
  year={1999},
  publisher={SIAM}
}
@inproceedings{konda2000actor,
  title={Actor-critic algorithms},
  author={Konda, Vijay R and Tsitsiklis, John N},
  booktitle={Advances in neural information processing systems},
  pages={1008--1014},
  year={2000}
}
@techreport{tsitsiklis1996analysis,
  title={An analysis of temporal-difference learning with function approximation},
  author={Tsitsiklis, JN and Van Roy, B},
  year={1996},
  number={LIDS-P-2322},
  institution={Laboratory for Information and Decision Systems, Massachusetts Institute of Technology}
}
@inproceedings{sutton2000policy,
  title={Policy gradient methods for reinforcement learning with function approximation},
  author={Sutton, Richard S and McAllester, David A and Singh, Satinder P and Mansour, Yishay},
  booktitle={Advances in neural information processing systems},
  pages={1057--1063},
  year={2000}
}
@inproceedings{mnih2016asynchronous,
  title={Asynchronous methods for deep reinforcement learning},
  author={Mnih, Volodymyr and Badia, Adria Puigdomenech and Mirza, Mehdi and Graves, Alex and Lillicrap, Timothy and Harley, Tim and Silver, David and Kavukcuoglu, Koray},
  booktitle={International conference on machine learning},
  pages={1928--1937},
  year={2016}
}
@inproceedings{wu2017scalable,
  title={Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation},
  author={Wu, Yuhuai and Mansimov, Elman and Grosse, Roger B and Liao, Shun and Ba, Jimmy},
  booktitle={Advances in neural information processing systems},
  pages={5279--5288},
  year={2017}
}