DeepRL - newlife-js/Wiki GitHub Wiki

by 서울대학교 양인순교수님

Reinforcement Learning 기초

Goal: maximize the expected cumulative reward

Agent

observes state s_t
executes action a_t
receives reward r_t

Environment

receives action a_t
emits reward r_t
updates state to s_t+1

■ Policy(mapping from state to action): π
Deterministic: π(s_t) = a_t
Stochastic(randomized): π(a_t|s_t) = Prob(a_t=a|s_t=s)

※ Decision rule: prescribes a procedure for action selection in each state at a specified stage
모든 stage에서의 decision rule을 모아놓은 것을 policy(strategy)라고 함
π_t: S->A
Markov(현재만 고려) vs history dependent(이전 상태들도 고려)
deterministic vs stochastic(randomized)

RL Problem

장점

flexibility: no knowledge about model
adaptivity: no reliance on fixed model
scalability: approximate solution to large-scale problems

단점

limited knowledge: no knowledge about model
exploration vs exploitation: unclear what to try
limited representation: unclear how to approximately represent the value function or policy of complex problmes

real-world challenges

Unstructured environments
Complex sensory inputs
Fast adaptation(환경이 너무 빨리 바뀜)
Reliability, safety(자율주행 인명사고 위험)

=> Deep learning 적용

can process complex sensory inputs
can handle unstructured environments
powerful computing tech and algorithms for training

Deep RL = RL + Deep learning

RL: sequential decision making interacting with environments
DL: representation of policies and value functions

Markov Decision Process(MDP)

참고
<S, A, P, R, γ>
find an optimal policy that maximizes the expected cumulative reward

S: set of states
A: set of actions
p: state transition probability / p(s'|s, a) = Prob(s_t+1=s'|s_t=s, a_t=a)
r: reward function / r(s_t, a_t) = r_t
γ: discount factor ∈ (0,1]

Assumtions

Rewards[r(s,a)] and transition probabilities[p(s'|s,a)] are stationary
Finite state and action sets -> r(s,a) is bounded(|r(s,a)| < M)
Discounting: 0<= γ < 1

Value function

The value function v^π(s) of a policy π is the expected reward starting from state s under executing π

Policy Evaluation

Decompose the value function

immediate reward
discounted value of next state

deterministic한 경우에는 a대신 π(s)가 들어가고, summation이 사라짐

여기서 =은 대입을 뜻하며, k가 무한대까지 가면 v(s)는 수렴하게 되며, 이 때의 v(s)가 true value function이 된다.

Policy Iteration(참고)

policy evaluation과 improvement를 번갈아가며 수행하며 optimal policy를 찾아간다.

※ Vector form(state 1~n까지의 value function을 v^π vector로 구성)

(단점: matrix inversion is inefficient for large-scale problems)

Contraction

Banach Fixed Point Theorem

Value Iteration

evaluation은 한 번만 하고 improvement만 여러 번 수행
greedy improvement: v가 max가 되는 a를 선택하는 것
아래 과정을 반복해서 v*=Tv* 가 되도록 하면 v* 가 optimal value function이 된다.

Bellman Operator T

※ Value Iteration과 Policy Iteration 둘 중 하나로 optimal value function 및 policy를 구할 수 있음
VI is simpler; but PI is often faster

※ 위와 같이 Dynamic programming을 이용해서 update를 하면, full-width backup이기 때문에 연산량이 많이 필요하다...
-> sampling backup: 한 action을 취해보고(try) 그 정보를 토대로 value function 업데이트
=> 계산도 효율적이고, model-free하다(iteration마다 reward function과 state transition matrix를 알 필요가 없기 때문에)

Q-Function(State-Action Value Function)

action에 따른 v를 보여주는 것이 Q

Q-Learning

model-free algorithm to find optimal policy
(여기서 model이란 reward와 transitioin probability에 대해 아는 것을 말한다.
model을 안다면, dynamic programming을 통해서 Bellman equation의 해를 구하면 된다.)
P, R을 모르더라도 데이터(=sample(s,a,s',r))을 가지고 estimate를 할 수 있음
new estimate만으로 다음 Q를 결정하지 않고, α learning rate를 두어 천천히 학습하도록 함

argmax_a{Q*(s,a)}로 optimal policy π(s)를 구할 수 있음

장점: very simple, off-policy(can use any policy to generate samples)

단점: large-scale problems에는 적용하기 어려움, correlation between samples, overestimation, exploration issue

Value Function approximation

large-scale problems을 풀기 위해, Q-function을 parameterize해서 approximate Q-Learning을 수행함
tabular methods(action value function을 table 형태로 만드는 것)는 메모리도 많이 들고 시간도 느림, continuous problem은 구하기 힘듦
기존의 Q-function가 아닌 parameter w를 update함
SGD나 batchGD 등의 방법을 통해서 최적의 w를 찾음

DQN(Deep Q-Networks)

참고
DeepRL: value function approximation을 DNN을 통해서 수행

Experience Replay(Replay buffer): 해당 시점 근처의 데이터만 sampling하면 잘못된 fitting이 될 수 있어(연속된 상태 간 correlation으로 인해) 이전 sampling들을 저장해서 fitting하는 데 사용, 데이터도 재사용할 수 있어 효율성도 높음
Fixed Q target: consistent and stable training을 위해 특정 학습 횟수동안 타겟(Q(s',a): θ-)을 고정한다
(Q의 변화에 따라 타겟과 추정치가 모두 변화하면 안정적인 학습이 어려움)

action에 대해 max를 취해야 하기 때문에, action의 수가 많으면 적용하기 어려움

Double Q-Learning DQN

overestimation을 줄이기 위해 Q-Function 2개를 사용함

Very simple, sample efficient, fairly stable training

Parameterizing Policy(Policy Gradient)

p_θ 분포를 따르는 τ에 대해 위 식의 기댓값을 구하는 것
최종 식은 transition probability에 의존하지 않게 된다.
expectation을 sample mean으로 대체해서 계산하면 gradient를 구할 수 있다.(REINFORCE algorithm)
τ에 대한 sampling은 현재 policy를 바탕으로 해야 함

장점: simple, unbiased gradient, local optimal solution

단점: high variance of the gradient, on policy(huge # of samples required)

※ variance를 줄이기 위해 reward에 baseline을 설정(value function으로)해준다.

Actor-Critic algorithm

위의 value function을 fit하기 위한 방법

Critic: action value function을 통해 현재의 policy를 평가
Actor: policy gradient로 θ를 update

※ policy gradient 식을 유도하기 위해서는 stochastic policy를 사용해야만 했지만, deterministic policy gradient(DPG) 알고리즘이 개발됨

Deep Deterministic Policy Gradient(DDPG)

critic에 DQN을 사용

Issues in Policy Gradient

TRPO(Trust Region Policy Optimization)

Use surrogate objective to guarantee monotonic improvement(good direction)
(@ Minorize-Maximazation algorithm)
Trust region method(good stepsize)
믿을 수 있는 한 가장 크게 stepsize를 설정
Importance sampling
-> 과거의 policy를 이용해서 sample을 generate 할 수 있음(off-policy)

Maximum Entropy Stochastic Control and RL

Exploration vs Exploitation

Exploration: Gather more information
Exploitation: Make the best decision given current information 둘은 서로 trade-off가 있으며 균형있게 사용하는 것이 중요

어떻게 Exploration과 Exploitation 중에 선택을 할 것인가?

ε-greedy: 1-ε의 확률로 greedy action(argmax Q)을 선택, ε의 확률로 random action 선택
simple and light computation / fine tuning ε, not quite systematic, inefficient
Optimism in the face of uncertainty(OFU): construct confidence set for MDP parameters, choose MDP parameters that give the highest rewardsand construct an optimal policy
regret optimal, systematic / complicated, computation intensive
Thompson (posterior) sampling: sample MDP parameters from posterior distribution, construct optimal policy and update the posterior distribution
regret optimal, systematic / somewhat complicated
Noisy networks: inject noise to the weight sof NN
simple, good empirical performance / not systematic
Information theoretic exploration: use high entropy policy to explore more
simple, good empirical performance / more theoretic analyses needed

Maximum entropy MDP problem

장점: computation 적음(no maximization involved), exploration 잘됨, standard MDP와의 structural similarity

Soft Actor-Critic.....

실습

gym(OpenAPI Gym)

RL benchmark problem 제공 패키지
Env 및 simulator를 사용하기 쉽도록 구성해 놓음
Agent만 짜고, gym의 env와 simulator를 이용해 학습하면 됨

예시)

import gym

env_id = 'Acrobot-v1'
# Build environment with env ID!
env = gym.make(env_id)
if colab:
    env = wrap_env(env)

# Initialize environment
observation = env.reset()

for t in range(1000):
    env.render()
    # Randomly sample action from environment
    action = env.action_space.sample() 
    # Simulate 1 step
    observation, reward, done, info = env.step(action) # observation = state
    # done is used to check terminal condition    
    if done: 
      break
# Close environment
env.close()

# Render environment
if colab:
    show_video()