Reinforcement Learning - jaeaehkim/trading_system_beta GitHub Wiki

ํ•™์Šต์˜ ๋ถ„๋ฅ˜

(figure: taxonomy of learning methods)

  • Machine Learning
    • Making a machine learn something.
  • Supervised Learning
    • A machine learning approach that infers a function from Training Data.
      • To infer that function, an answer key (Label) must exist.
      • It is conventional to keep Test Data aside to validate the learned function.
    • The training data can be the Raw Input itself, or data engineered from the Raw Input -> Feature data.
    • The essential difference between neural-network learning (or deep learning) and classical statistical machine learning (Logistic, Decision Tree, Random Forest, SVM, etc.) is whether layers are stacked into a multi-layer structure that can discover more dynamic patterns.
      • Deep learning is commonly described as a method that automatically extracts features from the Raw Input alone, but strictly speaking this is closer to a difference in design.
  • Unsupervised Learning
    • Uses training data just like supervised learning, but learns without labels (Label).
    • Mainly tackles problems about how the data is organized (its features and structure).
      • ex) clustering, PCA
  • Reinforcement Learning
    • A learning process that corrects behavior through trial and error in order to maximize cumulative reward in a sequential decision-making problem.
      • Because it presupposes sequential decision making, it also fits 'quant' work well.
      • In particular, since the premise is sequential decision making, non-iid data is acceptable. Supervised ML requires iid data, which calls for the elaborate procedures described in DataStructures.

Structure of Reinforcement Learning

(figure: agent-environment interaction loop)

  • Architecture
    • Agent : the learning and decision-making subject of reinforcement learning
    • Environment : every element other than the Agent
    • Flow : the Agent takes an action, and the Environment reacts to that action; the Agent receives a State & Reward in response. This loop repeats at a fixed period (time step). Such trial & error is repeated until the Agent becomes able to take optimal actions without any answer key (a Label saying "how"). (A minimal sketch of this loop follows this list.)
      • The concept of Reward is distinct from the concept of Label data in supervised learning.
  • vs Supervised Learning
    • The very fact that the Reward can be sparse and delayed makes this fundamentally different from learning through Label data in supervised learning.
      • Supervised learning expresses the relationship between training data and Label data 'directly and immediately' as a function.
      • In reinforcement learning, accounting for how actions play out over time is itself part of what is learned.
    • Quant perspective
        1. A quant system built as a Factory that, through supervised learning, runs various preprocessing steps (for iid) and generates features through research
        2. A quant system that, through reinforcement learning, holds multiple agents trained by thorough trial & error without features (prior knowledge)
      • Neither can be said to be unconditionally better; because they are structurally different, they are likely to find different patterns, so the quant systems themselves should also be ensembled.
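
As referenced in the Flow item above, the agent-environment loop can be sketched in a few lines. This is a minimal illustration only; `CoinFlipEnv`, `RandomAgent`, and the `reset`/`step` interface are hypothetical stand-ins (loosely gym-style), not part of this repository.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: a random +1/-1 reward per step, episode ends after 10 steps."""

    def reset(self):
        self.t = 0
        return self.t                                   # initial state

    def step(self, action):
        self.t += 1
        reward = random.choice([1.0, -1.0])             # environment's reaction to the action
        done = self.t >= 10
        return self.t, reward, done                     # next state, reward, terminal flag


class RandomAgent:
    """Picks actions at random; a learning agent would update itself in observe()."""

    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

    def observe(self, state, action, reward, next_state, done):
        pass                                            # value/policy update would go here


def run_episode(env, agent):
    state, total, done = env.reset(), 0.0, False
    while not done:
        action = agent.act(state)                       # agent -> environment
        next_state, reward, done = env.step(action)     # environment -> agent (state & reward)
        agent.observe(state, action, reward, next_state, done)
        total += reward
        state = next_state
    return total


print(run_episode(CoinFlipEnv(), RandomAgent(actions=[0, 1])))
```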

MDP (Markov Decision Process)

MP (Markov Process)

(figure: example Markov Process state-transition diagram)

  • MP = (S, P); over the process from the initial state to the terminal state, movement between states is determined by the elements (transition probabilities) of P, the transition probability matrix. (A sampling sketch follows below.)
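
Sampling from an MP needs nothing but the transition matrix. A minimal sketch with a made-up 3-state chain (state 2 is absorbing/terminal):

```python
import numpy as np

# Toy MP: 3 states, state 2 is terminal (absorbing). P[i, j] = transition prob i -> j.
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],
])

def sample_episode(P, start=0, terminal=2, rng=np.random.default_rng(0)):
    """Roll states forward using only the transition matrix (Markov property)."""
    s, path = start, [start]
    while s != terminal:
        s = int(rng.choice(len(P), p=P[s]))
        path.append(s)
    return path

print(sample_episode(P))
```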

Markov Property

  • P[ S_{t+1} | S_t ] = P[ S_{t+1} | S_1, ..., S_t ]
    • "๋ฏธ๋ž˜๋Š” ์˜ค๋กœ์ง€ ํ˜„์žฌ ์ƒํƒœ์— ์˜ํ•ด ๊ฒฐ์ •๋œ๋‹ค"๋Š” ๋Œ€์ „์ œ์— ์˜ํ•ด MP๊ฐ€ ์ •์˜๋œ๋‹ค.
    • ๋‹ค๋งŒ, Markov Property๋Š” ๋‘ ๊ฐ€์ง€ ๋ฐฉํ–ฅ์œผ๋กœ ํ™•์žฅ๋  ์ˆ˜ ์žˆ๋‹ค. ํ™•์žฅ์„ ํ†ตํ•ด ๋‹ค์–‘ํ•œ variant version์ด ํƒ„์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.
      • t์‹œ์ ๊ณผ t+1์‹œ์ ์˜ ์ •๋ณด๊ฐ€ ๊ผญ State์ผ ํ•„์š”๋Š” ์—†๋‹ค. (Reward, Action์œผ๋กœ๋„ ํ™•์žฅ ๊ฐ€๋Šฅ > MRP,MDP)
      • ๊ผญ t+1๊ณผ t์‹œ์ ๊ฐ„์˜ ๊ด€๊ณ„์ผ ํ•„์š”๊ฐ€ ์—†๋‹ค. (t+1 <- (t,t-1,t-2))

MRP (Markov Reward Process)

  • MRP = (S, P, R, gamma)
  • R_s = E[ R_{t+1} | S_t = s ],   G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
  • v(s) = E[ G_t | S_t = s ]
    • MRP = (S, P, R, gamma); S and P are defined as in the MP.
    • **R (Reward Function)** is the reward received upon reaching a state S_t; since it depends on which state that is, it is a random variable and is therefore expressed with an Expectation.
    • gamma is the decay (discount) factor and takes a value between 0 and 1.
      • why? 1) mathematical convenience that prevents divergence, 2) modeling of time preference, 3) injecting uncertainty into future values
    • One run of the reinforcement learning process is called an episode, meaning the journey s0, R0, s1, R1 ... sT, RT. The sum of the rewards received from an arbitrary time t onward is semantically important, so it is written mathematically as the Return G_t, defined as the gamma-discounted sum of rewards from t+1 to the end of the episode.
    • How do we quantify the value of the state at time t? The quantified form is defined as the **state-value function**, expressed as the expectation of the Return given S_t = s, because the Return differs depending on which episode happens to be sampled from s. (A Monte Carlo sketch of this follows below.)
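
A small sketch tying the Return to the state-value function: sample episodes from a made-up MRP, accumulate G_t, and average the sampled returns to approximate v(s). The transition matrix, the rewards, and the convention that R[s] is received when leaving s are assumptions chosen only for illustration.

```python
import numpy as np

# Toy MRP: states 0..2, state 2 terminal; R[s] is the reward received on leaving s.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, -2.0, 0.0])
GAMMA = 0.9

def sample_return(start, rng):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... accumulated along one sampled episode."""
    s, g, discount = start, 0.0, 1.0
    while s != 2:
        g += discount * R[s]
        discount *= GAMMA
        s = int(rng.choice(3, p=P[s]))
    return g

def mc_state_value(start, n=10_000, seed=0):
    """v(s) = E[G_t | S_t = s], estimated by averaging sampled returns."""
    rng = np.random.default_rng(seed)
    return np.mean([sample_return(start, rng) for _ in range(n)])

print(mc_state_value(0))
```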

MDP (Markov Decision Process)

  • MDP = (S, A, P, R, gamma)
  • P^a_ss' = P[ S_{t+1} = s' | S_t = s, A_t = a ],   r^a_s = E[ R_{t+1} | S_t = s, A_t = a ]
  • pi(a|s) = P[ A_t = a | S_t = s ]
  • v_pi(s) = E_pi[ G_t | S_t = s ],   q_pi(s,a) = E_pi[ G_t | S_t = s, A_t = a ]
    • MDP = (S, A, P, R, gamma): the concept of a decision (via action and policy) is added to the MRP.
    • Previously the move s -> s' was determined only by the state transition probability; now one more stochastic process is added, in which the action influences how the state changes.
    • The policy function (pi function) gives the probability of choosing each action a in each state s.
    • The state-value function under an MDP is formally the same as before, except that the s -> s' transitions now follow pi.
    • The (state-)action value function exists to 'evaluate the action taken in each state'; an action cannot be evaluated in isolation from the state.
      • v_pi(s) means pi chooses the action in s, whereas q_pi(s,a) means a is forced in s; a kind of state conditioned on A_t = a. (See the sketch below.)
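
Concretely, a tabular policy is just a table of per-state probabilities over actions; the difference between v_pi and q_pi is whether the first action is drawn from pi or fixed to a. A minimal sketch with made-up numbers:

```python
import numpy as np

# A policy as a table: pi[s, a] = probability of taking action a in state s.
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])
rng = np.random.default_rng(0)

def act(pi, s, rng):
    """v_pi(s) averages over actions drawn like this; q_pi(s, a) instead fixes A_t = a."""
    return int(rng.choice(pi.shape[1], p=pi[s]))

print([act(pi, 0, rng) for _ in range(5)])
```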

For What

  • Prediction : given pi, the problem of evaluating the value of each state
  • Control : the problem of finding the optimal policy pi_star

Bellman Equation (MDP)

Bellman Expectation Equation

  • v_pi(s) = E_pi[ R_{t+1} + gamma*v_pi(S_{t+1}) | S_t = s ],   q_pi(s,a) = E_pi[ R_{t+1} + gamma*q_pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]
    • Expanding the definition of G_t recursively yields the equations above.

  • v_pi(s) = sum_a pi(a|s) * q_pi(s,a),   q_pi(s,a) = r^a_s + gamma * sum_s' P^a_ss' * v_pi(s')
  • ์œ„์˜ ์ˆ˜์‹์€ ํ˜„์žฌ ์ƒํƒœ์™€ ๋‹ค์Œ ์ƒํƒœ value๋ฅผ expectation์„ ํ†ตํ•ด์„œ ์—ฐ๊ฒฐํ•œ ์‹์ธ ๋ฐ˜๋ฉด ์œ„์˜ ์‹์€ ์‹ค์ œ ๊ณ„์‚ฐํ•˜๋Š” ๊ณผ์ •์„ ์ˆ˜์‹ํ™”
  • v_pi_(s) [s์˜ value]๋ฅผ 1-time-step์—์„œ action a๋ฅผ ์‹คํ–‰ํ•˜๊ธฐ ์ „์„ ๋ถ„ํ•ดํ•ด๋ณด๋ฉด s์—์„œ a๋ฅผ ์‹คํ–‰ํ•  ํ™•๋ฅ ๊ณผ s์—์„œ a๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๊ฒƒ์˜ value์˜ ๊ณฑ์œผ๋กœ ํ‘œํ˜„๋œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋ชจ๋“  time-step์— ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ง‘ํ•ฉ A์— ๋Œ€ํ•ด sum
  • q_pi_(s) [s์—์„œ a๋ฅผ ์‹คํ–‰ํ•˜๋Š” ๊ฒƒ์˜ value]๋Š” action a๋ฅผ ํ†ตํ•ด์„œ ์–ป๋Š” ๋ณด์ƒ๊ณผ s์—์„œ a๋ฅผ ํ†ตํ•ด s'๊ฐˆ ํ™•๋ฅ ๊ณผ s'์˜ value๋ฅผ ๊ณฑํ•œ ๊ฒƒ์„ ์ดํ›„ ๋ชจ๋“  ์ƒํƒœ์— ์ ์šฉํ•œ sum์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.
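
A numeric sketch of the two relations above, using made-up tables for pi(a|s), r^a_s, and P^a_ss'. Iterating the two assignments to a fixed point is iterative policy evaluation; it also foreshadows the cross-substituted equation in the next item.

```python
import numpy as np

# Tabular toy MDP: 2 states x 2 actions. All numbers are made up for illustration.
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])                    # pi[s, a] = prob of action a in state s
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                     # R[s, a] = r^a_s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])       # P[s, a, s'] = P^a_ss'
GAMMA = 0.9

v = np.zeros(len(R))                           # some current estimate of v_pi
for _ in range(500):                           # iterate the two relations to a fixed point
    q = R + GAMMA * np.einsum("sap,p->sa", P, v)   # q_pi(s,a) = r^a_s + gamma * sum_s' P^a_ss' v_pi(s')
    v = np.einsum("sa,sa->s", pi, q)               # v_pi(s)   = sum_a pi(a|s) q_pi(s,a)

print(v, q)
```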

  • v_pi(s) = sum_a pi(a|s) * ( r^a_s + gamma * sum_s' P^a_ss' * v_pi(s') ),   q_pi(s,a) = r^a_s + gamma * sum_s' P^a_ss' * sum_a' pi(a'|s') * q_pi(s',a')
    • ์œ„์˜ ์ˆ˜์‹์„ ํ†ตํ•ด ์„œ๋กœ ๊ต์ฐจํ•˜์—ฌ ๋Œ€์ž…ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์œ ๋„๋œ๋‹ค.
    • ์œ„์˜ ์‹์œผ๋กœ ํ†ตํ•ด r_s, P_ss'๋ฅผ ์•Œ๋ฉด ์ง์ ‘์ ์œผ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค€๋‹ค.

  • v_pi = R^pi + gamma * P^pi * v_pi   =>   v_pi = (I - gamma*P^pi)^{-1} * R^pi
    • The same equation expressed in matrix form, which can be solved directly as a linear system. (A direct-solve sketch follows below.)
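
Since everything is linear once pi is fixed, v_pi can also be obtained by a direct linear solve, as the matrix form suggests. A sketch on the same kind of made-up tabular MDP (all numbers are assumptions):

```python
import numpy as np

# Toy tabular MDP (same shapes as the sketch above): pi[s,a], R[s,a]=r^a_s, P[s,a,s']=P^a_ss'.
pi = np.array([[0.6, 0.4], [0.3, 0.7]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
GAMMA = 0.9

# Collapse the MDP to an MRP under pi:
#   R_pi[s]     = sum_a pi(a|s) r^a_s
#   P_pi[s, s'] = sum_a pi(a|s) P^a_ss'
R_pi = np.einsum("sa,sa->s", pi, R)
P_pi = np.einsum("sa,sap->sp", pi, P)

# Matrix form: v_pi = R_pi + gamma * P_pi v_pi  =>  v_pi = (I - gamma*P_pi)^{-1} R_pi
v_pi = np.linalg.solve(np.eye(len(R_pi)) - GAMMA * P_pi, R_pi)
print(v_pi)
```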

Bellman Optimality Equation

  • v_star(s) = max_pi v_pi(s),   q_star(s,a) = max_pi q_pi(s,a)
    • This means that the total sum of rewards obtained when following the optimal policy pi_star is larger than when following any other policy.
    • Optimality of policies should be thought of in terms of a partial ordering.
      • pi >= pi'  <=>  v_pi(s) >= v_pi'(s) for all s
    • In addition, the following theorem has been proven, so one can concentrate on finding the optimal policy.
      • For any MDP there exists an optimal policy pi_star with pi_star >= pi for all pi, and every optimal policy achieves the optimal state-value function v_star(s) and the optimal action-value function q_star(s,a).

  • pi_star(a|s) = 1 if a = argmax_a q_star(s,a), and 0 otherwise
    • Since the policy function is now fixed to the optimal one, it becomes deterministic.

  • v_star(s) = max_a q_star(s,a),   q_star(s,a) = r^a_s + gamma * sum_s' P^a_ss' * v_star(s')
    • These follow from the Bellman Expectation Equation once pi is fixed to the indicator (the value 1) above. Nothing changes from q's standpoint, because its equation already assumes that a has been chosen.

  • v_star(s) = max_a ( r^a_s + gamma * sum_s' P^a_ss' * v_star(s') ),   q_star(s,a) = r^a_s + gamma * sum_s' P^a_ss' * max_a' q_star(s',a')
    • For v_star(s), the max over a is taken over the whole expression, because the whole expression depends on a. For q_star(s,a), everything involving the current action a is already determined, so the max is taken only over the next-step action a'. (A value-iteration sketch follows below.)
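
Applying the v_star backup above repeatedly is value iteration. A hedged sketch on a made-up tabular MDP (the reward and transition tables are assumptions, not data from this repository):

```python
import numpy as np

# Toy tabular MDP: value iteration applies the Bellman optimality backup until convergence.
R = np.array([[1.0, 0.0], [0.5, 2.0]])                 # R[s, a] = r^a_s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])               # P[s, a, s'] = P^a_ss'
GAMMA = 0.9

v = np.zeros(len(R))
for _ in range(1000):
    q = R + GAMMA * np.einsum("sap,p->sa", P, v)       # q(s,a)    = r^a_s + gamma * sum_s' P^a_ss' v(s')
    v_new = q.max(axis=1)                              # v_star(s) = max_a q(s,a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

pi_star = q.argmax(axis=1)                             # deterministic greedy (optimal) policy
print(v, pi_star)
```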

For What

  • The Bellman Expectation Equation is used when a policy pi is given and we want to evaluate it; the Bellman Optimality Equation is used when we want to find the optimal value.
  • MDP problems
    • model-free : when r^a_s and P^a_ss' are unknown; solved through experience
    • model-based : when both r^a_s and P^a_ss' are known

Model-Based (Planning)

Model-Free Prediction

Model-Free Control

Basics of Deep RL

DAVID SILVER, UCL Course on RL