Quoridor Pseudo Code - SeoulTechTCPGame/block-it GitHub Wiki

์ฟผ๋ฆฌ๋„

๊ทœ์น™

  1. ์ž์‹ ์˜ ์•ž์ค„ ์ค‘๊ฐ„์— ๋ง์„ ๋†“๋Š”๋‹ค.
  2. $n$๋ฒˆ์งธ ํ”Œ๋ ˆ์ด์–ด๊ฐ€ ๋ง์„ ์›€์ง์ด๊ฑฐ๋‚˜ ๋ฒฝ์„ ์„ธ์šด๋‹ค.
  3. $turnNum + 1$์„ ํ•œ๋‹ค.
  4. 2~3๋ฒˆ ๊ณผ์ •์„ ๊ฒŒ์ž„์ด ๋๋‚ ๋•Œ ๊นŒ์ง€ ๋ฐ˜๋ณตํ•œ๋‹ค.

๊ฒŒ์ž„์€ ์ž„์˜์˜ ๋ง์ด ๋ฐ˜๋Œ€ํŽธ ์ฒซ์ค„์— ๋„์ฐฉํ•˜๋ฉด ์ข…๋ฃŒ๋œ๋‹ค.

์ฟผ๋ฆฌ๋„ ํ•„์Šน ์ „๋žต

ํ•ต์‹ฌ ๊ทœ์น™์€ ์ƒ๋Œ€๋ฐฉ์ด ๋ชป์›€์ง์ด๊ฒŒ ๊ฐ€๋‘๋ฉด ์•ˆ๋œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. -> ์žฅ์• ๋ฌผ์„ ๋†“์„ ๋•Œ ์ตœ์†Œ 1์นธ์˜ ์›€์ง์ผ์ˆ˜ ์žˆ๋Š” ๊ณต๊ฐ„์„ ๋‚จ๊ฒจ์•ผ ํ•œ๋‹ค. ๋‚ด ๊ฒŒ์ž„๋ง ๋’ค์— ์žฅ์• ๋ฌผ์„ ๋‘”๋‹ค๋ฉด ๋‚ด ๊ฒฝ๋กœ์—๋Š” ์ง€์žฅ์„ ์ฃผ์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— ์œ ๋ฆฌํ•˜๋‹ค.

์‹ค๋Ÿฌ ์˜คํ”„๋‹: ์ƒ๋Œ€๋ฐฉ๊ณผ ๋‚ด๊ฐ€ ๋ชจ๋‘ 3์นธ์„ ์ „์ง„ํ–ˆ๋‹ค๋ฉด, ๋‚ด ๊ฒŒ์ž„ ๋ง์˜ ๋’ค์— ์„ธ๋กœ๋กœ ์žฅ์• ๋ฌผ์„ ์„ธ์šฐ๋Š” ๊ฒƒ์ด๋‹ค. ๊ฒŒ์ž„๋ง์˜ ์›€์ง์ž„: ๋ฌดํ•œํžˆ ์‚ฌ์šฉํ• ์ˆ˜ ์žˆ๋Š” ์ž์›์ด๋‹ค. ์žฅ์• ๋ฌผ: ๊ฐœ์ˆ˜๊ฐ€ ํ•œ์ •๋˜์–ด ์žˆ๋Š” ์ž์›์ด๋‹ค.

์ฟผ๋ฆฌ๋„์˜ ์Šน๋ฆฌ ์ „๋žต์€ ๋‚ด๊ฐ€ ์›€์ง์—ฌ์•ผํ•˜๋Š” ๊ฑฐ๋ฆฌ๋Š” ์ตœ์†Œํ•œ์œผ๋กœ ๋งŒ๋“ค๊ณ  ์ƒ๋Œ€๋ฐฉ์ด ์›€์ง์—ฌ์•ผ ํ•˜๋Š” ๊ฑฐ๋ฆฌ๋Š” ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์ด๋‹ค.

์žฅ์• ๋ฌผ์ด ๋‚ด ์•ž์„ ๋ง‰์•˜์„ ๊ฒฝ์šฐ์—๋Š” ๋‚จ์€ ์นธ์ˆ˜๊ฐ€ ํ™€์ˆ˜์ธ ์นธ์œผ๋กœ ์ด๋™ํ•˜๋Š” ๊ฒƒ์ด ์œ ๋ฆฌํ•˜๋‹ค. ํ™€์ˆ˜์ธ ์นธ์„ ๋ง‰๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋” ๋งŽ์€ ์žฅ์• ๋ฌผ์ด ํ•„์š”ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด๋Š” ๊ฒŒ์ž„ํŒ์€ ๊ฐ€๋กœ ์„ธ๋กœ 9์นธ์”ฉ ์กด์žฌํ•˜์ง€๋งŒ ์žฅ์• ๋ฌผ์ด ๋ง‰์„์ˆ˜ ์žˆ๋Š” ์นธ ์ˆ˜๋Š” 2์นธ์ด๊ธฐ์— ํ•ญ์ƒ ํ•œ์นธ์ด ๋‚จ๊ฒŒ ๋˜๊ธฐ์— ๋‚จ์€ ์นธ์ด ํ™€์ˆ˜์ธ ๊ณณ์œผ๋กœ ์›€์ง์ด๋Š” ๊ฒƒ์ด ์œ ๋ฆฌํ•˜๋‹ค. ๋˜ ํ•œ์นธ์ธ ์นธ์„ ๋ง‰๊ธฐ ์œ„ํ•ด์„œ๋Š” ์žฅ์• ๋ฌผ์ด 1๊ฐœ ์ด์ƒ ํ•„์š”ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ƒ๋Œ€๋ฐฉ์˜ ์žฅ์• ๋ฌผ ์‚ฌ์šฉ๋˜ํ•œ ์••๋ฐ• ํ• ์ˆ˜ ์žˆ๋‹ค.

๋”ฐ๋ผ์„œ ์„ ์ˆ˜๋ฅผ ์žก์•˜์„ ๊ฒฝ์šฐ ์‹ค๋Ÿฌ ์˜คํ”„๋‹์œผ๋กœ ์‹œ์ž‘ํ•˜์—ฌ ์ƒ๋Œ€์˜ ์™ผ์ชฝ ๊ธธ์„ ์ฐจ๋‹จํ•œํ›„ ์˜ค๋ฅธ์ชฝ ๊ธธ์„ ์••๋ฐ•ํ•˜๋ฉด ์‰ฝ๊ฒŒ ์ด๊ธธ์ˆ˜ ์žˆ๋‹ค.

์˜์‚ฌ ์ฝ”๋“œ ์„ค๋ช…

Initialize turnNum = 0

FOR n = 0, 1, 2, ... until the game is over:
  Set currentPlayer = n % 2
  Execute currentPlayer's action (move pawn or place a wall)
  Add 1 to turnNum
END FOR
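
The same loop in Python, as a minimal sketch; `game_over` and `execute_action` are assumed helpers, not the project's actual API:

```python
def play(game):
    """Alternate turns until the game ends; player 0 acts on even turns."""
    turn_num = 0
    while not game.game_over():              # assumed helper
        current_player = turn_num % 2        # derive the player from the turn counter
        game.execute_action(current_player)  # move the pawn or place a wall
        turn_num += 1
    return turn_num
```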

AI (DQN)

๋ฉ”์ปค๋‹ˆ์ฆ˜

๋ชฉ์  ๋ณด์ƒ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•œ action์„ ์˜ˆ์ธกํ•œ๋‹ค.

$Q^*(s,a) = \underset{\pi}{\max}\mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \vert s_t=s, a_t=a, \pi \right]$

์ด๋•Œ $r_t$๋Š” ๋ณด์ƒ์„ ์˜๋ฏธํ•˜๋ฉฐ $\gamma$๋Š” ํ˜„์žฌ ์ƒํƒœ์—์„œ ๋ฉ€์–ด์ง€๋Š” ์ƒํƒœ์— ๋Œ€ํ•œ action์— ๋”ฐ๋ฅธ ๋ณด์ƒ์— ๋Œ€ํ•œ ์ค‘์š”์„ฑ์„ ๋‚ฎ์ถ”๊ธฐ ์œ„ํ•œ ๊ฐ์†Œ์š”์ธ์œผ๋กœ ์ž‘์šฉํ•˜๋ฉฐ $t$๋Š” ๊ฐ time step์„ ์˜๋ฏธํ•œ๋‹ค.
$\pi = P(a \vert s)$ ๋กœ s์ƒํƒœ๊ฐ€ ์ผ์–ด๋‚ฌ์„๋•Œ action์„ ํ•  ํ™•๋ฅ  ์ฆ‰ ํ–‰๋™์— ์žฌํ•œ ์ •์ฑ…์„ ๋‚˜ํƒ€๋‚ด๋ฉฐ $s$๋Š” ๊ด€์ธก๊ฐ’์„ ๋‚˜ํƒ€๋‚ด๊ณ  $a$๋Š” action์„ ๋‚˜ํƒ€๋‚ธ๋‹ค.
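
As a concrete example of the discounted sum inside the expectation, with an assumed reward sequence and $\gamma = 0.99$:

```python
gamma = 0.99
rewards = [0.0, 0.0, 1.0]  # assumed example rewards r_t, r_{t+1}, r_{t+2}

# Discounted return: r_t + gamma * r_{t+1} + gamma^2 * r_{t+2}
discounted_return = sum(gamma ** k * r for k, r in enumerate(rewards))
print(discounted_return)  # 0.9801: the final reward, discounted two steps back
```
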
์ด๋•Œ $Q$ํ•จ์ˆ˜๋ฅผ ํŒŒ๋ผ๋ฏธํ„ฐํ™” ์‹œ์ผœ์•ผ ํ•˜๋Š”๋ฐ ์ •์ฑ…์œผ๋กœ ๋ถ€ํ„ฐ ์˜ˆ์ธก ๋˜๋Š” action์„ ํŒŒ๋ผ๋ฏธํ„ฐํ™” ํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ธ๋‹ค.

$Q(s,a; \theta_i)$

์ด๋•Œ $\theta$๋Š” deep convolutional neural network์—์„œ ์‚ฌ์šฉํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ์ด๋ฉฐ $\theta_i$ ๋Š” $iteration$ $i$ ์—์„œ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ์ด๋‹ค.

DQN์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ด€์ธก ์‹œํ€€์Šค์— ๋”ฐ๋ฅธ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ์—†์• ๊ธฐ ์œ„ํ•ด replay ๋ฐฉ์‹์˜ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ฉฐ
์ด๋•Œ ๊ฐ ๊ด€์ธก๊ฒฝํ—˜ experience๋ฅผ $e_t = \set{s_t, a_t, r_t, s_{t+1}}$ ๋กœ ๋‚˜ํƒ€๋‚ธ๋‹ค.
์ด ๊ฒฝํ—˜์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ๋“ค์€ $D_t = \set{e_1, ... , e_t}$ ์— ์ €์žฅ๋œ๋‹ค.

์ด ์—์ด์ „ํŠธ๋ฅผ ํ•™์Šต ์‹œํ‚ค๊ธฐ ์œ„ํ•ด์„œ๋Š” ์†์‹คํ•ฉ์ˆ˜(=๋น„์šฉํ•จ์ˆ˜)๊ฐ€ ํ•„์š”ํ•œ๋ฐ ์ด ๋น„์šฉํ•จ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$L_i\left(\theta_i\right)=\mathbb{E}_{\left(s,a,r,s'\right) \sim U(D)}\left[\left( r+\gamma \underset{a'}{\max}Q(s',a';\theta^-_i) - Q(s,a;\theta_i)\right)^2\right]$

์ด ์ˆ˜์‹์—์„œ $\gamma$๋Š” ํ˜„์žฌ ์ƒํƒœ์—์„œ์˜ ํ–‰๋™์— ๋Œ€ํ•œ ๋ณด์ƒ์˜ ์˜ˆ์ธก๊ฐ’์ด ํ˜„์žฌ ์ƒํƒœ์™€ ๋ฉ€์–ด์งˆ์ˆ˜๋ก ๊ฐ์†Œ์‹œํ‚ค๋Š” ๊ฐ์†Œ์š”์ธ์ด ๋œ๋‹ค.
๋˜ํ•œ $\theta^-_i$๋Š” ๋„คํŠธ์›Œํฌ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์ด๋ฉฐ $target$๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด $iteration$ $i$์— ์‚ฌ์šฉ๋œ๋‹ค.

์˜์‚ฌ ์ฝ”๋“œ ์„ค๋ช…

Algorithm: DQN with experience replay.
Initialize replay memory D to capacity N
Initialize action-value function $Q$ with random weights $\theta$
Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
For $episode=1, M$ do
       Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1=\phi(s_1)$
       For $t=1, T$ do
               With probability $\varepsilon$ select a random action $a_t$
               otherwise select $a_t = \underset{a}{\arg\max}\,Q(\phi(s_t),a;\theta)$
               Execute action $a_t$ in emulator and observe reward $r_t$ and image $x_{t+1}$
               Set $s_{t+1}=s_t,a_t,x_{t+1}$ and preprocess $\phi_{t+1}=\phi(s_{t+1})$
               Store transition $\left( \phi_t, a_t,r_t,\phi_{t+1} \right)$ in $D$
               Sample random minibatch of transitions $\left( \phi_j,a_j,r_j,\phi_{j+1}\right)$ from $D$
               If episode terminates at step j+1: Set $y_j=r_j$
               Else: Set $y_j=r_j+\gamma \underset{a'}{\max}\hat{Q}(\phi_{j+1},a';\theta^-)$
               Perform a gradient descent step on $(y_j-Q(\phi_j,a_j;\theta))^2$ with respect to the network parameters $\theta$
               Every $C$ steps reset $\hat{Q} = Q$
       End For
End For
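
The inner-loop steps above condense into one training step; a sketch reusing the hypothetical `QNetwork`, `ReplayMemory`, and `dqn_loss` from earlier (the epsilon schedule and state encoding are assumptions):

```python
import random
import torch

def select_action(q_net, state, num_actions, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else argmax_a Q."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax(dim=1).item())

def train_step(q_net, target_net, memory, optimizer, batch_size=32):
    """Sample a minibatch from D, compute the loss, take one gradient step."""
    if len(memory) < batch_size:
        return  # not enough experience gathered yet
    batch = memory.sample(batch_size)
    states = torch.stack([e[0] for e in batch])
    actions = torch.tensor([e[1] for e in batch], dtype=torch.long)
    rewards = torch.tensor([e[2] for e in batch], dtype=torch.float32)
    next_states = torch.stack([e[3] for e in batch])
    loss = dqn_loss(q_net, target_net, (states, actions, rewards, next_states))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Every $C$ training steps the target network is refreshed with `target_net.load_state_dict(q_net.state_dict())`, matching the final line of the pseudocode.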

โš ๏ธ **GitHub.com Fallback** โš ๏ธ