Caffe Tutorial: 4. Solver

Solver

The solver orchestrates model optimization by coordinating the network's forward inference and backward gradients to form parameter updates that attempt to improve the loss. The responsibilities of learning are divided between the solver, which supervises the optimization and generates parameter updates, and the net, which yields the loss and gradients.

Caffe์˜ ํ•ด๊ฒฐ์‚ฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • Stochastic Gradient Descent (type: "SGD")
  • AdaDelta (type: "AdaDelta")
  • Adaptive Gradient (type: "AdaGrad")
  • Adam (type: "Adam")
  • Nesterov's Accelerated Gradient (type: "Nesterov")
  • RMSprop (type: "RMSProp")

The solver:

  1. ์ตœ์ ํ™” ๊ณผ์ • ๊ธฐ๋ก์˜ ๋ฐœํŒ์„ ๋งˆ๋ จํ•ด์ฃผ๊ณ  ํ•™์Šต์„ ์œ„ํ•œ ํ›ˆ๋ จ ๋„คํŠธ์›Œํฌ์™€ ํ‰๊ฐ€๋ฅผ ์œ„ํ•œ ์‹คํ—˜ ๋„คํŠธ์›Œํฌ๋ฅผ ์ƒ์„ฑํ•ด์ค€๋‹ค.
  2. ๋ฐ˜๋ณต์ ์œผ๋กœ ์ •๋ฐฉํ–ฅ / ์—ญ๋ฐฉํ–ฅ์„ ํ˜ธ์ถœํ•˜๊ณ  ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธํ•จ์œผ๋กœ์จ ์ตœ์ ํ™”๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค.
  3. (์ฃผ๊ธฐ์ ์œผ๋กœ) ํ…Œ์ŠคํŠธ ๋„คํŠธ์›Œํฌ๋“ค์„ ํ‰๊ฐ€ํ•œ๋‹ค.
  4. ์ตœ์ ํ™” ๋‚ด๋‚ด ๋ชจ๋ธ๊ณผ ํ•ด๊ฒฐ์‚ฌ ์ƒํƒœ์˜ ์Šค๋ƒ…์ƒท์„ ์ฐ๋Š”๋‹ค.

๊ฐ ๋ฐ˜๋ณต๋งˆ๋‹ค ์ดˆ๊ธฐํ™”๋ถ€ํ„ฐ ํ•™์Šต๋œ ๋ชจ๋ธ๊นŒ์ง€ ๋ชจ๋“  ๋ฐฉ๋ฒ•์— ๊ฐ€์ค‘์น˜๋ฅผ ์ทจํ•˜๊ธฐ ์œ„ํ•ด

  1. calls the network forward to compute the output and loss.
  2. calls the network backward to compute the gradients.
  3. incorporates the gradients into parameter updates according to the solver method.
  4. updates the solver state according to the learning rate, history, and method.
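
As a rough illustration only, this loop could be sketched in Python as follows; the net object and its forward/backward/test/snapshot methods are hypothetical stand-ins for Caffe's internals, not its actual API.

def solve(net, max_iter, test_interval, snapshot_interval):
    # A minimal sketch of the solver loop (hypothetical `net` API).
    for it in range(1, max_iter + 1):
        loss = net.forward()                 # 1. compute output and loss
        grads = net.backward()               # 2. compute gradients
        update = net.compute_update(grads)   # 3. solver-method-specific update
        net.apply_update(update)             # 4. apply it; update solver state
        if it % test_interval == 0:
            net.test()                       # periodically evaluate test nets
        if it % snapshot_interval == 0:
            net.snapshot(it)                 # snapshot model and solver state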

Caffe ๋ชจ๋ธ๋“ค๊ณผ ๊ฐ™์ด, Caffe ํ•ด๊ฒฐ์‚ฌ๋„ CPU์™€ GPU ๋ชจ๋“œ์—์„œ ์ž‘๋™ํ•œ๋‹ค.

1. Methods

The solver methods address the general optimization problem of loss minimization. For a dataset D, the optimization objective is the average loss over all |D| data instances throughout the dataset:

L(W) = \frac{1}{|D|} \sum_i^{|D|} f_W\left(X^{(i)}\right) + \lambda r(W)

where f_W(X^(i)) is the loss on data instance X^(i) and r(W) is a regularization term with weight λ. |D| can be very large, so in practice, in each solver iteration we use a stochastic approximation of this objective, drawing a mini-batch of N << |D| instances:

L(W) \approx \frac{1}{N} \sum_i^N f_W\left(X^{(i)}\right) + \lambda r(W)

๋ชจ๋ธ์€ ์ •๋ฐฉํ–ฅ๊ณผ์ •์—์„œ๋Š” fw๋ฅผ ์—ฐ์‚ฐํ•˜๊ณ , ์—ญ๋ฐฉํ–ฅ ๊ณผ์ •์—์„œ๋Š” ๊ทธ๋ž˜๋””์–ธํŠธ โˆ‡fw๋ฅผ ์—ฐ์‚ฐํ•œ๋‹ค. ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ ฮ”W๋Š” ์—๋Ÿฌ ๊ทธ๋ž˜๋””์–ธํŠธ โˆ‡fw, ์กฐ์งํ™” ๊ทธ๋ž˜๋””์–ธํŠธ(regularization gradient)โˆ‡r(W), ๊ทธ๋ฆฌ๊ณ  ๋‹ค๋ฅธ ํŠน์ •ํ•œ ๊ฐ๊ฐ์˜ ๋ฉ”์†Œ๋“œ ๋ถ€ํ„ฐ์˜ ํ•ด๊ฒฐ์‚ฌ์— ์˜ํ•ด ์ƒ์„ฑ๋œ๋‹ค.

1. Stochastic Gradient Descent (SGD)

Stochastic gradient descent (type: "SGD") updates the weights W by a linear combination of the negative gradient ∇L(W) and the previous weight update V_t. The learning rate α is the weight of the negative gradient, and the momentum μ is the weight of the previous update. Formally, given the previous weight update V_t and the current weights W_t, we have the following formulas to compute the update value V_{t+1} and the updated weights W_{t+1} at iteration t+1:

V_{t+1} = \mu V_t - \alpha \nabla L(W_t)
W_{t+1} = W_t + V_{t+1}
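
A minimal NumPy sketch of this rule, using a toy quadratic loss L(W) = 0.5·||W||^2 (whose gradient is simply W):

import numpy as np

def sgd_update(W, V, grad, lr=0.01, momentum=0.9):
    # V_{t+1} = mu * V_t - alpha * grad(W_t); W_{t+1} = W_t + V_{t+1}
    V_next = momentum * V - lr * grad
    return W + V_next, V_next

W, V = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    W, V = sgd_update(W, V, grad=W)  # grad of 0.5*||W||^2 is W
print(W)  # W has moved toward the minimum at [0, 0]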

"ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ" (ฮฑ ์™€ ฮผ)๋ฅผ ํ•™์Šตํ•˜๋Š”๊ฒƒ์€ ์ตœ๋Œ€์˜ ๊ฒฐ๊ณผ์— ๋Œ€ํ•œ ์•ฝ๊ฐ„์˜ ์กฐ์œจ์ด ์š”๊ตฌ๋ ์ง€ ๋ชจ๋ฅธ๋‹ค. ๋งŒ์•ฝ ์–ด๋””์„œ ์‹œ์ž‘ํ• ์ง€์— ๋Œ€ํ•œ ํ™•์‹ ์ด ์—†๋‹ค๋ฉด, ์•„๋ž˜ "์—„์ง€์†๊ฐ€๋ฝ์˜ ๊ทœ์น™"์„ ๋ณด๊ณ ์˜ค๋ผ, ๊ทธ๋ฆฌ๊ณ  ๋” ๋งŽ์€ ์ •๋ณด๊ฐ€ ํ•„์š”ํ•˜๋‹ค๋ฉด Leon Bottou ์ €์˜ ํ™•์œจ์  ๊ธฐ์šธ๊ธฐ ๊ฐ•ํ•˜ ์†์ž„์ˆ˜ (Stochastic Gradient Descent Tricks)๋ฅผ ์ฐธ๊ณ ํ•  ์ˆ˜ ์žˆ๋‹ค. #######[1] L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade: Springer, 2012.

Rules of thumb for setting the learning rate α and momentum μ

A good strategy for deep learning with SGD is to initialize the learning rate α at a value around α ≈ 0.01 = 10^(-2), and drop it by a constant factor (such as 10) throughout training whenever the loss begins to reach an apparent "plateau". Generally, you may want to use a momentum of μ = 0.9 or a similar value. By smoothing the weight updates across iterations, momentum tends to make deep learning with SGD both more stable and faster. This is the strategy used by Krizhevsky et al. [1] in their winning CNN entry to the ILSVRC-2012 competition, and Caffe makes it easy to implement via the SolverParameter. To use a learning rate policy like this, you can add the following lines somewhere in your solver prototxt file:

base_lr: 0.01     # begin training at a learning rate of 0.01 = 1e-2

lr_policy: "step" # learning rate policy: drop the learning rate in "steps"
                  # by a factor of gamma every stepsize iterations

gamma: 0.1        # drop the learning rate by a factor of 10
                  # (i.e., multiply it by a factor of gamma = 0.1)

stepsize: 100000  # drop the learning rate every 100K iterations

max_iter: 350000  # train for a total of 350K iterations

momentum: 0.9

์œ„์˜ ์„ค์ • ํ•˜์—, ์šฐ๋ฆฌ๋Š” ํ•ญ์ƒ ๋ชจ๋ฉ˜ํ…€ ฮผ=0.9์„ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๋‹ค. ์šฐ๋ฆฌ๋Š” ์ฒซ 10๋งŒ๋ฒˆ ๋ฐ˜๋ณต์— ๋Œ€ํ•ด ฮฑ=0.01=10^(โˆ’2)์˜ "base_lr"์—์„œ ํ•™์Šต์„ ์‹œ์ž‘ํ•  ๊ฒƒ์ด๊ณ , ๊ทธ๋ฆฌ๊ณ ๋‚˜์„œ ๊ฐ๋งˆ(ฮณ)๋ฅผ ํ•™์Šต์œจ์— ๊ณฑ์…ˆํ•˜๊ณ  10๋งŒ๋ฒˆ20๋งŒ๋ฒˆ ๋ฐ˜๋ณต์— ๋Œ€ํ•˜์—ฌ ฮฑโ€ฒ=ฮฑฮณ=(0.01)(0.1)=0.001=10โˆ’3์—์„œ ํ•™์Šต์„ ํ•˜๊ณ , 20๋งŒ๋ฒˆ30๋งŒ๋ฒˆ ๋ฐ˜๋ณต์— ๋Œ€ํ•ด์„œ๋Š” ฮฑโ€ฒโ€ฒ=10^(โˆ’4)์—์„œ, ๊ทธ๋ฆฌ๊ณ  ๋งˆ์ง€๋ง‰์œผ๋กœ 350๋ฒˆ์งธ ๋ฐ˜๋ณต๊นŒ์ง€๋Š” (์šฐ๋ฆฌ๊ฐ€ max_iter: 350000๋กœ ์„ค์ •ํ•ด ๋‘์—ˆ๊ธฐ์—) ฮฑโ€ฒโ€ฒโ€ฒ=10^(โˆ’5)์—์„œ ํ•™์Šตํ•œ๋‹ค.

Note that the momentum setting μ effectively multiplies the size of your updates by a factor of 1/(1-μ) after many iterations of training, so if you increase μ, it may be a good idea to decrease α accordingly (and vice versa). For example, with μ = 0.9, we have an effective update size multiplier of 1/(1-0.9) = 10. If we increased the momentum to μ = 0.99, we'd have increased our update size multiplier to 100, so we should drop α (base_lr) by a factor of 10.

๋˜ํ•œ ์œ„์˜ ์„ค์ •์€ ๋‹จ์ง€ ๊ฐ€์ด๋“œ๋ผ์ธ์ด๋ฉฐ, ๋ถ„๋ช…ํžˆ ๋ชจ๋“  ์ƒํ™ฉ์—์„œ ์œ„ ์„ค์ •์ด ์ตœ์ ์ด๋ผ๋Š” ๋ณด์žฅ์ด์—†๋‹ค. ๋งŒ์•ฝ ํ•™์Šตํ•˜๋Š”๊ฒƒ์ด ๋‚˜๋‰˜๋ฉด base_lr(์˜ˆ๋ฅผ๋“ค๋ฉด base_lr: 0.001)๋ฅผ ๋‚ฎ์ถ”๊ฑฐ๋‚˜ ์•„๋‹ˆ๋ฉด ์žฌ ํ›ˆ๋ จ์‹œํ‚ค๋Š” ๊ฒƒ์ด๋‚˜ ์ ๋‹นํ•œ base_lr ๊ฐ’์„ ์ฐพ์„ ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณตํ•ด๋ณด์•„๋ผ.

[1] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012.

2. AdaDelta

The AdaDelta method (type: "AdaDelta") (M. Zeiler [1]) is a robust learning rate method. Like SGD, it is a gradient-based optimization method. The update formulas are:

(v_t)_i = \frac{\operatorname{RMS}((v_{t-1})_i)}{\operatorname{RMS}\left( \nabla L(W_t) \right)_{i}} \left( \nabla L(W_t) \right)_i
\operatorname{RMS}\left( \nabla L(W_t) \right)_{i} = \sqrt{E[g^2] + \varepsilon}
E[g^2]_t = \delta E[g^2]_{t-1} + (1-\delta) g_t^2

(W_{t+1})_i = (W_t)_i - \alpha (v_t)_i
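
A NumPy sketch of one AdaDelta step under these formulas, keeping running averages of both the squared gradients and the squared updates as in Zeiler's algorithm; the parameter defaults here are illustrative:

import numpy as np

def adadelta_update(W, Eg2, Ev2, grad, lr=1.0, delta=0.95, eps=1e-8):
    # running average of squared gradients: E[g^2]_t
    Eg2 = delta * Eg2 + (1 - delta) * grad**2
    # v_t = RMS(v_{t-1}) / RMS(grad_t) * grad_t
    v = np.sqrt(Ev2 + eps) / np.sqrt(Eg2 + eps) * grad
    # running average of squared updates, used by the next step's RMS(v)
    Ev2 = delta * Ev2 + (1 - delta) * v**2
    return W - lr * v, Eg2, Ev2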
 

[1] M. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint, 2012.

3. AdaGrad

The adaptive gradient method (type: "AdaGrad") (Duchi et al. [1]) is a gradient-based optimization method (like SGD) that attempts to "find needles in haystacks in the form of very predictive but rarely seen features," in Duchi et al.'s words. Given the update information from all previous iterations (∇L(W))_{t'} for t' ∈ {1, 2, ..., t}, the update formulas proposed by [1] are as follows, specified for each component i of the weights W:

(W_{t+1})_i = (W_t)_i - \alpha \frac{\left( \nabla L(W_t) \right)_{i}}{\sqrt{\sum_{t'=1}^{t} \left( \nabla L(W_{t'}) \right)_i^2}}
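
A NumPy sketch of this rule follows; the hist argument is the running sum of squared gradients that provides the O(d) storage discussed below, and the small eps added to the denominator is a common practical safeguard, not part of the formula above:

import numpy as np

def adagrad_update(W, hist, grad, lr=0.01, eps=1e-8):
    # hist is the O(d) running sum of squared gradients, one entry
    # per weight, instead of the full O(dt) gradient history
    hist = hist + grad**2
    return W - lr * grad / (np.sqrt(hist) + eps), hist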

Note that in practice, for weights W ∈ R^d, AdaGrad implementations (including the one in Caffe) use only O(d) extra storage for the historical gradient information (rather than the O(dt) storage that would be needed to store each historical gradient individually).

[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. The Journal of Machine Learning Research, 2011.

4. Adam

The Adam method (type: "Adam"), proposed by Kingma et al. [1], is a gradient-based optimization method (like SGD). It includes "adaptive moment estimation" (m_t, v_t) and can be regarded as a generalization of AdaGrad. The update formulas are:

(m_t)_i = \beta_1 (m_{t-1})_i + (1-\beta_1)(\nabla L(W_t))_i
(v_t)_i = \beta_2 (v_{t-1})_i + (1-\beta_2)(\nabla L(W_t))_i^2

(W_{t+1})_i = (W_t)_i - \alpha \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t}\frac{(m_t)_i}{\sqrt{(v_t)_i}+\varepsilon}

Kingma et al. [1] proposed to use β_1 = 0.9, β_2 = 0.999, ε = 10^(-8) as default values. Caffe uses the parameters momentum, momentum2, and delta for β_1, β_2, and ε, respectively.
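
One Adam step under these formulas, as a NumPy sketch (t counts iterations from 1; the bias correction is folded into the step size as in the formula above):

import numpy as np

def adam_update(W, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # bias-corrected step size
    step = lr * np.sqrt(1 - beta2**t) / (1 - beta1**t)
    return W - step * m / (np.sqrt(v) + eps), m, v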

[1] D. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. International Conference for Learning Representations, 2015.

5. NAG (Nesterov's accelerated gradient)

๋„ค์ŠคํŠธ๋กœ๋ธŒ์˜ ๊ฐ€์†๋œ ๊ทธ๋ž˜๋””์–ธํŠธ ("Nesterov"๋ผ๊ณ  ์นœ๋‹ค.)๋Š” O(1/t)๋ณด๋‹ค O(1/(t^2))์˜ ์ˆ˜๋ ด๋ฅ ์„ ๋‹ฌ์„ฑํ•˜๋ฉด์„œ ๋ณผ๋กํ•œ ์ตœ์ ํ™”(convex optimization)์˜ "์ตœ์ ์˜" ๋ฐฉ๋ฒ•์œผ๋กœ์จ Nesterov๋Š” [1]์„ ์ œ์‹œํ–ˆ๋‹ค. ๋น„๋ก ์ˆ˜๋ ด O(1/t2)๋ฅผ ๋‹ฌ์„ฑํ•˜๊ธฐ์œ„ํ•ด ํ•„์š”๋กœํ•˜๋Š” ์†Œ๋น„๊ฐ€ ์ผ๋ฐ˜์ ์œผ๋กœ Caffe๋กœ ํ›ˆ๋ จ์‹œํ‚จ ์‹ฌ์ธต ๋„คํŠธ์›Œํฌ๋“ค์— ์ž๋ฆฌ์žก์ง€๋Š” ์•Š๋”๋ผ๋„, Sutskever์™€ ๊ทธ์˜ ๋™๋ฃŒ๋“ค์ด deep MNIST autoencoders [2]๋ฅผ ๋ฌ˜์‚ฌํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, ์‹ค์ œ NAG๋Š” ์‹ฌ์ธตํ•™์Šต ๊ตฌ์กฐ์˜ ํŠน์ •ํ•œ ํƒ€์ž…๋“ค์„ ์ตœ์ ํ™”ํ•˜๋Š”๋ฐ ๋งค์šฐ ํšจ๊ณผ์ ์ธ ๋ฐฉ๋ฒ•์ด๋‹ค. ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ ๊ณต์‹์€ ์œ„์˜ SGD ์—…๋ฐ์ดํŠธ์—์„œ ๋ณด์ธ ๊ฒƒ๊ณผ ๋งค์šฐ ์œ ์‚ฌํ•˜๋‹ค.

V_{t+1} = \mu V_t - \alpha \nabla L(W_t + \mu V_t)
W_{t+1} = W_t + V_{t+1}

What distinguishes this method from SGD is the weight setting W on which we compute the error gradient ∇L(W): in NAG we take the gradient on weights with added momentum, ∇L(W_t + μV_t), whereas in SGD we simply take the gradient ∇L(W_t) on the current weights themselves.
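
In code, the only change from the SGD sketch is where the gradient is evaluated, so this sketch takes a gradient function rather than a precomputed gradient (the toy loss is again L(W) = 0.5·||W||^2):

import numpy as np

def nesterov_update(W, V, grad_fn, lr=0.01, momentum=0.9):
    # evaluate the gradient at the look-ahead point W + mu*V,
    # not at W itself: the only difference from plain SGD
    V_next = momentum * V - lr * grad_fn(W + momentum * V)
    return W + V_next, V_next

W, V = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    W, V = nesterov_update(W, V, grad_fn=lambda w: w)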

[1] Y. Nesterov. A Method of Solving a Convex Programming Problem with Convergence Rate O(1/\sqrt{k}). Soviet Mathematics Doklady, 1983.

[2] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 2013.

6. RMSprop

RMSprop (type: "RMSProp"), suggested by Tieleman in a Coursera course lecture, is a gradient-based optimization method (like SGD). The update formulas are:

\operatorname{MS}((W_t)_i)= \delta\operatorname{MS}((W_{t-1})_i)+ (1-\delta)(\nabla L(W_t))_i^2 \\
(W_{t+1})_i= (W_{t})_i -\alpha\frac{(\nabla L(W_t))_i}{\sqrt{\operatorname{MS}((W_t)_i)}}

The default value of δ (rms_decay) is set to δ = 0.99.
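
A NumPy sketch of one RMSProp step under these formulas (the small eps in the denominator is a common practical safeguard, not part of the formula above):

import numpy as np

def rmsprop_update(W, ms, grad, lr=0.01, rms_decay=0.99, eps=1e-8):
    # running mean of squared gradients: MS_t
    ms = rms_decay * ms + (1 - rms_decay) * grad**2
    # divide the gradient by the running RMS of its recent magnitudes
    return W - lr * grad / (np.sqrt(ms) + eps), ms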

[1] T. Tieleman, and G. Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. Technical report, 2012.

2. Scaffolding

๋ฐœํŒ์„ ๋งˆ๋ จํ•˜๋Š” ํ•ด๊ฒฐ์‚ฌ๋Š” "Solver::Presolve()"์—์„œ ํ•™์Šต๋˜์–ด์ง€๊ธฐ ์œ„ํ•œ ๋ชจ๋ธ์„ ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ๋ฉ”์†Œ๋“œ ์ตœ์ ํ™”๋ฅผ ์ค€๋น„ํ•œ๋‹ค.

> caffe train -solver examples/mnist/lenet_solver.prototxt
I0902 13:35:56.474978 16020 caffe.cpp:90] Starting Optimization
I0902 13:35:56.475190 16020 solver.cpp:32] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.01
display: 100
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
solver_mode: GPU
net: "examples/mnist/lenet_train_test.prototxt"
  • ๋ง ์ดˆ๊ธฐํ™” (Net initialization)
I0902 13:35:56.655681 16020 solver.cpp:72] Creating training net from net file: examples/mnist/lenet_train_test.prototxt
[...]
I0902 13:35:56.656740 16020 net.cpp:56] Memory required for data: 0
I0902 13:35:56.656791 16020 net.cpp:67] Creating Layer mnist
I0902 13:35:56.656811 16020 net.cpp:356] mnist -> data
I0902 13:35:56.656846 16020 net.cpp:356] mnist -> label
I0902 13:35:56.656874 16020 net.cpp:96] Setting up mnist
I0902 13:35:56.694052 16020 data_layer.cpp:135] Opening lmdb examples/mnist/mnist_train_lmdb
I0902 13:35:56.701062 16020 data_layer.cpp:195] output data size: 64,1,28,28
I0902 13:35:56.701146 16020 data_layer.cpp:236] Initializing prefetch
I0902 13:35:56.701196 16020 data_layer.cpp:238] Prefetch initialized.
I0902 13:35:56.701212 16020 net.cpp:103] Top shape: 64 1 28 28 (50176)
I0902 13:35:56.701230 16020 net.cpp:103] Top shape: 64 1 1 1 (64)
[...]
I0902 13:35:56.703737 16020 net.cpp:67] Creating Layer ip1
I0902 13:35:56.703753 16020 net.cpp:394] ip1 <- pool2
I0902 13:35:56.703778 16020 net.cpp:356] ip1 -> ip1
I0902 13:35:56.703797 16020 net.cpp:96] Setting up ip1
I0902 13:35:56.728127 16020 net.cpp:103] Top shape: 64 500 1 1 (32000)
I0902 13:35:56.728142 16020 net.cpp:113] Memory required for data: 5039360
I0902 13:35:56.728175 16020 net.cpp:67] Creating Layer relu1
I0902 13:35:56.728194 16020 net.cpp:394] relu1 <- ip1
I0902 13:35:56.728219 16020 net.cpp:345] relu1 -> ip1 (in-place)
I0902 13:35:56.728240 16020 net.cpp:96] Setting up relu1
I0902 13:35:56.728256 16020 net.cpp:103] Top shape: 64 500 1 1 (32000)
I0902 13:35:56.728270 16020 net.cpp:113] Memory required for data: 5167360
I0902 13:35:56.728287 16020 net.cpp:67] Creating Layer ip2
I0902 13:35:56.728304 16020 net.cpp:394] ip2 <- ip1
I0902 13:35:56.728333 16020 net.cpp:356] ip2 -> ip2
I0902 13:35:56.728356 16020 net.cpp:96] Setting up ip2
I0902 13:35:56.728690 16020 net.cpp:103] Top shape: 64 10 1 1 (640)
I0902 13:35:56.728705 16020 net.cpp:113] Memory required for data: 5169920
I0902 13:35:56.728734 16020 net.cpp:67] Creating Layer loss
I0902 13:35:56.728747 16020 net.cpp:394] loss <- ip2
I0902 13:35:56.728767 16020 net.cpp:394] loss <- label
I0902 13:35:56.728786 16020 net.cpp:356] loss -> loss
I0902 13:35:56.728811 16020 net.cpp:96] Setting up loss
I0902 13:35:56.728837 16020 net.cpp:103] Top shape: 1 1 1 1 (1)
I0902 13:35:56.728849 16020 net.cpp:109]     with loss weight 1
I0902 13:35:56.728878 16020 net.cpp:113] Memory required for data: 5169924
  • Loss
I0902 13:35:56.728893 16020 net.cpp:170] loss needs backward computation.
I0902 13:35:56.728909 16020 net.cpp:170] ip2 needs backward computation.
I0902 13:35:56.728924 16020 net.cpp:170] relu1 needs backward computation.
I0902 13:35:56.728938 16020 net.cpp:170] ip1 needs backward computation.
I0902 13:35:56.728953 16020 net.cpp:170] pool2 needs backward computation.
I0902 13:35:56.728970 16020 net.cpp:170] conv2 needs backward computation.
I0902 13:35:56.728984 16020 net.cpp:170] pool1 needs backward computation.
I0902 13:35:56.728998 16020 net.cpp:170] conv1 needs backward computation.
I0902 13:35:56.729014 16020 net.cpp:172] mnist does not need backward computation.
I0902 13:35:56.729027 16020 net.cpp:208] This network produces output loss
I0902 13:35:56.729053 16020 net.cpp:467] Collecting Learning Rate and Weight Decay.
I0902 13:35:56.729071 16020 net.cpp:219] Network initialization done.
I0902 13:35:56.729085 16020 net.cpp:220] Memory required for data: 5169924
I0902 13:35:56.729277 16020 solver.cpp:156] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test.prototxt
  • Completion
I0902 13:35:56.806970 16020 solver.cpp:46] Solver scaffolding done.
I0902 13:35:56.806984 16020 solver.cpp:165] Solving LeNet

3. ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธํ•˜๊ธฐ (Updating Parameters)

The actual weight update is made by the solver and then applied to the net parameters in Solver::ComputeUpdateValue(). The ComputeUpdateValue method incorporates any weight decay r(W) into the weight gradients (which currently just contain the error gradients) to get the final gradient with respect to each network weight. Then these gradients are scaled by the learning rate α, and the update to subtract is stored in each parameter Blob's diff field. Finally, the Blob::Update method is called on each parameter blob, which performs the final update (subtracting the Blob's diff from its data).
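
In NumPy terms, the sequence for each parameter blob might look like the following sketch; the data/diff field names mirror Caffe's Blob, while the dict representation and function names are illustrative:

import numpy as np

def compute_update_value(blob, lr, weight_decay):
    # fold weight decay into the error gradient already held in diff
    # (assuming L2 regularization, whose gradient is the weights themselves)
    blob['diff'] += weight_decay * blob['data']
    # scale by the learning rate to get the update to subtract
    blob['diff'] *= lr

def blob_update(blob):
    # Blob::Update performs the final step: data -= diff
    blob['data'] -= blob['diff']

blob = {'data': np.array([1.0, -2.0]), 'diff': np.array([0.1, 0.3])}
compute_update_value(blob, lr=0.01, weight_decay=0.0005)
blob_update(blob)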

4. Snapshotting and Resuming

The solver snapshots the weights and its own state during training in Solver::Snapshot() and Solver::SnapshotSolverState(). The weight snapshots export the learned model, while the solver snapshots allow training to be resumed from a given point; training is resumed by Solver::Restore() and Solver::RestoreSolverState(). Weights are saved without extension while solver states are saved with a .solverstate extension. Both files include an _iter_N suffix for the snapshot iteration number. Snapshotting is configured as follows:

# ๋ฐ˜๋ณต์—์„œ ์Šค๋ƒ…์ƒท ๊ฐ„๊ฒฉ
snapshot: 5000
# ๋ชจ๋ธ ๊ฐ€์ค‘์น˜์™€ ํ•ด๊ฒฐ์‚ฌ ์ƒํƒœ๋ฅผ ์Šค๋ƒ…์ƒท์œผ๋กœ ์ฐ์–ด๋†“์€ ๊ฒƒ์— ๋Œ€ํ•œ ํŒŒ์ผ ๊ฒฝ๋กœ ์ ‘๋ฏธ์‚ฌ
# ์ด๋Š” 'Caffe' ๋„๊ตฌ๊ฐ€ ๋™์ž‘ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ด€๋ จ์žˆ์œผ๋ฉฐ ํ•ด๊ฒฐ์‚ฌ ์ •์˜ ํŒŒ์ผ๊ณผ๋Š” ๋ฌด๊ด€ํ•˜๋‹ค.
snapshot_prefix: "/path/to/model"
# ๊ฐ€์ค‘์น˜์—๋”ฐ๋ผ diff๋ฅผ ์Šค๋ƒ…์ƒท์œผ๋กœ ์ฐ์œผ๋ฉฐ ์ด๋Š” ํ•™์Šต์˜ ๋””๋ฒ„๊น…์— ๋„์›€์„ ์ฃผ์ง€๋งŒ ์ €์žฅ์šฉ๋Ÿ‰์ด ์ฆ๊ฐ€ํ•œ๋‹ค.
# ์ตœ์ข… ์Šค๋ƒ…์ƒท์€ ์ด ํ”Œ๋ž˜๊ทธ๊ฐ€ false๋ผ๊ณ  ์„ค์ •ํ•˜์ง€ ์•Š๋Š”ํ•œ ํ•™์Šต์˜ ๋์— ์ €์žฅ๋  ๊ฒƒ์ด๋‹ค. ๋””ํดํŠธ๊ฐ’์€๋Š” true๋‹ค.
snapshot_after_train: true
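
To resume training from a saved solver state, pass the .solverstate file to the caffe tool with the -snapshot flag, for example:

> caffe train -solver examples/mnist/lenet_solver.prototxt -snapshot examples/mnist/lenet_iter_5000.solverstate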