Distributed Learning - leemik3/tensorflow-2.0 GitHub Wiki

Distributed processing for deep learning [by work partitioning scheme]

  • Data Parallelism
      ◦ The gradients the model produces on each worker's data are transferred and averaged (via a Parameter Server or AllReduce), and the model parameters are updated with that average.
      ◦ GPU memory limitation: probably because the model itself is left whole, so every GPU still carries all of the model's computation.
      ◦ Each worker: different data, same model
      ◦ Papers:

  • Model Parallelism
      ◦ Partial activations are passed on to the next layer, on the next worker.
      ◦ GPU memory is fine: the model is split, so the computation inside each part shrinks.
      ◦ GPU utilization is low: does this mean each GPU is used less because the computation per part shrinks? Plus time spent waiting for other GPUs' computation, like bubble time.
      ◦ Each worker: same data, different part of the model
      ◦ Papers
          ① Multi-GPU training of ConvNets: the filters are distributed across several computing nodes so that the convolutions can run concurrently.
          ② DistBelief: general-purpose computing environment

  • Hybrid Parallelism
      ◦ Data Parallelism + Model Parallelism
      ◦ Paper: Large scale distributed deep networks
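As a minimal sketch of the data-parallel update noted above (pure Python, no framework): each worker holds the same weight `w` of a toy model `y = w * x`, computes a gradient on its own shard, and the averaged gradient is applied once. The model, the data, and the names `grad_mse` and `data_parallel_step` are illustrative assumptions, not taken from any of the cited papers.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the toy model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def data_parallel_step(w, shards, lr):
    """One step: same model on every worker, different data per worker."""
    # Each worker computes a local gradient on its own shard.
    local_grads = [grad_mse(w, xs, ys) for xs, ys in shards]
    # Parameter server / all-reduce role: average the local gradients.
    avg_grad = sum(local_grads) / len(local_grads)
    return w - lr * avg_grad, avg_grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                      # generated by y = 2x
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]  # two equal-sized shards

w0 = 0.0
w1, avg_grad = data_parallel_step(w0, shards, lr=0.01)
full_grad = grad_mse(w0, xs, ys)               # single-worker reference
```

With equal-sized shards, `avg_grad` matches `full_grad`, so the step is the same one a single worker would take on the whole batch.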

[Aside]

  • Pipelining
      ◦ data parallelism
      ◦ model parallelism: a kind of model parallelism that distributes the model layer by layer (the same as layer-wise model parallelism?)
          ① Performance analysis of a pipelined backpropagation parallel algorithm: the DNN model parameters are distributed, so memory is used efficiently, which suits large-scale DNN training. However, by the time the error is backpropagated to a computing node in the backward pass, that node's forward pass values and weights have already been changed by other mini-batches. The node then uses the wrong forward pass values and weights, which is the delayed gradient problem.
          ② Decoupled parallel backpropagation using delayed gradient (DDG): the forward pass runs sequentially, and only the backward pass is parallelized as a pipeline.
          ③ Features replay: defines a loss function on (exact gradient minus delayed gradient). Uses the delayed forward pass values from before the update rather than the updated ones.
          ④ Decoupled Neural Interface (DNI): obtains an approximate gradient from the current layer's forward pass values alone, so that its difference from the true gradient becomes small. It was applied to RNNs, but performance is limited by the approximate gradient.
          ⑤ SpecTrain: estimates future weight values using momentum and uses them in the forward pass, so that no delayed gradient arises in the backward pass. However, the predicted future weights can become less accurate as the number of computing nodes grows.
          ⑥ GPipe: runs all forward passes as a pipeline at the level of micro-batches (a mini-batch split into smaller pieces) and then runs the backward passes as a pipeline. This resolves the delayed gradient problem and gives the same result as ordinary SGD. However, because the backward passes start only after all forward passes have finished, parallel efficiency can drop depending on the mini-batch size and the number of computing nodes.
          ⑦ Analysis of parallel training algorithms for deep neural networks + PipeDream:generalized~ : to solve the delayed gradient problem, keeps as many model copies as there are delays, so that each backward pass updates the model parameters that were used in its corresponding forward pass. In this case a computing node holds models from several points in time. As a result, although this is model parallelism, it ends up behaving like data parallelism, so model parameter synchronization is needed.
            For this parameter synchronization: synchronous SGD needs communication between computing nodes, but in pipelined SGD the model parameters to be synchronized sit on the same computing node, so no communication is needed.
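To make the GPipe efficiency remark above concrete, here is a rough sketch that lays out the forward phase of a pipeline on an idealized timeline and counts idle cells (the "bubble"). It assumes every stage takes one time slot per micro-batch and that the backward phase mirrors the forward one; the function names are hypothetical.

```python
def gpipe_forward_schedule(num_stages, num_micro_batches):
    """Return {time_slot: [(stage, micro_batch), ...]} for the forward phase."""
    schedule = {}
    for m in range(num_micro_batches):
        for s in range(num_stages):
            # Stage s can start micro-batch m only after stage s-1 has
            # finished it and after stage s itself has finished m-1.
            t = m + s
            schedule.setdefault(t, []).append((s, m))
    return schedule


def bubble_fraction(num_stages, num_micro_batches):
    """Fraction of (stage, time slot) cells that sit idle in one phase."""
    schedule = gpipe_forward_schedule(num_stages, num_micro_batches)
    total_slots = len(schedule)                            # M + K - 1
    busy = sum(len(cells) for cells in schedule.values())  # K * M
    return 1.0 - busy / (num_stages * total_slots)


for m in (1, 4, 32):
    print(m, round(bubble_fraction(4, m), 3))
```

Under these assumptions the idle fraction works out to (K - 1) / (M + K - 1) for K stages and M micro-batches. With 4 stages it is 0.75 for a single micro-batch and falls below 0.1 at 32 micro-batches, which is why efficiency depends on both the mini-batch size and the number of nodes.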

[By parameter synchronization scheme]

  • Synchronous replication (synchronous SGD)
      ◦ At a fixed point, the parameters are updated with the average of the gradients from all computing nodes, once each has finished its share of the distributed work.
      ◦ Converges quickly.
      ◦ Papers
          ▪ Data Parallelism
              ① SimuParallelSGD: each computing node runs SGD independently, and the master model parameters are produced only once, at the end.
              ② Bulk-Synchronous Parallel (BSP): after every mini-batch, the gradients of the computing nodes are averaged and the master model is updated.
              ③ Experiments on parallel training of deep neural network using model averaging: every x mini-batches, the gradients of the computing nodes are averaged and the master model is updated.
              ④ Elastic averaging (EASGD): each local model's parameters are kept separately, but how far they may drift from the master model parameters is controlled.
              ⑤ Block-wise Model-Update Filtering (BMUF): adds a momentum concept? Treats the difference between the previous master model parameters and the average of the current local model parameters as the change in the master model parameters, then adds momentum when updating the master model parameters.
              ⑥ Sandblaster L-BFGS: uses the gradients received from each computing node to update the parameters of the distributed master model with the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm. (Does this mean the algorithm is what combines the gradients of the computing nodes?)
              ⑦ Sync-SGD: once k computing nodes have finished computing their gradients, synchronization proceeds without waiting for the remaining nodes. (But isn't this asynchronous?)
          ▪ Model Parallelism
              ① S
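A minimal sketch of the synchronous update above, in pure Python with made-up finish times and gradients (all names and numbers are illustrative assumptions). With `k` equal to the number of workers it behaves like BSP and waits for everyone. With a smaller `k` it averages only the first `k` gradients to arrive and discards the rest. In this sketch every gradient that is used was computed from the same `w`, which is one way to read why such a scheme is still listed under synchronous SGD.

```python
def sync_sgd_step(w, worker_results, k, lr):
    """Average the k fastest gradients, all computed at the same w.

    worker_results is a list of (finish_time, gradient) pairs.
    Gradients from the remaining workers are dropped, not applied later.
    """
    fastest = sorted(worker_results, key=lambda r: r[0])[:k]
    avg_grad = sum(g for _, g in fastest) / k
    step_time = fastest[-1][0]   # the step ends when the k-th gradient arrives
    return w - lr * avg_grad, step_time


# Hypothetical step in which the last worker is a straggler.
results = [(1.0, 0.9), (1.2, 1.1), (1.1, 1.0), (9.0, 1.3)]

w_all, t_all = sync_sgd_step(0.0, results, k=4, lr=0.1)  # wait for everyone
w_k3, t_k3 = sync_sgd_step(0.0, results, k=3, lr=0.1)    # drop the slowest
```

Waiting for all four workers makes the step as slow as the straggler (`t_all` is 9.0), while `k=3` finishes at 1.2 at the cost of ignoring one gradient.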

  • Stale Synchronous / Bounded Asynchronous replication (stale-synchronous SGD)
  • Asynchronous replication (Asynchronous SGD)
      ◦ Whichever gradient finishes first updates the parameters first, without waiting for the rest of the distributed work. Widely used because the synchronization cost is low and there is no need to wait for slow or failed computing nodes.
      ◦ The more workers there are, the more efficient it is compared with synchronous.
      ◦ The gradient that arrives first is applied to the master model parameters. A gradient that arrives late was computed from the master model parameters as they were before that update, so applying it to the already-updated master model parameters can cause problems. Such a late gradient is called a stale gradient, and the weights it was computed from (the master model parameters before the update) are called stale weights.
      ◦ Convergence can be slow.
      ◦ Papers
          ▪ Data Parallelism
              ① Hogwild
              ② HogBatch: mini-batch
              ③ AsySVRG: reduces the variance of SGD to speed up training.
              ④ Downpour SGD: the master model parameters are stored split across several parameter servers (this can parallelize better than a single parameter server). Because it builds on DistBelief, one computing node may actually consist of several CPU nodes (Large scale distributed deep networks). Besides stale weights, it can also produce inconsistent master model parameters.
              ⑤ Asynchronous decentralized parallelized SGD (AD-PSGD): there is no single parameter server managing the master model; the computing nodes are connected in a graph structure. Each computing node synchronizes by combining its parameters with the local model parameters of a randomly chosen neighboring node. Because many computing nodes no longer have to talk to one parameter server 'at the same time', the communication bottleneck is reduced. When training ends, the local model parameters of all computing nodes are averaged to obtain the master model parameters.
              ⑥ Asynchrony begets momentum, with an application to deep learning: about stale weights. Treats gradients derived from past master model parameters (before the update) as a kind of momentum and optimizes the momentum weight.
              ⑦ Asynchronous stochastic gradient descent with delay compensation: about stale weights. Corrects the stale gradient through a Taylor expansion to predict the gradient at the current master model parameters.
              ⑧ Efficient and robust parallel ~ : seems to predict future gradients?
          ▪ Model Parallelism
              ① S
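The stale gradient situation described above can be sketched with a toy parameter server that tracks a version counter. Everything here is an illustrative assumption: a single scalar weight, the loss f(w) = 0.5 * w**2, and the class and function names. The `compensate` helper shows one Taylor-style correction in which the curvature is approximated by the squared gradient scaled by a hypothetical factor `lam`; it illustrates the idea of delay compensation and is not a faithful reproduction of any paper's algorithm.

```python
class ParameterServer:
    """Toy asynchronous parameter server holding one scalar weight."""

    def __init__(self, w, lr):
        self.w = w
        self.lr = lr
        self.version = 0

    def pull(self):
        # A worker receives the current weight and its version number.
        return self.w, self.version

    def push(self, grad, pulled_version):
        # Staleness = number of updates applied since the worker pulled.
        staleness = self.version - pulled_version
        self.w -= self.lr * grad
        self.version += 1
        return staleness


def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * w**2."""
    return w


def compensate(stale_grad, w_now, w_stale, lam):
    """First-order correction of a stale gradient toward w_now (sketch)."""
    return stale_grad + lam * stale_grad * stale_grad * (w_now - w_stale)


ps = ParameterServer(w=1.0, lr=0.5)
w_fast, v_fast = ps.pull()                 # both workers pull version 0
w_slow, v_slow = ps.pull()

s_fast = ps.push(grad(w_fast), v_fast)     # arrives first: fresh gradient
g_comp = compensate(grad(w_slow), ps.w, w_slow, lam=1.0)
s_slow = ps.push(grad(w_slow), v_slow)     # arrives late: stale gradient
```

The slow worker's gradient was computed at `w = 1.0` but is applied when the server already holds `0.5`, so its staleness is 1 and the update overshoots to `0.0`. In this toy case the compensated value `g_comp` happens to equal the true gradient at the current weight, but only because the squared gradient matches the curvature here.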

  • Hybrid-Synchronous SGD
      ◦ Papers
          ▪ Revisiting distributed synchronous SGD: seems to lay out the drawbacks of the synchronous and asynchronous approaches

  • (Model Averaging): paper: Deep learning with elastic averaging SGD? EWMA?

  • (Ensemble Learning)
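For the elastic averaging idea noted above (local parameters kept separately, with their distance from the master parameters controlled), a rough sketch of one synchronous step on scalar weights might look like the following. The function name `easgd_step`, the parameter names `lr` and `rho`, and the numbers are assumptions made for illustration.

```python
def easgd_step(local_ws, center_w, grads, lr, rho):
    """One synchronous elastic-averaging step for scalar weights."""
    alpha = lr * rho
    # Each worker follows its own gradient but is pulled toward the center.
    new_locals = [w - lr * g - alpha * (w - center_w)
                  for w, g in zip(local_ws, grads)]
    # The center (master) moves toward the workers' previous positions.
    new_center = center_w + alpha * sum(w - center_w for w in local_ws)
    return new_locals, new_center


# Zero gradients isolate the elastic pull: workers at 1.0 and 3.0,
# center at 2.0.
locals_, center = easgd_step([1.0, 3.0], 2.0, grads=[0.0, 0.0],
                             lr=0.1, rho=1.0)
```

With zero gradients the two workers move from 1.0 and 3.0 to about 1.1 and 2.9, while the center stays at 2.0 because the two pulls cancel. A larger `rho` keeps workers closer to the master; a smaller one lets them explore further.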

[By gradient aggregation scheme]

  • All-Reduce (Parameter Server)
      ◦ The parameter server collects all gradients and redistributes the result to the workers.
      ◦ With many workers, the parameter server's memory usage and network load increase.
  • Ring-AllReduce
      ◦ All GPUs are arranged in a ring, and the gradients are shared by passing them around it.
      ◦ Not well understood yet. Paper?
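Since the ring variant is the less obvious of the two, here is a small single-process simulation of how a ring all-reduce can sum gradients without a central server. It is a sketch under simplifying assumptions: each worker's gradient is split into as many chunks as there are workers, each chunk is a single float, and "sending" is just a list update. The function name `ring_allreduce` is hypothetical.

```python
def ring_allreduce(worker_grads):
    """Sum gradients across workers with a simulated ring all-reduce.

    worker_grads[i] is worker i's gradient, split into n chunks
    (one float per chunk here), where n is the number of workers.
    Worker i only ever sends to its ring neighbor (i + 1) % n.
    """
    n = len(worker_grads)
    data = [list(g) for g in worker_grads]   # each worker's local buffer

    # Phase 1, reduce-scatter: after n - 1 steps, worker i holds the
    # complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # All sends in a step happen at once, so snapshot them first.
        sends = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] += value

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, data[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, (chunk, value) in enumerate(sends):
            data[(i + 1) % n][chunk] = value

    return data


grads = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 30.0],
         [100.0, 200.0, 300.0]]
reduced = ring_allreduce(grads)
```

After the call every worker holds `[111.0, 222.0, 333.0]`; dividing by the number of workers would give the average. Each worker sends 2 * (n - 1) chunks of size 1/n of the gradient, so the traffic per worker stays roughly constant as workers are added, in contrast to a single parameter server whose load grows with the worker count.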

Training Agent, Computing Node, Server, Worker