Batch Normalization - BD-SEARCH/MLtutorial GitHub Wiki

Batch Normalization

paper: https://arxiv.org/abs/1502.03167

repo: https://github.com/shuuki4/Batch-Normalization

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

01. What Is Batch Normalization?

  • Batch normalization.
  • It normalizes the data distribution per mini-batch, forcing the activations of each layer to be spread out appropriately.
  • A method for preventing covariate shift.

Covariate Shift

  • The phenomenon in which the input distribution of the current layer changes during training because the parameters of the preceding layers change. The cited paper calls this internal covariate shift.
  • To reduce it:
    • Whiten the input to each layer (transform the input to mean 0 and variance 1).
      • Because this is done independently of backpropagation, certain parameters can keep growing.
    • Use Batch Normalization.
      • Unlike whitening, it is not a separate process: the mean and variance adjustment is tuned together with the rest of the network during training.

02. Advantages of Batch Normalization

  • Faster training: a lower error rate can be reached in fewer epochs.
  • Less dependence on initial values: even without carefully chosen initial weights, the activations of each layer end up evenly distributed.
  • Reduced overfitting: less need for dropout and similar techniques.

03. How Batch Normalization Works

image

Batch Normalization is applied separately to each feature of the input.

image

  1. Normalize the data in each mini-batch so that it has mean 0 and variance 1. (The normalize part of the equation above.)
  2. Scale and shift the data. (The scale and shift part of the equation above.)
  • ϵ is a small constant that prevents a divide-by-zero error during normalization.
  • γ scales the data and β shifts it.
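
Since the equation image is not reproduced in this text, the transform from the cited paper (Ioffe & Szegedy, 2015) can be written out as follows. Here m is the mini-batch size and x_i is the value of one feature for the i-th sample; the labels on the last two lines match the two steps listed above.

```latex
\begin{align}
\mu_B &= \frac{1}{m}\sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} && \text{(normalize)} \\
y_i &= \gamma \hat{x}_i + \beta && \text{(scale and shift)}
\end{align}
```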

The Batch Normalization factors are initialized to γ = 1 and β = 0, then learned according to the algorithm below.
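
As a rough illustration of the forward pass with this initialization, here is a minimal NumPy sketch. The function name `batch_norm_train`, the value `eps=1e-5`, and the random input are assumptions made for the example and do not come from the linked repo. Only the forward computation is shown; in training, γ and β would be updated by backpropagation.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of x over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature (biased) mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta, mu, var    # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # 64 samples, 4 features (made-up data)
gamma = np.ones(4)    # initial scale, gamma = 1
beta = np.zeros(4)    # initial shift, beta = 0
y, mu, var = batch_norm_train(x, gamma, beta)
```

With γ = 1 and β = 0 the output is simply the normalized input, so each column of `y` has a mean close to 0 and a variance close to 1.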

image

Once training is complete, the mean and variance of the entire training set are recorded. At inference time, normalization uses these recorded values instead of the mean and variance of the mini-batch, so the output for a given input does not depend on the other samples in the batch.
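
One common way to obtain these recorded statistics is to keep a running average of the mini-batch mean and variance during training. The sketch below assumes that approach; the class name `BatchNorm1D`, the `momentum` value, and the toy data are hypothetical choices for illustration, not the method of the paper's reference implementation.

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)         # scale, initialized to 1
        self.beta = np.zeros(num_features)         # shift, initialized to 0
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum                   # assumed smoothing factor
        self.eps = eps

    def forward(self, x, training):
        if training:
            # Use mini-batch statistics and update the recorded ones.
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # Inference: use the recorded statistics, not the mini-batch ones.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=3)
for _ in range(200):
    bn.forward(rng.normal(loc=5.0, scale=2.0, size=(32, 3)), training=True)

# A single sample can be normalized at inference time,
# because no mini-batch statistics are needed.
single = bn.forward(np.full((1, 3), 5.0), training=False)
```

In training mode a batch of one sample would have zero variance, which is one practical reason the recorded statistics are used at inference.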

In a CNN, batch normalization is performed per channel in order to preserve the convolutional property, so that every spatial location of a feature map is normalized in the same way.

image

  • g(): activation function
  • BN(): Batch Normalization
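
The image itself is not reproduced in this text. For reference, the cited paper writes a batch-normalized layer in the following form, where W is the weight matrix (or convolution kernel) and u is the layer input; the bias term of the original layer z = g(Wu + b) is dropped because β takes over its role. Whether the missing image shows exactly this expression is an assumption.

```latex
z = g\bigl(\mathrm{BN}(Wu)\bigr)
```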

In other words, with n channels, n independent Batch Normalizations are performed.
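
A minimal NumPy sketch of per-channel normalization follows. It assumes an (N, C, H, W) tensor layout; the function name `batch_norm_2d` and the input shape are made up for the example. Each channel gets one mean, one variance, one γ, and one β, shared over the batch and all spatial positions.

```python
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """x has shape (N, C, H, W); statistics are shared over N, H and W for each channel."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalize each channel
    # One gamma and one beta per channel, broadcast over N, H and W.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=(8, 3, 5, 5))   # 8 images, 3 channels, 5x5
y = batch_norm_2d(x, gamma=np.ones(3), beta=np.zeros(3))
```

Here the 3 channels correspond to 3 independent normalizations, each computed over 8 × 5 × 5 = 200 values.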

04. Reference