Loss Function - BD-SEARCH/MLtutorial GitHub Wiki

1. Loss Function

  • ๋ชจ๋ธ์—์„œ ์ƒ์„ฑ๋œ ๊ฐ’๊ณผ ์‹ค์ œ ๋ฐ์ดํ„ฐ์˜ ๊ฐ’์ด ์ฐจ์ด๋‚˜๋Š” ์ •๋„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ•จ์ˆ˜
  • loss function์˜ ๊ฐ’์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ ํ•™์Šต
  • neural network์—์„œ์˜ ์ตœ์ ํ™” : output๊ณผ label ์ฐจ์ด๋ฅผ Error๋กœ ์ •์˜ํ•œ ํ›„, ์ด ๊ฐ’์„ ์ค„์ด๋„๋ก parameter๋ฅผ ๋ฐ”๊พธ์–ด ๋‚˜๊ฐ€๋Š” ๊ฒƒ
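The loop described above — define an error, then nudge the parameters to reduce it — can be sketched minimally. This is an illustrative example (not from the wiki): a single weight `w` is fit so that `w * x` approximates `y` by gradient descent on the squared error.

```python
# Minimal sketch of "learning by minimizing a loss function":
# fit one weight w so that w * x approximates y.

def train(xs, ys, lr=0.01, steps=200):
    w = 0.0
    for _ in range(steps):
        # loss = (1/n) * sum((w*x - y)^2)
        # d(loss)/dw = (2/n) * sum((w*x - y) * x)
        n = len(xs)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # step against the gradient to reduce the loss
    return w

w = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # true relation: y = 2x
print(w)  # converges toward 2.0
```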

2. ์ข…๋ฅ˜

2-1. Linear regression (MSE, Mean Square Error)

(1) ์ •์˜

image

(2) ํŠน์ง•

  • ์ฃผ๋กœ regression ๋ฌธ์ œ์— ์‚ฌ์šฉํ•œ๋‹ค.

(3) regression์— MSE๋ฅผ ์“ฐ๋Š” ์ด์œ 

Classification ๊ณผ ๊ฐ™์€ ๊ฒฝ์šฐ, ๋งž๋‹ค/์•„๋‹ˆ๋‹ค๊ฐ€ ํŒ๋ณ„์ด ๊ฐ€๋Šฅํ•˜์ง€๋งŒ, ์ฃผ์‹ ๊ฐ€๊ฒฉ ์˜ˆ์ธก๊ณผ ๊ฐ™์€ ์ˆ˜์น˜ ํŒ๋‹จ์€ ์• ๋งคํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค.

  • ex. ์ฃผ์‹

    • GT: 100,000์›, output: 95,000์›
    • output์ด GT์™€ ๋™์ผํ•˜์ง„ ์•Š์ง€๋งŒ, ์ด ๋ชจ๋ธ์ด ์–ผ๋งˆ๋‚˜ ์ž˜ ํŒ๋‹จํ•œ ๊ฒƒ์ธ์ง€ ์• ๋งคํ•˜๋‹ค.
    • ๋”ฐ๋ผ์„œ ์‹ค์ œ ๊ฐ’๊ณผ ์˜ˆ์ธก๊ฐ’์˜ ์ฐจ์ด๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์˜ค์ฐจ๋ฅผ ํŒ๋‹จํ•ด์•ผ ํ•œ๋‹ค.
  • MSE ๊ฐ’์ด ์ž‘์€ ๋ฐ”๋žŒ์งํ•œ ์ถ”์ •๋Ÿ‰์ด๋ž€, ๋ถˆํŽธ์„ฑ๊ณผ ํšจ์œจ์„ฑ์„ ๋งŒ์กฑํ•˜๋Š” ๊ฐ’์„ ์˜๋ฏธํ•œ๋‹ค.

    • ๋ถˆํŽธ์„ฑ(unbiasedness): ์ถ”์ •๋Ÿ‰์˜ ํ‰๊ท ์ด ๊ฐ€๋Šฅํ•œ ํ•œ ๋ชจ์ˆ˜์˜ ํ‰๊ท ์— ๊ทผ์ ‘
    • ํšจ์œจ์„ฑ(efficiency): ์ถ”์ •๋Ÿ‰์˜ ๋ถ„์‚ฐ์ด ๋™์‹œ์— ์ž‘์•„์•ผ ํ•จ
  • ์ฆ‰ MSE๋Š” ์ •๋‹ต์…‹๊ณผ์˜ ํ‰๊ท ์ ์ธ ์ฐจ์ด ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๊ฐ๊ฐ์˜ ์ถœ๋ ฅ๊ฐ’์ด ์ •๋‹ต๊ณผ ์–ผ๋งˆ๋‚˜ ์ฐจ์ด๊ฐ€ ํฌ๊ฒŒ ๋‚˜๋Š”์ง€๋„ ๋ฐ˜์˜ํ•œ๋‹ค.

2-2. Cross-Entropy Error (CEE)

(1) ์ •์˜

์ฃผ์–ด์ง„ ํ™•๋ฅ  ๋ณ€์ˆ˜ X์— ๋Œ€ํ•ด, ํ™•๋ฅ  ๋ถ„ํฌ p๋ฅผ ์ฐพ์•„๋ณด์ž. ํ™•๋ฅ  ๋ถ„ํฌ p๋ฅผ ์•Œ ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์— p๋ฅผ ์˜ˆ์ธกํ•œ ๊ทผ์‚ฌ ๋ถ„ํฌ q๋ฅผ ์ƒ๊ฐํ•œ๋‹ค. ์ •ํ™•ํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด q์˜ parameter๋“ค์„ updateํ•˜๋ฉด์„œ q๋ฅผ p์— ๊ทผ์‚ฌํ•  ๊ฒƒ์ด๋‹ค. ์ฆ‰ ๋‘ ๋ถ„ํฌ์˜ ์ฐจ์ด๋ฅผ ์ธก์ •ํ•˜๋Š” KL(p|q)๊ฐ€ ์ตœ์†Œ๊ฐ€ ๋˜๋Š” q๋ฅผ ์ฐพ๋Š” ๋ฌธ์ œ๊ฐ€ ๋œ๋‹ค.

๋‘๋ฒˆ์งธ ํ•ญ์€ ๊ทผ์‚ฌ ๋ถ„ํฌ q์— ๋ฌด๊ด€ํ•œ ํ•ญ์ด๋ฏ€๋กœ KL Divergence๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์€ ๊ฒฐ๊ตญ ์ฒซ๋ฒˆ์งธ ํ•ญ์ด๋‹ค. ๊ทธ๋Ÿฌ๋ฏ€๋กœ ์ฒซ๋ฒˆ์งธ ํ•ญ์„ ์ตœ์†Œํ™”ํ•˜๋Š” q๋ฅผ ์ฐพ์•„์•ผ ํ•œ๋‹ค.

  • p_i: ์‹ค์ œ ํ™•๋ฅ  ๋ถ„ํฌ
  • q_i: p๋ฅผ ๊ทผ์‚ฌํ•œ ๋ถ„ํฌ
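The relation between cross-entropy, entropy, and KL divergence can be checked numerically. This is a hypothetical sketch: since H(p) does not depend on q, minimizing KL(p‖q) over q is the same as minimizing the cross-entropy H(p, q).

```python
import math

# Cross-entropy H(p, q), entropy H(p), and KL(p || q) = H(p, q) - H(p).

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)

p = [0.0, 0.0, 1.0]         # one-hot "true" distribution
q = [0.1, 0.2, 0.7]         # model's approximate distribution
print(kl_divergence(p, q))  # equals -ln(0.7) here, since H(p) = 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0 when q matches p
```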

(2) ํŠน์ง•

  • classification(๋ถ„๋ฅ˜๋ฌธ์ œ)์—๋Š” ACE(Average cross-entropy)๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

(3) classification์— ACE(Average cross-entropy)๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ 

Model X, Y๊ฐ€ ์žˆ๊ณ , class๋Š” A,B,C 3๊ฐœ๊ฐ€ ์žˆ๋‹ค.

Model X์˜ output

output label A B C correct?
0.3 0.3 0.4 0 0 1 Y
0.3 0.4 0.3 0 1 0 Y
0.1 0.2 0.7 1 0 0 N
  • 1,2๋Š” ๊ฒจ์šฐ ๋งž์ท„๊ณ  3์€ ์™„์ „ํžˆ ํ‹€๋ ธ๋‹ค.

Model Y์˜ output

output label A B C correct?
0.1 0.2 0.7 0 0 1 Y
0.1 0.7 0.2 0 1 0 Y
0.3 0.4 0.3 1 0 0 N
  • 1,2๋Š” ํ™•์‹คํžˆ ๋งž์ท„์œผ๋‚˜ 3์€ ์•„์‰ฝ๊ฒŒ ํ‹€๋ ธ๋‹ค.
[๋‹จ์ˆœ ๋ถ„๋ฅ˜ ์˜ค์ฐจ]
* model X : 1/3 = 0.33
* model Y : 1/3 = 0.33
[๋ถ„๋ฅ˜ ์ •ํ™•๋„]
* model X : 2/3 = 0.67
* model Y : 2/3 = 0.67
  • ๋‹จ์ˆœ ๋ถ„๋ฅ˜ ์˜ค์ฐจ ๊ณ„์‚ฐ์€ ํ‹€๋ฆฐ ๊ฐœ์ˆ˜์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๋งŒ ์žˆ์„ ๋ฟ, label๊ณผ ๋น„๊ตํ•˜์—ฌ ์–ผ๋งˆ๋‚˜ ๋งŽ์ด ํ‹€๋ ธ๋Š” ์ง€๋Š” ์ œ๊ณตํ•˜์ง€ ์•Š๋Š”๋‹ค.

cross entropy๋กœ ๊ณ„์‚ฐํ•  ๊ฒฝ์šฐ

model X
* ์ฒซ๋ฒˆ์งธ sample : -( (ln(0.3)*0) + (ln(0.3)*0) + (ln(0.4)*1) ) = -ln(0.4)
* 3๊ฐœ sample ๋ชจ๋‘์— ๋Œ€ํ•œ ๊ณ„์‚ฐ ๋ฐ ACE (Average cross-entropy)
  * -(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38
model Y
* 3๊ฐœ sample ๋ชจ๋‘์— ๋Œ€ํ•œ ๊ณ„์‚ฐ ๋ฐ ACE (Average cross-entropy)
  * -(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64

Model X๋ณด๋‹ค Y๊ฐ€ ์˜ค์ฐจ๊ฐ€ ๋” ์ž‘๋‹ค. ์ฆ‰ ์–ด๋–ค model์ด ๋” ์ž˜ ํ•™์Šต ๋˜์—ˆ๋Š” ์ง€๋ฅผ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

MSE๋กœ ๊ณ„์‚ฐํ•  ๊ฒฝ์šฐ

model X
* ์ฒซ๋ฒˆ์งธ sample : (0.3 - 0)^2 + (0.3 - 0)^2 + (0.4 - 1)^2 = 0.09 + 0.09 + 0.36 = 0.54
* 3๊ฐœ sample ๋ชจ๋‘์— ๋Œ€ํ•œ ๊ณ„์‚ฐ ๋ฐ MSE(Mean squared error)
  * (0.54 + 0.54 + 1.34) / 3 = 0.81
model Y
* (0.14 + 0.14 + 0.74) / 3 = 0.34

MSE๋Š” ํ‹€๋ฆฐ sample์— ๋Œ€ํ•ด ๋” ์ง‘์ค‘ํ•œ๋‹ค. ๋งž์€ ๊ฒƒ๊ณผ ํ‹€๋ฆฐ ๊ฒƒ ๋ชจ๋‘์— ๋˜‘๊ฐ™์ด focusํ•ด์•ผ ํ•˜๋Š”๋ฐ ์—ฌ๊ธฐ์„œ๋Š” ๊ทธ๋ ‡์ง€ ์•Š๋‹ค.

ACE์™€ MSE ๋น„๊ต (activation์„ softmax๋กœ ํ–ˆ์„ ๊ฒฝ์šฐ)

  • background : backpropagation ์ค‘์— label์— ๋”ฐ๋ผ output์„ 1.0 ๋˜๋Š” 0.0์œผ๋กœ ์„ค์ •ํ•˜๋ ค๊ณ  ํ•œ๋‹ค.
  • MSE๋ฅผ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ
    • ๊ฐ€์ค‘์น˜ ๊ณ„์‚ฐ์—์„œ ๊ธฐ์šธ๊ธฐ ๊ฐ’์— (output) * (1 - output)์ด๋ผ๋Š” ์กฐ์ • ์š”์†Œ๊ฐ€ ํฌํ•จ๋œ๋‹ค.
    • ๊ณ„์‚ฐ ๋œ ์ถœ๋ ฅ์ด 0.0 ๋˜๋Š” 1.0์— ๊ฐ€๊น๊ฑฐ๋‚˜ ๊ฐ€๊นŒ์›Œ์ง์— ๋”ฐ๋ผ (output) * (1 - output)์˜ ๊ฐ’์€ ์ ์  ์ž‘์•„์ง„๋‹ค.
      • ex) output = 0.6์ด๋ผ๋ฉด (output) * (1 - output) = 0.24์ด์ง€๋งŒ ์ถœ๋ ฅ์ด 0.95์ด๋ฉด (output) * (1 - output) = 0.0475์ด๋‹ค.
    • ๊ทธ๋ ‡๊ฒŒ ๋˜๋ฉด ์กฐ์ • ์š”์†Œ๊ฐ€ ์ ์  ์ž‘์•„์ง€๋ฉด์„œ ๊ฐ€์ค‘์น˜ ๋ณ€ํ™”๋„ ์ ์  ์ž‘์•„์ง€๊ณ  ํ•™์Šต ์ง„ํ–‰์ด ๋ฉˆ์ถœ ์ˆ˜๋„ ์žˆ๋‹ค.
  • ACE๋ฅผ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ
    • (output) * (1 - output) ํ•ญ์ด ์‚ฌ๋ผ์ง„๋‹ค.
    • ๋”ฐ๋ผ์„œ ๊ฐ€์ค‘์น˜ ๋ณ€ํ™”๋Š” ์ ์  ์ž‘์•„์ง€์ง€ ์•Š์œผ๋ฏ€๋กœ ํ•™์Šต์ด ๋ฉˆ์ถ”๋Š” ์ผ์ด ๋ฐœ์ƒํ•˜์ง€ ์•Š๋Š”๋‹ค.

2-3. Logistic regression (binary cross-entropy)

2-4. Hinge Loss

A loss function used for maximum-margin classification, e.g. in SVMs.
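A minimal hinge-loss sketch (illustrative, not from the wiki): for a label y in {-1, +1} and a raw classifier score s, the loss is max(0, 1 - y * s), which is zero only when the sample is classified correctly with a margin of at least 1.

```python
# Hinge loss for maximum-margin classification.
def hinge(score, y):
    return max(0.0, 1.0 - y * score)

print(hinge(2.0, +1))   # 0.0  (correct, outside the margin)
print(hinge(0.5, +1))   # 0.5  (correct, but inside the margin)
print(hinge(-1.0, +1))  # 2.0  (misclassified)
```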

2-5. Ranking Loss
