L1, L2 regularization
It is helpful to read the regularization page before reading this article.
1. L1, L2 norm
1-1. Norm?
A norm is a way of measuring the size (length) of a vector; applied to a difference of vectors, it measures the distance between them. The general p-norm of a vector x with n elements is

$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$

- p : the order of the norm
- p = 1 : L1 Norm
- p = 2 : L2 Norm
- n : the number of elements in the vector
(1) L1 norm
- The sum of the absolute differences between the corresponding elements of vectors p and q: $\|p - q\|_1 = \sum_{i=1}^{n} |p_i - q_i|$
- Ex. vector p = (3, 1, -3), q = (5, 0, 7)
- L1 norm of p and q : |3-5| + |1-0| + |-3-7| = 2 + 1 + 10 = 13
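A quick check of this example with NumPy (a minimal sketch; the array names are illustrative):

```python
import numpy as np

p = np.array([3, 1, -3])
q = np.array([5, 0, 7])

# L1 norm of the difference: sum of absolute element-wise differences
print(np.sum(np.abs(p - q)))          # |3-5| + |1-0| + |-3-7| = 13

# Equivalent built-in: order-1 norm of (p - q)
print(np.linalg.norm(p - q, ord=1))   # 13.0
```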
(2) L2 norm
- The Euclidean (straight-line) distance between vectors p and q: $\|p - q\|_2 = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
- When q is the origin, the L2 norm of p and q is simply the straight-line distance from p to the origin.
- p = (x_1, x_2, ... , x_n), q = (0, 0, ... , 0)
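The same vectors can be used to check the L2 norm (again a minimal sketch with illustrative names):

```python
import numpy as np

p = np.array([3, 1, -3])
q = np.array([5, 0, 7])

# L2 norm of the difference: Euclidean (straight-line) distance
print(np.sqrt(np.sum((p - q) ** 2)))  # sqrt(4 + 1 + 100) ≈ 10.247

# With q at the origin, the L2 norm is just the length of p itself
print(np.linalg.norm(p, ord=2))       # sqrt(9 + 1 + 9) ≈ 4.359
```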
(3) Difference between L1 and L2 norm
- L1 Norm : between two points, many different grid paths (e.g., the red, blue, and yellow paths in the usual taxicab-geometry figure) all have the same L1 length.
- L2 Norm : the distance is realized only by the single straight line (the green path).
- In other words, the L1 norm admits several shortest paths, while the L2 norm has a unique shortest path.
- Ex. For p = (1, 0), q = (0, 0), L1 Norm = 1 and L2 Norm = 1; the two values coincide, but the L2 distance is still realized by a unique shortest path.
1-2. L1, L2 Loss
(1) L1 Loss
- $y_i$ : label
- $f(x_i)$ : output
- L1 Loss : the sum of the absolute errors between the labels and the outputs: $L = \sum_{i=1}^{n} |y_i - f(x_i)|$
- Also known as:
  - Least absolute deviations (LAD)
  - Least absolute errors (LAE)
  - Least absolute value (LAV)
  - Least absolute residual (LAR)
  - Sum of absolute deviations
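A minimal sketch of the L1 loss in NumPy (the labels and outputs here are made-up values):

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0])   # labels y_i
y_hat = np.array([0.8, 2.5, 2.9])   # model outputs f(x_i)

# L1 loss: sum of absolute errors between label and output
print(np.sum(np.abs(y - y_hat)))    # 0.2 + 0.5 + 0.1 = 0.8
```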
(2) L2 Loss
- L2 Loss : the sum of the squared errors: $L = \sum_{i=1}^{n} (y_i - f(x_i))^2$
- Also known as : least squares error (LSE)
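And the corresponding L2 loss for the same made-up values:

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0])   # labels y_i
y_hat = np.array([0.8, 2.5, 2.9])   # model outputs f(x_i)

# L2 loss: sum of squared errors
print(np.sum((y - y_hat) ** 2))     # 0.04 + 0.25 + 0.01 ≈ 0.3
```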
(3) Difference between L1 Loss and L2 Loss
- Drawback of L1 Loss : it is not differentiable at 0.
- L2 Loss squares the error, so intuitively it is affected far more strongly by outliers.
- L1 Loss is therefore more robust to outliers than L2 Loss.
- L1 Loss : use when outliers should be largely ignored (see the comparison below).
- L2 Loss : use when outliers must be taken into account.
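A small demonstration of this robustness difference, using made-up predictions where one value is wildly wrong:

```python
import numpy as np

y       = np.array([1.0, 2.0, 3.0, 4.0])
clean   = np.array([1.1, 1.9, 3.2, 3.8])
corrupt = np.array([1.1, 1.9, 3.2, 40.0])   # one outlier prediction

for name, pred in [("clean", clean), ("with outlier", corrupt)]:
    l1 = np.sum(np.abs(y - pred))
    l2 = np.sum((y - pred) ** 2)
    print(f"{name:>12}: L1 = {l1:7.2f}, L2 = {l2:8.2f}")

# The single outlier adds 36 to the L1 loss but 36**2 = 1296 to the
# L2 loss -- the squared penalty is dominated by the outlier.
```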
2. L1, L2 regularization
2-1. L1 regularization (Lasso)
(1) Formula

$C = C_0 + \frac{\lambda}{n} \sum_{w} |w|$

- Depending on the experimental setup, the constant (1/n or 1/2) may differ.
- λ : a constant; the closer it is to 0, the weaker the regularization effect.
- C_0 : the original cost function
(2) Characteristics
- The key point is that the absolute values of the weights are added to the cost function.
- Taking the partial derivative with respect to a weight w gives $\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w)$.
  - Regularization is performed not by scaling w itself but by subtracting a constant whose sign follows the sign of w.
- Because the weight magnitudes are included in the cost function, training moves toward weights that are not too large.
- Regression model that uses L1 regularization: Least Absolute Shrinkage and Selection Operator (Lasso) regression (a sketch of the update follows below).
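A minimal sketch of one L1-regularized gradient step under the formula above (the function and variable names are illustrative, not from the original):

```python
import numpy as np

def l1_step(w, grad_c0, lam=0.01, lr=0.1, n=100):
    """One gradient step on C = C_0 + (lam / n) * sum(|w|).

    The penalty's (sub)gradient is (lam / n) * sign(w): a constant-size
    term is subtracted, with direction given by the sign of each weight.
    """
    return w - lr * (grad_c0 + (lam / n) * np.sign(w))

w    = np.array([0.5, -0.3, 0.0])
grad = np.array([0.1, 0.1, 0.1])   # pretend gradient of C_0
print(l1_step(w, grad))
```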
2-2. L2 regularization (Ridge)
(1) Formula

$C = C_0 + \frac{\lambda}{2n} \sum_{w} w^2$

- Depending on the experimental setup, the constant (1/n or 1/2) may differ.
- C_0 : the original cost function
- n : the number of training examples
- λ : the regularization constant; the closer it is to 0, the weaker the regularization effect
- w : the weights
(2) Characteristics
- The weights are added to C_0 (the original cost function), so training moves
  - in the direction that makes C_0 smaller, and
  - in the direction that makes w smaller.
- Taking the partial derivative with respect to w shows that each update also shrinks w by a constant factor: this is weight decay.
- Weight decay prevents any particular weight from growing abnormally large and dominating training.
- Regression model that uses L2 regularization: Ridge regression (a sketch of the update follows below).
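A matching sketch for one L2-regularized step, showing where the name "weight decay" comes from (names again illustrative):

```python
import numpy as np

def l2_step(w, grad_c0, lam=0.01, lr=0.1, n=100):
    """One gradient step on C = C_0 + (lam / (2 * n)) * sum(w ** 2).

    The penalty's gradient is (lam / n) * w, so every step first shrinks
    w by a constant factor (1 - lr * lam / n): weight decay.
    """
    return (1 - lr * lam / n) * w - lr * grad_c0

w    = np.array([0.5, -0.3, 2.0])
grad = np.array([0.1, 0.1, 0.1])   # pretend gradient of C_0
print(l2_step(w, grad))
```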
2-3. Difference between L1 and L2 regularization
(1) Regularization
- What does training the weights w to be small mean? It means the model is less affected by local noise.
- It also means the model is less affected by outliers.
(2) Example
- Compute the L1 norm for two vectors a and b.
- Compute the L2 norm for the same vectors a and b.
- L1 Norm : depending on the case, two vectors can have the same value even when a particular feature (vector element) is missing (zeroed out); a numeric sketch follows below.
- L2 Norm : it always yields a unique value for each vector.
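A numeric sketch of this point with two made-up vectors (the original example's exact values are not shown, so a and b here are illustrative):

```python
import numpy as np

a = np.array([0.5, 0.5])
b = np.array([1.0, 0.0])   # second feature zeroed out

# Same L1 norm even though b is missing a feature...
print(np.linalg.norm(a, 1), np.linalg.norm(b, 1))  # 1.0  1.0

# ...but the L2 norm tells them apart
print(np.linalg.norm(a, 2), np.linalg.norm(b, 2))  # 0.707...  1.0
```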
(3) Conclusion
- Because the L1 norm can take the red path instead of the blue one (several paths have the same length), it can drive a particular feature to 0.
- The L1 norm therefore makes feature selection possible.
- The same property carries over to L1 regularization.
- L1 is well suited to sparse models (sparse coding).
- It is also usefully applied in convex optimization.
However, L1 regularization is not differentiable at 0, so gradient-based learning requires care. The Lasso-vs-Ridge comparison below illustrates the resulting sparsity.
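To see feature selection in practice, a short comparison of Lasso and Ridge from scikit-learn on synthetic data (the data and the alpha value are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
# Only the first two features actually drive the target
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 regularization
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 regularization

print("Lasso:", np.round(lasso.coef_, 3))  # most coefficients exactly 0
print("Ridge:", np.round(ridge.coef_, 3))  # small but nonzero everywhere
```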