Ensemble Methods - jaeaehkim/trading_system_beta GitHub Wiki

Motivation

  • The previous chapters (Data Structures, Labeling, Sample Weights) dealt with how to process financial raw data so that it preserves information while becoming as close to IID as possible for an ML model to learn from: first building the foundation (Data Structures), then producing a good answer sheet on top of it (Labeling), and finally handling the overlap problem that arises just before the data is fed into the model (Sample Weights). In particular, that work is about improving the quality of the bootstrap samples handed to Bagging's weak models.
  • This chapter talks about models directly, but the subject is Ensemble Methods rather than any individual model. On the modeling side, understanding ensembles matters more than understanding individual models. For a concrete example such as Random Forest, the goal is to deepen the understanding of the various sklearn parameters and thereby become more versatile in applying them. Once these fundamentals are solid, extending to other models can proceed in the right direction.

The Three Sources of Errors

Mathematical Expression of ML Model

  • Assume the data are generated by a true process $f$ plus noise: $y_i = f(x_i) + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2)$. An ML model produces an estimator $\hat f(x)$ of $f(x)$, and its expected squared error decomposes as

$$E\big[(y_i - \hat f(x_i))^2\big] = \underbrace{\big(E[\hat f(x_i)] - f(x_i)\big)^2}_{\text{bias}^2} + \underbrace{V\big[\hat f(x_i)\big]}_{\text{variance}} + \underbrace{\sigma_\varepsilon^2}_{\text{noise}}$$

    • The goal of training is to minimize the left-hand side.
    • $f(x)$: prediction of the ideal model; $\hat f(x)$: prediction of the actually fitted model; $y$: observed value.
    • $E[\hat f(x)] - f(x)$: the modeling error (bias); $V[\hat f(x)]$: the sensitivity of the model to the training data (variance).
  • ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์˜ ์˜ค๋ฅ˜์˜ ์›์ธ์€ ํฌ๊ฒŒ 3๊ฐ€์ง€์ด๋‹ค. ์˜ค๋ฅ˜์˜ ์›์ธ์„ ์ •ํ™•ํ•˜๊ฒŒ ์•ˆ๋‹ค๋Š” ๊ฒƒ์˜ ์˜๋ฏธ๋Š” ์ด๋ฅผ ์ค„์ด๋Š” ๊ฒƒ์„ ๋ชฉํ‘œํ•จ์ˆ˜๋กœ ์žก์„ ๋•Œ ์ข‹์€ ๋ชจ๋ธ์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

    • ํŽธํ–ฅ(Bias)
      • ํŽธํ–ฅ์ด๋ž€ "์‹ค์ œ ๊ฐ’๊ณผ ์ถ”์ • ๊ฐ’์˜ ์ฐจ์ด์˜ ํ‰๊ท "์„ ์˜๋ฏธํ•œ๋‹ค. (๋ฐ์ดํ„ฐ ์ง‘๋‹จ์„ ์–ด๋–ป๊ฒŒ ์ •์˜ํ•˜๋А๋ƒ์— ๋”ฐ๋ผ ๋ฐ”์ด์–ด์Šค๋ž€ ๊ฒƒ์ด ๋‹ค์–‘ํ•˜๊ฒŒ ํ‘œํ˜„๋  ์ˆ˜ ์žˆ์„ ๊ฒƒ) ํŽธํ–ฅ์ด ํฌ๋ฉด "๊ณผ์†Œ์ ํ•ฉ(Underfitting) ์ด์Šˆ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค.
    • ๋ถ„์‚ฐ(Variance)
      • ์ˆ˜์‹์„ ํ†ตํ•ด ๋ณด๋ฉด ์ด ๋ถ€๋ถ„์€ "ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ ๋ณ€ํ™”์— ๋Œ€ํ•œ ๋ฏผ๊ฐ๋„"๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ์ฆ‰, ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ ์•ˆ์ •์„ฑ์ด ML Model์„ ํ•™์Šต์‹œํ‚ค๋Š”๋ฐ ๊ต‰์žฅํžˆ ์ค‘์š”ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•จ. (์ด๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•œ ๋…ธ๋ ฅ : Data Structures, Labeling, [Sample Weights])
      • ๋ถ„์‚ฐ์ด ์ž‘์œผ๋ฉด "์ผ๋ฐ˜์ ์ธ ํŒจํ„ด"์„ ๋ชจ๋ธ๋งํ•˜์ง€๋งŒ ๋ถ„์‚ฐ์ด ํฌ๋ฉด ์‹ ํ˜ธ๋ฅผ ์žก์Œ์œผ๋กœ ์˜คํŒํ•˜๊ฒŒ ๋˜์–ด ๊ณผ์ ํ•ฉ(Overfitting) ์ด์Šˆ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค.
    • ์žก์Œ(Noise)
      • ์˜ˆ์ธกํ•˜์ง€ ๋ชปํ•œ ๋ณ€์ˆ˜๋‚˜ ์ธก์ • ์˜ค๋ฅ˜๋กœ ๋”์ด์ƒ ์ค„์ผ ์ˆ˜ ์—†๋Š” ์˜ค๋ฅ˜
      • (*) ๋”์ด์ƒ ์ค„์ผ ์ˆ˜ ์—†๋‹ค๋Š” ๊ฒƒ์ด ์œ„์˜ ์ˆ˜์‹์—์„œ ์‚ฌ์šฉ๋œ ๋ณ€์ˆ˜ x_i์™ธ์˜ ๊ฒƒ์—์„œ ๋‚˜์˜จ ๊ฒฝ์šฐ๋ฅผ ๋งํ•œ๋‹ค.
      • (*) ๋Œ€๋ถ€๋ถ„์˜ ๋ณต์žกํ•œ ๋ฌธ์ œ๋Š” ์šฐ๋ฆฌ๊ฐ€ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๋ฐ์ดํ„ฐ(๋ณ€์ˆ˜) ์™ธ์˜ ๋‹ค๋ฅธ ๋ณ€์ˆ˜๋กœ ์ธํ•ด ์‹œ์žฅ์ด ์›€์ง์ผ ๋•Œ๊ฐ€ ์žˆ๋Š”๋ฐ ์ด๋Ÿฐ ์˜ค๋ฅ˜๋Š” ํ•„์—ฐ์ ์œผ๋กœ ๋ฐœ์ƒํ•  ์ˆ˜ ๋ฐ–์— ์—†์Œ์„ ์ธ์ง€ํ•ด์•ผ ํ•œ๋‹ค.
      • (*) ์ด๋ฅผ ์œ„ํ•ด ๊ทน๋‹จ์ ์ธ ์ƒํ™ฉ์— ๋Œ€ํ•œ ๋ณด์ˆ˜์ ์ธ ๋ฆฌ์Šคํฌ ๊ด€๋ฆฌ๋Š” ํ•„์ˆ˜์ ์ธ ๊ฒƒ์ด๋‹ค.
      • (*) AI๋Š” ์ •๋Ÿ‰ํ™”๋ฅผ ํ†ตํ•œ ์ž๋™ํ™”๋ฅผ ๋„์™€์ฃผ๋Š” ๊ฒƒ์ด์ง€ ๋ชจ๋“  ๊ฒƒ์„ ํ•ด๊ฒฐํ•˜์ง€ ๋ชปํ•œ๋‹ค.
      • (*) ์–ด๋””๊นŒ์ง€ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๊ณ  ์–ด๋””๊นŒ์ง€ ํ•ด๊ฒฐํ•˜์ง€ ๋ชปํ•˜๋Š”์ง€๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” ๊ฒƒ์ด ๋งค์šฐ ์ค‘์š”ํ•˜๋‹ค.
  • Ensemble ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜๋ฉด ๋งŽ์€ Weak Model์„ ๋งŒ๋“ค๊ณ  ์ด๋ฅผ Aggregationํ•˜๋ฉด์„œ ํŽธํ–ฅ ๋˜๋Š” ๋ถ„์‚ฐ์„ ์ถ•์†Œํ•˜์—ฌ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด๊ฒŒ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด ํ•ต์‹ฌ ์›๋ฆฌ์ด๋‹ค.

Bootstrap Aggregation

  • ๋ฐฐ๊น…(Bagging)์€ Bootstrap Aggregation์˜ ์ค„์ž„๋ง์ด๋‹ค. ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ(Source)๋ฅผ Bootstrap์„ ํ†ตํ•ด์„œ ์—ฌ๋Ÿฌ ๊ฐœ์˜ Sampling ์ž๋ฃŒ๋ฅผ ๋งŒ๋“ค๊ณ  ๊ฐ Sampling ์ž๋ฃŒ๋งˆ๋‹ค Modeling์„ ์ ์šฉํ•˜์—ฌ ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ Aggregation ํ•˜์—ฌ ์ตœ์ข…์˜ ์˜ˆ์ธก ๊ฐ’์„ ์‚ฐ์ถœํ•œ๋‹ค.
  • ์œ„์˜ ๊ณผ์ •์€ ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด๋‚ด๋Š” ๊ฒƒ์— '๋ณ‘๋ ฌ์ฒ˜๋ฆฌ'๋ฅผ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ์ตœ์ข… ์˜ˆ์ธก ๊ฐ’์„ ์‚ฐ์ถœํ•  ๋•Œ๋Š” ๊ธฐ๋ณธ์ ์œผ๋ก  '๋‹จ์ˆœ ํ‰๊ท '์„ ํ™œ์šฉํ•œ๋‹ค. ๊ธฐ์ดˆ ์ถ”์ •๊ธฐ(Estimator)๊ฐ€ ์˜ˆ์ธก ํ™•๋ฅ ์„ ๊ฐ–๊ณ  ์žˆ๋Š” ์ •๋„(P>0.5)์ธ ๊ฒฝ์šฐ์—” Bagging์ด ํšจ๊ณผ๊ฐ€ ์žˆ๋‹ค.
  • Bagging์€ '์˜ˆ์ธก ์˜ค์ฐจ'์˜ ๊ตฌ์กฐ์—์„œ '๋ถ„์‚ฐ'์„ ๋‚ฎ์ถฐ์„œ ์ •ํ™•๋„๋ฅผ ๋†’์ด๊ฒŒ ํ•ด์ค€๋‹ค. ์ฆ‰, ๊ณผ์ ํ•ฉ์„ ๋ฐฉ์ง€ํ•˜๋Š”๋ฐ ๋„์›€์ด ๋œ๋‹ค.

Variance Reduction & Improved Accuracy

Let $\varphi_i(c)$ denote the prediction of the $i$-th of $N$ bagged estimators for class $c$, $\bar\sigma^2$ the average variance of a single estimator's prediction, and $\bar\rho$ the average correlation between the single estimators' predictions. The variance of the bagged (averaged) prediction is

$$V\left[\frac{1}{N}\sum_{i=1}^{N}\varphi_i(c)\right]=\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\operatorname{Cov}\left[\varphi_i(c),\varphi_j(c)\right]=\bar\sigma^2\left(\bar\rho+\frac{1-\bar\rho}{N}\right)$$

  • The first equality expands the variance of the average into the covariances between the bagged models; the second substitutes $\operatorname{Cov}[\varphi_i(c),\varphi_j(c)]=\bar\sigma^2\bar\rho$ for $i\neq j$ and $\bar\sigma^2$ for $i=j$.

๊ฒฐ๋ก 

$$\lim_{N\to\infty} V\left[\frac{1}{N}\sum_{i=1}^{N}\varphi_i(c)\right]=\bar\sigma^2\bar\rho,\qquad \bar\rho=1\ \Rightarrow\ V=\bar\sigma^2\ \text{for every}\ N$$

  • ๋‹จ์ผ ์ถ”์ •๊ธฐ(๊ธฐ์ดˆ ๋ชจ๋ธ)๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ 1๋ณด๋‹ค ์ž‘์ง€ ์•Š์œผ๋ฉด ๋ฐฐ๊น… ๋ชจ๋ธ์˜ ์˜ค์ฐจ ๊ฐœ์„ ์— ๋„์›€์ด ๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์ด ์ˆ˜์‹์ ์œผ๋กœ ์ฆ๋ช…๋œ๋‹ค.
  • ๋ฌผ๋ก  ์—ฌ๋Ÿฌ๊ฐ€์ง€ ์‹ค์ „์—์„œ์˜ ์ด์Šˆ๊ฐ€ ์žˆ๋‹ค.
    • ex1) ์ •ํ™•ํ•œ ์ƒ๊ด€๊ด€๊ณ„๋ž€ ๋ฌด์—‡์ด๊ณ  ์ด๋ฅผ ์–ด๋–ป๊ฒŒ ์ •๋Ÿ‰ํ™” ํ•  ๊ฒƒ์ธ๊ฐ€? a) ์ •ํ™•ํ•œ ๊ฒƒ์€ ์•Œ ์ˆ˜ ์—†๊ณ  ๊ฐ€์„ค์„ ์„ธ์šฐ๊ณ  ์˜๋ฏธ์žˆ๋Š” ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋Š” mathematical expression์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ตœ์„ . ๋ฐ์ดํ„ฐ์˜ ๊ธธ์ด์— ๋”ฐ๋ผ์„œ๋„ ๊ณ„์† ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ๋ณ€ํ• ํ…๋ฐ ์ด๋Ÿฐ ํ™•๋ฅ ์ ์ธ ๋ถ€๋ถ„๊นŒ์ง€๋„ ๊ณ ๋ คํ•˜๋Š” ์ƒ๊ด€๊ด€๊ณ„ modeling์„ ์‚ฌ์šฉํ•œ๋‹ค๋ฉด best

Observation Redundancy

  • 1) ๊ด€์ธก ์ค‘๋ณต ์ด์Šˆ
    • ๊ด€์ธก ์ค‘๋ณต์€ ์œ„์—์„œ ์–˜๊ธฐํ–ˆ๋“ฏ ์‚ฌ์‹ค์ƒ ๊ฐ Bootstrap sampling๋“ค์ด '์‚ฌ์‹ค์ƒ' ๋™์ผํ•ด์งˆ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์œผ๋ฉฐ ์ด๋Š” ํ‰๊ท ์ ์ธ ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ 1์— ์ˆ˜๋ ดํ•ด์ง„๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•˜๋ฉฐ ๋ชจ๋ธ์„ ๋งŽ์ด ๋งŒ๋“ค์ง€๋งŒ Bagging ๋ชจ๋ธ์˜ ๋ถ„์‚ฐ์„ ์ค„์ด์ง€ ๋ชปํ•˜์—ฌ ํšจ๊ณผ๊ฐ€ ์—†์–ด์ง์„ ์˜๋ฏธํ•œ๋‹ค.
    • ์ฐธ๊ณ ) Sample Weights
  • 2) OOS ์‹ ๋ขฐ์„ฑ ์•ฝํ™” ์ด์Šˆ
    • ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•  ๋•Œ ๋ณต์›์„ ๋™๋ฐ˜ํ•œ ๋ฌด์ž‘์œ„ ํ‘œ๋ณธ ์ถ”์ถœ๋กœ ์ธํ•ด OOB์™€ ๋งค์šฐ ํก์‚ฌํ•œ train set ํ‘œ๋ณธ๋“ค์„ ๋‹ค์ˆ˜ ์ƒ์„ฑํ•˜๊ณ  train set๊ณผ ์‚ฌ์‹ค์ƒ ์œ ์‚ฌํ•œ test set์„ ๋งŒ๋“ค์–ด ๋ฒ„๋ฆฐ๋‹ค. ํ•˜๋‚˜์˜ ๋ฐฉ๋ฒ•์œผ๋ก  k๊ฐœ์˜ ๋ธ”๋Ÿญ์œผ๋กœ ์ชผ๊ฐ  ๋’ค์— k๊ฐœ ์•ˆ์—์„œ ์ถ”์ถœํ•˜๋Š” ํ˜•ํƒœ๋กœ ํ•œ๋‹จ๋ฉด ์ด๋Ÿฐ ํ˜„์ƒ์„ ์–ด๋А ์ •๋„ ๋ง‰์„ ์ˆ˜ ์žˆ๋‹ค.
    • OOB๋Š” ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ ๊ด€๋ จํ•œ ๋ชจ๋ธ์—์„  ๋ฌด์‹œํ•˜๋Š” ๊ฒƒ์ด ๋‚ซ๋‹ค. ์ด์ „์˜ ๊ณผ์ •๋“ค์ด ๋ฐ์ดํ„ฐ๋“ค์„ IIDํ™” ์‹œํ‚ค๋Š” ์ž‘์—…์„ ํ–ˆ์ง€๋งŒ ์—ฌ์ „ํžˆ ์™„๋ฒฝํ•œ IID๋ž€ ์กด์žฌํ•  ์ˆ˜ ์—†๋‹ค. ์ด๋Ÿฐ ๊ณผ์ •์—์„œ OOB๋Š” ๋ฏธ๋ž˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฏธ๋ฆฌ train ํ•ด๋ฒ„๋ฆฌ๋Š” ํšจ๊ณผ๊ฐ€ ๋“ค์–ด๊ฐ€์„œ ํผํฌ๋จผ์Šค์— ๋ฒ„๋ธ”์ด ์žˆ๊ฒŒ ๋œ๋‹ค.

Random Forest

Concept

  • Decision Tree๋Š” ๊ณผ์ ํ•ฉ ๋˜๊ธฐ ์‰ฝ๋‹ค๊ณ  ์•Œ๋ ค์ ธ ์žˆ๊ณ  ๊ฐ DT๊ฐ€ ๊ณผ์ ํ•ฉ ๋˜์–ด ์žˆ๋‹ค๋ฉด ์ด๋ฅผ ํ™œ์šฉํ•ด ๋‹จ์ˆœํ•œ Forest๋ฅผ ๋งŒ๋“ค ๊ฒฝ์šฐ Bagging๋œ ๋ชจ๋ธ์˜ Variance๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค.
  • ์ตœ์šฐ์„  ๊ณผ์ œ๋Š” ๊ฐœ๋ณ„ DT๊ฐ€ ๊ณผ์ ํ•ฉ ๋˜์ง€ ์•Š๋„๋ก ๋…ธ๋ ฅํ•˜๋Š” ๊ฒƒ์ด๊ณ  ์ดํ›„์— 'Random' Forest๋ฅผ ํ™œ์šฉํ•ด Aggregation๋œ ๋ชจ๋ธ์˜ Variance๋ฅผ ๋‚ฎ์ถ˜๋‹ค.
  • Details & Reference

vs Bagging

  • ๊ณตํ†ต์  : Random Forest๋Š” ๊ฐœ๋ณ„ ์ถ”์ •๊ธฐ(Estimator, ๊ธฐ๋ณธ ๋ชจ๋ธ)๋ฅผ ๋…๋ฆฝ์ ์œผ๋กœ Bootstrap Sampling์„ ๊ฐ€์ง€๊ณ  ํ›ˆ๋ จํ•œ๋‹ค๋Š” ์ ์—์„œ ๋น„์Šทํ•˜๋‹ค.
  • ์ฐจ์ด์  : Random Forest๋Š” 2์ฐจ level์˜ Randomness๋ฅผ ํฌํ•จํ•œ๋‹ค.
    • Bagging์€ Feature๋Š” ๊ณ„์† ๋™์ผํ•˜๊ณ  Bootstrap ๋ถ€๋ถ„์—๋งŒ Randomness๋ฅผ ๋„ฃ์€ ๋ชจ๋ธ์ธ ๋ฐ˜๋ฉด, Random Forest๋Š” ๊ฐœ๋ณ„ ์ถ”์ •๊ธฐ(Estimator)๋ฅผ ๋งŒ๋“ค ๋•Œ M๊ฐœ์˜ Feautre ์ค‘์—์„œ N๊ฐœ๋ฅผ ๋žœ๋ค์œผ๋กœ ํ•œ ๋ฒˆ ์ถ”์ถœํ•œ ํ›„ ์ด๋ฅผ ํ™œ์šฉํ•ด Bootstrap Sampling๋“ค์— ๋Œ€ํ•ด ํ›ˆ๋ จํ•˜๊ณ  Aggregation ์ž‘์—…์„ ์ง„ํ–‰ํ•œ๋‹ค. ์ฆ‰, Bagging๊ณผ ๋‹ค๋ฅด๊ฒŒ 2๋‹จ๊ณ„(Feature Sampling + Data Bootstrap Sampling)๋ฅผ ๊ฑฐ์นœ๋‹ค.
    • Bootstrap ํ‘œ๋ณธ ํฌ๊ธฐ๊ฐ€ train data set ํฌ๊ธฐ์™€ ์ผ์น˜ํ•ด์•ผ ํ•œ๋‹ค. (Bagging์€ ์ƒ๊ด€์—†๋‹ค.)

Advantage

  • Random Forest๋Š” Bagging๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์˜ˆ์ธก์˜ ๋ถ„์‚ฐ์„ ๊ณผ์ ํ•ฉ์„ ์ตœ๋Œ€ํ•œ ๋ฐฉ์ง€ํ•˜๋ฉด์„œ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค.
    • Bagging๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ฐœ๋ณ„ ์ถ”์ •๊ธฐ(Estimator) ๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ 1์— ๊ฐ€๊นŒ์šธ์ˆ˜๋ก ๋ถ„์‚ฐ์„ ์ค„์ด๋Š” ํšจ๊ณผ๋ฅผ ๊ฑฐ์˜ ๋ชป ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
  • Feature Importance๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค. ์ฐธ๊ณ ) Feature Importance

Tips for avoiding Overfitting

  • Random Forest์˜ max_features๋ฅผ ๋‚ฎ์€ ๊ฐ’์„ ์„ค์ •. ์ด๋Š” Tree๊ฐ„์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋‚ฎ์ถฐ์คŒ.
  • Early Stopping : min_weight_fraction_leaf ์ •๊ทœํ™” ๋งค๊ฐœ ๋ณ€์ˆ˜๋ฅผ ์ถฉ๋ถ„ํžˆ ํฐ ์ˆ˜๋กœ ์„ค์ •(5%)
  • Bagging + DecisionTree & max_samples=avgU (ํ‰๊ท  ๊ณ ์œ ์„ฑ, Sample Weights)
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# max_features='sqrt' was spelled 'auto' in the original snippet (removed in
# recent sklearn); base_estimator is called estimator from sklearn 1.2 on.
clf = DecisionTreeClassifier(criterion='entropy', max_features='sqrt', class_weight='balanced')
bc = BaggingClassifier(base_estimator=clf, n_estimators=1000, max_samples=avgU, max_features=1.)
```
  • Bagging + RandomForest with max_samples=avgU (average uniqueness, Sample Weights):
```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# A single-tree forest with bootstrap=False, so all resampling is done by the
# outer bagging step (base_estimator is called estimator from sklearn 1.2 on).
clf = RandomForestClassifier(n_estimators=1, criterion='entropy', bootstrap=False, class_weight='balanced_subsample')
bc = BaggingClassifier(base_estimator=clf, n_estimators=1000, max_samples=avgU, max_features=1.)
```
  • Bagging + RandomForest with Sequential Bootstrap (Sequential Bootstrap Code)
    • Implemented by swapping the standard bootstrap in RandomForest's sampling step for the sequential bootstrap, i.e. a modified RandomForest with that one part replaced.
  • class_weight='balanced_subsample': see the RandomForest hyper-parameters.
  • PCA on the features + RandomForest
    • Using PCA to reduce the number of features can reduce the depth each individual estimator (Decision Tree) needs; this lowers the correlation between the individual estimators and is therefore effective at reducing variance. A combined sketch of these tips follows.

Boosting

Concept

  • Weak Estimator๋“ค์„ ํ™œ์šฉํ•ด ๋” ๋†’์€ ์ •ํ™•๋„๋ฅผ ์–ป์–ด๋‚ผ ์ˆ˜ ์žˆ๋Š” ์ง€์— ๋Œ€ํ•œ ์˜๋ฌธ์œผ๋กœ ๋ถ€ํ„ฐ ์‹œ์ž‘ํ–ˆ๋‹ค. (Bagging์€ ๊ธฐ๋ณธ์ ์œผ๋กœ ๋ชจ๋“  ์ถ”์ •๊ธฐ๊ฐ€ ์•ฝํ•˜๋ฉด ์„ฑ๋Šฅ ๊ฐœ์„ ์˜ ํšจ๊ณผ๊ฐ€ ๋ฏธ๋ฏธํ•˜๋‹ค)
  • Details & Reference

Decision Flow


  1. ์–ด๋–ค ์ถ”์ถœ ํ™•๋ฅ  ๋ถ„ํฌ์— ๋”ฐ๋ผ ๋ณต์›์„ ๋™๋ฐ˜ํ•œ ์ถ”์ถœ์„ ์ง„ํ–‰ํ•˜์—ฌ 'ํ•˜๋‚˜'์˜ train set์„ ์ƒ์„ฑ (์ดˆ๊ธฐ ํ™•๋ฅ  ๋ถ„ํฌ๋Š” Uniform)
  2. ํ•˜๋‚˜์˜ Estimator๋ฅผ ์œ„์˜ train set์„ ํ™œ์šฉํ•ด fitting
  3. ์œ„์˜ Estimator๊ฐ€ performance metric ์ž„๊ณ„๊ฐ’์„ ๋„˜์–ด์„œ๋ฉด ํ•ด๋‹น Estimator๋Š” ๋“ฑ๋กํ•˜๊ณ  ์•„๋‹ˆ๋ฉด ํ๊ธฐ
  4. ์ดํ›„์— ์ž˜๋ชป ์˜ˆ์ธก๋œ label์—๋Š” ๋” ๋งŽ์€ ๊ฐ€์ค‘ ๊ฐ’์„ ๋ถ€์—ฌํ•˜๊ณ  ์ •ํ™•ํžˆ ์˜ˆ์ธก๋œ label์—” ๋‚ฎ์€ ๊ฐ€์ค‘ ๊ฐ’์„ ๋ถ€์—ฌํ•œ๋‹ค.
  5. N๊ฐœ ์ถ”์ •๊ธฐ ์ƒ์„ฑํ•  ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณต
  6. N๊ฐœ ๋ชจ๋ธ์˜ ๊ฐ€์ค‘ ํ‰๊ท ์ด๊ณ  ๊ฐ€์ค‘ vector๋Š” ๊ฐœ๋ณ„ Estimator์˜ performance(ex. accuracy)์— ๋”ฐ๋ผ ๊ฒฐ์ •
  • ์œ„์˜ ๊ธฐ๋ณธ์ ์ธ flow๋Š” ๋ชจ๋“  boosting ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋™์ผํ•˜๋ฉฐ ๊ฐ€์žฅ ์œ ๋ช…ํ•œ ๊ฒƒ์€ AdaBoost

Bagging vs Boosting

Comparison

  • ๊ฐ Estimator๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” ๊ณผ์ • : Parallel(Bag) vs Sequential(Boost)
  • Estimator ์„ ๋ณ„ ์—ฌ๋ถ€ : X(Bag) vs O(Boost)
  • Probability Distribution of Sampling์˜ ๋ณ€ํ™” : X(Bag) vs O(Boost)
  • Ensemble ์˜ˆ์ธก ๊ฐ’ Average ๋ฐฉ์‹ : Equal(Bag) vs Weight(Boost)

In Finance

  • Boosting์€ Bias,Variance๋ฅผ ๋ชจ๋‘ ๊ฐ์†Œ์‹œํ‚ค์ง€๋งŒ ์ฃผ์–ด์ง„ Data์— Overfittingํ•˜๊ฒŒ ํ•˜๋Š” ์š”์†Œ๊ฐ€ ๋‹ค์ˆ˜ ํฌํ•จ๋˜์–ด ์žˆ์Œ
  • ๊ธˆ์œต ๋ฐ์ดํ„ฐ๋Š” Underfitting ๋ณด๋‹ค๋Š” Overfitting์ด ๋ฌธ์ œ๊ธฐ ๋•Œ๋ฌธ์— ๊ธฐ๋ณธ์ ์œผ๋กœ Bagging์ด ์„ ํ˜ธ

Bagging For Scalability

  • SVM ๊ฐ™์€ ๊ฒฝ์šฐ๋Š” train data์˜ observation์ด 100๋งŒ๊ฐœ ์ •๋„ ๋˜๋ฉด ์†๋„๊ฐ€ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ๋А๋ ค์ง.
    • ์†๋„์— ๋น„ํ•ด ์ตœ์  ๋ณด์žฅ, ๊ณผ์ ํ•ฉ ๋ณด์žฅ ์ธก๋ฉด์—์„œ ๋ฉ”๋ฆฌํŠธ๊ฐ€ ์—†์Œ
  • ๊ฐœ๋ณ„ Estimator๋กœ๋Š” SVM์„ ์‚ฌ์šฉํ•˜๊ณ  ์ œํ•œ ์กฐ๊ฑด์— ๊ณผ์ ํ•ฉ์ด ๋˜์ง€ ์•Š๋„๋ก ์„ค์ •ํ•œ ํ›„ bagging์„ ํ•˜๊ฒŒ ๋˜๋ฉด ์†๋„ ๋ฌธ์ œ์™€ ๊ณผ์ ํ•ฉ ๋ฌธ์ œ๋ฅผ ๋™์‹œ์— ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์Œ
    • Bagging : Bagging ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅ
  • Scalability Example
    • Logistic + Bagging (or RandomForest style)
    • SVM + Bagging (or RandomForest style)
    • Random Forest + Bagging
    • Boosting + Bagging (or RandomForest style)
    • Linear SVC + Bagging (or RandomForest style)
    • ref : Combining Models (Bagging style + Base Estimator)
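A minimal sketch of the SVM case (parameter values are illustrative assumptions; the idea is to cap each base SVM's workload and let bagging parallelize and average):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Each SVM sees only a small slice of the rows and has a hard iteration cap,
# so no single fit is slow; the ensemble average restores accuracy.
svm = SVC(kernel='rbf', max_iter=10_000, tol=1e-3)
bc = BaggingClassifier(estimator=svm, n_estimators=50,
                       max_samples=0.02,  # ~2% of observations per estimator
                       n_jobs=-1)         # the 50 fits run in parallel
```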