Feature Importance - jaeaehkim/trading_system_beta GitHub Wiki

Motivation

  • ์ž˜๋ชป๋œ ์ ˆ์ฐจ : ํŠน์ • Data ์„ ํƒ > ML ์•Œ๊ณ ๋ฆฌ์ฆ˜ fitting > backtest ํ•ด๋‹น loop ๋ฐ˜๋ณต
    • ๋™์ผํ•œ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์ง€์†์ ์œผ๋กœ ํ…Œ์ŠคํŠธ๋ฅผ ๋ฐ˜๋ณตํ•˜๋Š” ๊ฒƒ์€ ์ž˜๋ชป๋œ ๋ฐœ๊ฒฌ์œผ๋กœ ๊ท€๊ฒฐ ๊ฐ€๋Šฅ์„ฑ ๋†’์Œ. ๊ฑฐ์ง“ ํˆฌ์ž ์ „๋žต์„ ๋ฐœ๊ฒฌํ•˜๋Š”๋ฐ๋Š” 20๋ฒˆ ์ •๋„์˜ ๋ฐ˜๋ณต์ด๋ฉด ๊ฐ€๋Šฅ.

The Importance of Feature Importance

  • A backtest can be overfit very easily. This is the first fact to internalize.
  • After fitting a model with Cross Validation, look at its metric and determine which Features drove the good performance. Research is the process of studying the Features, quantifying their Importance, and making the final selection.
  • The moment you study the Features of a supervised model, it stops being a black box. The essence of machine learning is that it automates the countless decisions that would otherwise have to be specified by hand during training.
    • Lopez de Prado's analogy: hunters do not eat everything their hunting dogs bring back.
  • Research questions
    • Q1) Is the Feature always important, or only in specific environments?
    • Q2) What makes a Feature's importance change over time?
    • Q3) Can those regime changes be predicted?
    • Q4) Is a Feature judged important also important for other financial instruments?
    • Q5) Is it relevant to other asset classes as well?
    • Q6) Which Features are most commonly relevant across all financial instruments?

Feature Importance with Substitution Effects

  • When interpreting Feature Importance, the thing to watch for is Substitution Effects. These are philosophically the same thing as multicollinearity in statistics.
  • Methods subject to Substitution Effects can report an importance lower than a Feature truly deserves. In other words, there are Feature Importance methods that account for joint effects, and there are methods that compute importance feature by feature; the latter are immune to substitution effects but fail to capture joint effects. Ensembling methods from both categories is recommended.
  • The above is a way of selecting, out of the many Features Quant Researchers produce, the ones that are not noise. Alternatively, a method such as PCA (Principal Component Analysis) can recombine the dimensions and produce only the principal Features. However, because linearly combined principal components tend to be optimized on the raw numbers alone, with no domain knowledge reflected, this carries some risk; using the PCA summary as a tool for cross-checking other results is another valid approach.
  • Depending on how the data is used when quantifying Importance, there are In-sample (IS) and Out-of-sample (OS) approaches; the terminology tends to get mixed up, so to organize it:
    • Explanatory approach (In-Sample)
      • Explanation, Train data, in-sample testing, explanatory-importance, explanatory-regression analysis
    • Predictive approach (Out-of-Sample)
      • Prediction, Test data, out-of-sample testing, predictive-importance, predictive-regression analysis

MDI (Mean Decrease Impurity)

MDI์— ๋Œ€ํ•œ ์„ค๋ช…


  • Gini impurity: $G_j = 1 - \sum_{k=1}^{K} p_{j,k}^2$, where $p_{j,k}$ is the fraction of the samples at node $j$ belonging to class $k$
    • MDI is also called Gini Importance. The expression above is the Gini Impurity, and a Decision-Tree model is optimized to lower the Impurity at each node. Read intuitively, the more evenly the samples are spread across the classes, the higher the Impurity; the more they concentrate on one class, the lower it gets (purity/homogeneity increases).
  • Node importance: $NI_j = w_j G_j - w_{j_{\text{left}}} G_{j_{\text{left}}} - w_{j_{\text{right}}} G_{j_{\text{right}}}$
    • A Tree model is a structure of descending through Nodes, and each Node's Importance can be computed. $w_j$ is the fraction of all samples that reach node $j$, and $j_{\text{left}}$, $j_{\text{right}}$ denote the two nodes into which node $j$ splits. The larger the node importance, the more the Impurity decreased at that node.
  • Feature importance: $FI_i = \sum_{j \in \text{splits on } i} NI_j \,/\, \sum_{j} NI_j$, normalized as $\widehat{FI}_i = FI_i / \sum_{k} FI_k$
    • Each feature's importance is computed from the Node Importances of the nodes that split on it; the second expression is the normalized version.
    • Question) The Importance values would seem to depend on the Order in which Features are split on. How does the DecisionTree model resolve this?
      • Conjecture) The ordering presumably proceeds by choosing, at each node, the feature that lowers impurity the most. A) Given that the values change with the random seed, the ordering appears to involve randomness; a quick check follows below.
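A minimal experiment supporting the conjecture (not from the original text; the synthetic data and parameters are illustrative): with max_features=1 each split considers a single randomly chosen feature, so the MDI values shift with random_state.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
for seed in (0, 1):
    clf = RandomForestClassifier(n_estimators=100, max_features=1, random_state=seed)
    clf.fit(X, y)
    print(seed, np.round(clf.feature_importances_, 3))  # importances differ by seed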

MDI Characteristics

  • Specialized to Random Forest / Decision-Tree models, so there is model dependency in the Feature Importance computation (not usable with non-tree models). Louppe, 2013
    • Model Specific
  • It can be biased toward particular Features, producing a masking effect that ignores other features. Strobl, 2007
  • Because validation is in-sample, even Features with no predictive power come out looking important.
  • Importance values are bounded between 0 and 1 (and the normalized values sum to 1), a mathematically convenient property.
  • When two identical features are present, substitution effects cut each one's importance to half of its true value.
import numpy as np
import pandas as pd

def featImpMDI(fit, featNames):
    # feat importance based on IS mean impurity reduction
    # fit: a bagged tree ensemble (e.g. a fitted RandomForestClassifier)
    df0 = {i: tree.feature_importances_ for i, tree in enumerate(fit.estimators_)}
    df0 = pd.DataFrame.from_dict(df0, orient='index')
    df0.columns = featNames
    df0 = df0.replace(0, np.nan)  # because max_features=1, a zero means "never chosen", not "unimportant"
    imp = pd.concat({'mean': df0.mean(), 'std': df0.std() * df0.shape[0] ** -0.5}, axis=1)
    imp /= imp['mean'].sum()  # normalize so the mean importances sum to 1
    return imp
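A hedged usage sketch (X_train and y_train are placeholders for your own feature matrix and labels): max_features=1 forces every split to consider a single feature, which mitigates the masking effect noted above.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000, max_features=1)  # one feature per split
fit = clf.fit(X_train, y_train)  # X_train: pd.DataFrame, y_train: labels (placeholders)
imp = featImpMDI(fit, featNames=X_train.columns)
print(imp.sort_values('mean', ascending=False))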

MDA (Mean Decrease Accuracy)

MDA์— ๋Œ€ํ•œ ์„ค๋ช…

  • Also called Permutation Feature Importance. The basic concept: with n features X_1 through X_n, fit the model on all n features intact and record the metric performance; then permute (shuffle) one of the n feature columns out of sample and re-score. The larger the performance loss, the more important that feature is judged to be. Because only predictions are re-scored, the method has no model dependency.

MDA Characteristics

  • Out-of-sample testing is performed.
  • Applicable to every Classifier.
  • Works with a variety of performance metrics, e.g. Accuracy, F1 Score, neg log loss.
  • Correlated features distort the importance values (substitution effects).
  • Unlike MDI, it can conclude that no feature is important. Why? Because the validation is out-of-sample.
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss, accuracy_score

def featImpMDA(clf, X, y, cv, sample_weight, t1, pctEmbargo, scoring='neg_log_loss'):
    # feat importance based on OOS score reduction
    if scoring not in ['neg_log_loss', 'accuracy']:
        raise ValueError('wrong scoring method')
    cvGen = PurgedKFold(n_splits=cv, t1=t1, pctEmbargo=pctEmbargo)  # see Cross-Validation-in-Model
    scr0, scr1 = pd.Series(dtype=float), pd.DataFrame(columns=X.columns)
    for i, (train, test) in enumerate(cvGen.split(X=X)):
        X0, y0, w0 = X.iloc[train, :], y.iloc[train], sample_weight.iloc[train]
        X1, y1, w1 = X.iloc[test, :], y.iloc[test], sample_weight.iloc[test]
        fit = clf.fit(X=X0, y=y0, sample_weight=w0.values)
        # baseline OOS score with all features intact
        if scoring == 'neg_log_loss':
            prob = fit.predict_proba(X1)
            scr0.loc[i] = -log_loss(y1, prob, sample_weight=w1.values, labels=clf.classes_)
        else:
            pred = fit.predict(X1)
            scr0.loc[i] = accuracy_score(y1, pred, sample_weight=w1.values)

        for j in X.columns:
            X1_ = X1.copy(deep=True)
            np.random.shuffle(X1_[j].values) # permutation of a single column
            if scoring == 'neg_log_loss':
                prob = fit.predict_proba(X1_)
                scr1.loc[i, j] = -log_loss(y1, prob, sample_weight=w1.values, labels=clf.classes_)
            else:
                pred = fit.predict(X1_)
                scr1.loc[i, j] = accuracy_score(y1, pred, sample_weight=w1.values)

    imp = (-scr1).add(scr0, axis=0)  # per-fold score loss caused by each permutation
    if scoring == 'neg_log_loss':
        imp = imp / -scr1
    else:
        imp = imp / (1.0 - scr1)

    imp = pd.concat({'mean': imp.mean(), 'std': imp.std() * imp.shape[0] ** -0.5}, axis=1)
    return imp, scr0.mean()
  • Code analysis
    • PurgedKFold and the if scoring == 'neg_log_loss': branch follow the Cross-Validation-in-Model page, which explains why the code can be written this way.
    • The key part is scr0.loc[i] = -log_loss(y1, prob, sample_weight=w1.values, labels=clf.classes_), which computes scr0: the score with all features left intact. Both neg_log_loss and accuracy are implemented.
    • np.random.shuffle(X1_[j].values) # permutation of a single column shuffles the j-th column, and the resulting score is stored in scr1.
    • From scr0 and scr1, the importance is computed as imp = (-scr1).add(scr0, axis=0).

Feature Importance without Substitution Effects

  • Using a Feature Importance methodology that does not account for Substitution Effects can make an important Feature come out looking unimportant, so a complementary method is needed. SFI (Single Feature Importance) can serve as that complement.

Single Feature Importance

How SFI works

  • Because performance is measured one Feature at a time, the procedure can be viewed as cross-sectional, and the metric can be anything: accuracy, neg log loss, or otherwise.

SFI Characteristics

  • Applicable to every Classifier.
  • Any module that quantifies performance as a metric can be used.
  • Because out-of-sample testing is used, it can conclude that no feature is important.
def auxFeatImpSFI(featNames, clf, trnsX, cont, scoring, cvGen):
    # single feature importance: CV score of the classifier fed one feature at a time
    imp = pd.DataFrame(columns=['mean', 'std'])
    for featName in featNames:
        df0 = cvScore(clf, X=trnsX[[featName]], y=cont['bin'], sample_weight=cont['w'], scoring=scoring, cvGen=cvGen)
        imp.loc[featName, 'mean'] = df0.mean()
        imp.loc[featName, 'std'] = df0.std() * df0.shape[0] ** -0.5
    return imp
  • Code analysis
    • for featName in featNames: loops over each single feature.
    • For every single feature, cvScore(clf, X=trnsX[[featName]], y=cont['bin'], sample_weight=cont['w'], scoring=scoring, cvGen=cvGen) computes the metric; a sketch of this helper follows below.
    • Finally, the per-fold scores are reduced to a mean and a standard error (std scaled by the square root of the number of folds).
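A minimal sketch of the cvScore helper assumed above, under the same conventions as featImpMDA (the full version belongs with the Cross-Validation-in-Model material); it returns one score per fold of the supplied cvGen:

import numpy as np
from sklearn.metrics import log_loss, accuracy_score

def cvScore(clf, X, y, sample_weight, scoring='neg_log_loss', cvGen=None):
    score = []
    for train, test in cvGen.split(X=X):
        fit = clf.fit(X.iloc[train], y.iloc[train], sample_weight=sample_weight.iloc[train].values)
        if scoring == 'neg_log_loss':
            prob = fit.predict_proba(X.iloc[test])
            score.append(-log_loss(y.iloc[test], prob, sample_weight=sample_weight.iloc[test].values, labels=clf.classes_))
        else:
            pred = fit.predict(X.iloc[test])
            score.append(accuracy_score(y.iloc[test], pred, sample_weight=sample_weight.iloc[test].values))
    return np.array(score)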

Orthogonal Features

How Orthogonal Features work

  • Standardization: $Z_{t,n} = \dfrac{X_{t,n} - \mu_n}{\sigma_n}$
  • Eigendecomposition: $Z'Z = W \Lambda W'$
  • Orthonormal features: $P = ZW$
  • Orthonormality check: $P'P = W'Z'ZW = W'W \Lambda W'W = \Lambda$
    • The newly linear-combined Features produced by PCA do not reduce all Substitution Effects, but they do reduce the Linear Substitution Effects.
    • Given a Feature Matrix X (t x n), standardizing it with the vectors sigma_n (1 x n) and mu_n (1 x n) yields the matrix Z (t x n).
    • Eigendecomposition of Z'Z (an n x n matrix) yields the diagonal eigenvalue matrix Lambda (n x n, in descending order) and the orthonormal matrix W (n x n).
    • The orthonormal feature matrix is computed as P = ZW, and orthonormality is verified by computing P'P.
    • Why compute Z before the eigendecomposition?
      • Centering the data: it makes the first principal component line up exactly with the main direction of the Observations (X, the Train data).
      • Scaling the data: it makes the decomposition focus on explaining correlation rather than variance. Without it, the result is dominated by whichever Feature has the largest variance.
import numpy as np
import pandas as pd

def get_eVec(dot, varThres):
    # eigendecomposition of the (symmetric) dot-product matrix, eigenvalues descending
    eVal, eVec = np.linalg.eigh(dot)
    idx = eVal.argsort()[::-1]
    eVal, eVec = eVal[idx], eVec[:, idx]
    eVal = pd.Series(eVal, index=['PC_' + str(i + 1) for i in range(eVal.shape[0])])
    eVec = pd.DataFrame(eVec, index=dot.index, columns=eVal.index)
    # keep only enough components to reach the cumulative variance threshold
    cumVar = eVal.cumsum() / eVal.sum()
    dim = cumVar.values.searchsorted(varThres)
    eVal, eVec = eVal.iloc[:dim + 1], eVec.iloc[:, :dim + 1]
    return eVal, eVec

def orthoFeats(dfX, varThres=0.95):
    # standardize X into Z, then project onto the retained eigenvectors: P = ZW
    dfZ = dfX.sub(dfX.mean(), axis=1).div(dfX.std(), axis=1)
    dot = pd.DataFrame(np.dot(dfZ.T, dfZ), index=dfX.columns, columns=dfX.columns)
    eVal, eVec = get_eVec(dot, varThres)
    dfP = np.dot(dfZ, eVec)
    return dfP

def PCA_rank(dfX):
    # rank features by eigenvalue size; columns are permuted first so that
    # numerical tie-breaking does not depend on the original column order
    dfZ = dfX.sub(dfX.mean(), axis=1).div(dfX.std(), axis=1)
    perm = np.random.permutation(dfZ.columns)
    dfZ = dfZ.reindex(perm, axis=1)
    dot = np.nan_to_num(pd.DataFrame(np.dot(dfZ.T, dfZ), index=perm, columns=perm))
    eVal, eVec = np.linalg.eigh(dot)
    # rank 1 = largest eigenvalue; the index must follow the permuted column order
    return pd.Series(eVal.shape[0] - eVal.argsort().argsort(), index=perm, name='PCA_rank')
  • Code analysis
    • get_eVec computes the W matrix (eVec); the function body is just the formulas above turned into code.
    • orthoFeats computes the P matrix (dfP, the orthonormal features).

Key Characteristics and Caveats of Orthogonal Features

  • ์ง๊ตํ™”๋ฅผ ํ†ตํ•ด ๊ณ ์œณ๊ฐ’๊ณผ ์—ฐ๊ณ„๋œ ์ •๋„๊ฐ€ ์ž‘์€ ํŠน์ง•์„ ๋ฒ„๋ฆผ์œผ๋กœ์จ ์ฐจ์› ์ถ•์†Œ์™€ ์—ฐ์‚ฐ ์†๋„๋ฅผ ์ฆ๊ฐ€์‹œํ‚ฌ ์ˆ˜ ์žˆ๊ณ  ์ง๊ต ํŠน์ง•์„ ์–ป์–ด๋‚ผ ์ˆ˜ ์žˆ์Œ. ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ๋ฅผ ํ•ด์„ํ•˜๋Š”๋ฐ ๋„์›€์ด ๋œ๋‹ค.
  • PCA๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ label์— ๋Œ€ํ•œ ์ง€์‹ ์—†์ด (Unsupervised Learning)์„ ํ†ตํ•ด ์–ด๋–ค ํŠน์ง•์ด ๋‹ค๋ฅธ ํŠน์ง• ๋ณด๋‹ค ์ฃผ์š”(Principal)ํ•˜๋‹ค๋Š” ๊ฒฐ์ •์„ ๋‚ด๋ฆฐ๋‹ค. ์ฆ‰, ๊ณผ์ ํ•ฉ์˜ ๊ฐ€๋Šฅ์„ฑ์„ ๊ณ ๋ คํ•˜์ง€ ์•Š์Œ
  • PCA ๊ฒ€์ฆ ํ™œ์šฉ ๋ฐฉ๋ฒ•
    • ๋ชจ๋“  feature๊ฐ€ random์ด๋ผ๋ฉด PCA์˜ ์ˆœ์œ„์™€ MDI,MDA,SFI ์ˆœ์œ„์™€ ์ผ์น˜ํ•˜์ง€ ์•Š์„ ๊ฒƒ์ด๋‹ค. ์œ ์‚ฌํ•  ์ˆ˜๋ก ๊ณผ์ ํ•ฉ ๊ฐ€๋Šฅ์„ฑ์ด ๋‚ฎ์Œ์„ ์˜๋ฏธํ•œ๋‹ค๊ณ  ํ•ด์„ ๊ฐ€๋Šฅ
    • egien values (inverse of pca rank) ~~ mdi,mda,sfi rank ์˜ weighted Kendallโ€™s tau๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ๊ณผ์ ํ•ฉ ์ƒํƒœ๋ฅผ ์ฒดํฌํ•  ์ˆ˜ ์žˆ๋‹ค.
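A hedged sketch using scipy.stats.weightedtau (mdaImp is assumed to be the output of featImpMDA, and pcaRank the output of PCA_rank, aligned on the same feature index):

from scipy.stats import weightedtau

invPcaRank = 1.0 / pcaRank.loc[mdaImp.index]  # inverse rank ~ eigenvalue size
tau = weightedtau(mdaImp['mean'].values, invPcaRank.values)[0]
print('weighted Kendall tau:', tau)           # closer to 1 -> less likely overfit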

Parallelized vs. Stacked Feature Importance

Parallelized

  • Per-instrument importance: $\lambda_{i,j,k}$, the importance of feature $j$ for instrument $i$ under criterion $k$, aggregated across instruments into $\Lambda_{j,k}$
    • small lambda: $\lambda_{i,j,k}$ denotes the feature importance $j$ for each instrument $i$ and criterion $k$ (a train-data/label-data pairing); merging these yields the large lambda $\Lambda_{j,k}$. The more broadly a feature matters across the universe of investment products, the more likely it reflects an underlying theoretical mechanism.
    • Computation parallelizes well, and although the feature ranking can differ per instrument, the rankings can be averaged. Averaging over instruments, however, misses part of the joint effects; a sketch of this aggregation follows below.

Stacked

  • Stacking: the datasets of different instruments are combined into a single dataset

    • Different instruments are stacked into one dataset; the stacked X' must be standardized on a rolling trailing window.
      • If X is IID, then X' is IID as well.
    • Compared to the Parallelized version it suits much larger datasets, and the importance computation itself becomes simpler.
    • The bias from outliers and overfitting is smaller.
    • Joint effects are no longer missed.
    • Stacking consumes enormous memory and resources as the data grows, so HPC (High Performance Computing) becomes very important; a sketch of the stacking step follows below.
  • These are the two ways of structuring a Feature Importance computation.
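A hedged sketch of the stacking step (dfs is a hypothetical dict mapping instrument name to its t x n feature DataFrame; the window length is illustrative):

import pandas as pd

def stackDatasets(dfs: dict, window: int = 252) -> pd.DataFrame:
    # standardize each instrument's features on a rolling trailing window,
    # then stack everything into a single dataset X'
    out = []
    for inst, dfX in dfs.items():
        dfZ = (dfX - dfX.rolling(window).mean()) / dfX.rolling(window).std()
        dfZ['instrument'] = inst  # keep each row's provenance
        out.append(dfZ.dropna())
    return pd.concat(out, axis=0)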

Application to Quant System

  • The most principal PCA feature could itself be added as an extra feature, even though some redundancy would result.
  • Feature-importance computation needs multi-processing attached, as sketched below.
    • There are far too many labeling cases.
    • As much work as possible should be finished at the model stage, before moving on to backtesting, so that the compute load on the backtesting side is kept to a minimum.
  • The stacking approach looks like the better way to go.
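A hedged sketch of parallelizing importance runs over labeling cases with the standard library (runCase, labelingCases, labels, and the arguments passed through to featImpMDA are hypothetical placeholders):

from concurrent.futures import ProcessPoolExecutor

def runCase(case):
    # fit and score the feature importance for one labeling configuration (hypothetical)
    imp, oos = featImpMDA(clf, X=X_train, y=labels[case], cv=5, sample_weight=w,
                          t1=t1, pctEmbargo=0.01)
    return case, imp

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        results = dict(pool.map(runCase, labelingCases))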