Cross Validation in Model - jaeaehkim/trading_system_beta GitHub Wiki

Motivation

  • CV์˜ ๋ชฉ์ ์€ Test Data๋ฅผ ์ตœ๋Œ€ํ•œ ํ™œ์šฉํ•˜์—ฌ ML ๋ชจ๋ธ์˜ ์ผ๋ฐ˜์ ์ธ ์˜ค์ฐจ๋ฅผ ์•Œ์•„๋‚ด์–ด ๊ณผ์ ํ•ฉ์„ ๋ง‰๋Š” ๊ฒƒ์ด๋‹ค. K-Fold์˜ ๊ฒฝ์šฐ Test Data ๊ตฌ๊ฐ„์„ ์—ฌ๋Ÿฌ ๋ฒˆ ๋Œ์•„๊ฐ€๋ฉฐ ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์˜ ํŽธ์ค‘์„ ๋ง‰์•„ ๊ณผ์ ํ•ฉ์„ ๋ฐฉ์ง€ํ•˜๊ณ  ์—ฌ๋Ÿฌ ๊ตฌ๊ฐ„์„ ๋‚˜๋ˆ„์–ด ๋ณ‘๋ ฌ์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ๊ฐ€ IIDํ•ด์•ผ ํ•˜๋Š” ํŠน์ง•์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.
  • ๊ธˆ์œต์—์„œ CV๋ฅผ ์ ์šฉํ•  ๋•Œ ์–ด๋–ค ์ด์Šˆ๊ฐ€ ์žˆ๋Š” ์ง€๋ฅผ ์—ฐ๊ตฌ

The Goal of Cross-Validation


  • ML์˜ ๋ชฉ์  : ๋ฐ์ดํ„ฐ์˜ ์ผ๋ฐ˜ ๊ตฌ์กฐ๋ฅผ ์•Œ์•„๋‚ด์„œ unseen feature๋ฅผ ๋ณด๊ณ  ๋ฏธ๋ž˜๋ฅผ ์˜ˆ์ธกํ•˜๊ณ ์ž ํ•จ
  • Test Data๋ž€ ๋ถ€๋ถ„๋„ Train์„ ์‹œํ‚จ๋‹ค๋ฉด ๋ฐ์ดํ„ฐ๋ฅผ ์š”์•ฝํ•˜๋Š” ๋Šฅ๋ ฅ๋งŒ ์ฆ๊ฐ€ํ•  ๋ฟ ์˜ˆ์ธก๋ ฅ์€ ์ „ํ˜€ ์—†๊ฒŒ ๋จ.
  • ๋ฐ์ดํ„ฐ๊ฐ€ IID ํ•˜๋‹ค๋Š” ์ „์ œ ํ•˜์—์„œ K-Fold๋Š” ML ๋ชจ๋ธ์„ ์ผ๋ฐ˜์ ์œผ๋กœ ๊ฒ€์ฆํ•˜๊ธฐ์— ์ข‹์€ Tool์ด ๋จ.
  • CV ์ •ํ™•๋„๋Š” ๊ฐ Fold ๋ณ„์˜ Test Metric์œผ๋กœ ๊ณ„์‚ฐํ•˜๊ณ  ์ด๋•Œ, 1/2๋ฅผ ๋„˜๋Š”๋‹ค๋ฉด ML Model์ด ๋ฌด์—‡์ธ๊ฐ€ ํ•™์Šตํ–ˆ๋‹ค๊ณ  ๊ฐ„์ฃผํ•œ๋‹ค.
  • ML Model์˜ CV๋Š” Hyper Parameter Tuning์œผ๋กœ ์‚ฌ์šฉํ•˜๊ณ  ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” Backtesting์ด๋‹ค.

Why K-Fold CV Fails in Finance

  • ๋ชจ๋ธ ๊ฐœ๋ฐœ ๊ด€์ : IID Process๋ฅผ ๋”ฐ๋ฅธ Data๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๊ฑฐ์˜ ์—†๊ธฐ ๋•Œ๋ฌธ
    • Data Structures, Fractionally Differentiated Features๋ฅผ ํ†ตํ•ด์„œ ๋ฐ์ดํ„ฐ๋ฅผ IIDํ™” ์‹œํ‚ค๋ ค๊ณ  '์ด๋ฏธ' ๋…ธ๋ ฅํ•จ.
    • CV์—์„  Test Data๊ฐ€ ์—ฌ๋Ÿฌ ๊ตฌ๊ฐ„์—์„œ ์ƒ๊ธฐ๋ฉด์„œ ์‹œ๊ณ„์—ด ์ƒ๊ด€์„ฑ์œผ๋กœ ์ธํ•ด ์ •๋ณด๊ฐ€ ์•ฝ๊ฐ„ ๋ˆ„์ถœ๋  ์ˆ˜ ์žˆ๋Š” ๋ถ€๋ถ„์„ ๋ฐฉ์ง€ํ•˜๊ณ ์ž Purge, Embargo๋ฅผ ์ œ์‹œํ•จ.
  • ๋ฐฑํ…Œ์ŠคํŒ… ๊ด€์ : ๋ฐฑํ…Œ์ŠคํŠธ์˜ ๋ชฉ์ ์ธ ๋‚˜์œ ๋ชจ๋ธ์„ ํ๊ธฐํ•˜๋Š” ๊ฒƒ์— ์“ฐ๋Š” ๊ฒƒ์ด ์•„๋‹Œ ๋ชจ๋ธ ์ž์ฒด๋ฅผ ์กฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ

Information Leakage

  • Information leakage happens when the training data contain information that also appears in the test data. Since the train/test sets are split at exact physical boundaries, this should not occur unless there is a technical mistake. However, because of serial correlation, X_t ~ X_(t+1) and Y_t ~ Y_(t+1); so if the training set physically ends at index t and index t+1 is assigned to the test set, this is 'effectively' information leakage.
    • Even if X is an irrelevant feature, the prediction E[Y_(t+1) | X_(t+1)] effectively converges to Y_(t+1), because the model has memorized the nearly identical training pair (X_t, Y_t); the CV score is inflated accordingly.
    • To count as information leakage, both sides must be related, i.e. (X_t, Y_t) ~ (X_(t+1), Y_(t+1)); similarity in X alone or in Y alone is not enough. A toy sketch of this effect follows below.

A Solution: Purged K-Fold CV


  • Purging (delete overlap): remove every training observation whose 'label' overlaps in time with a test label > this removes, in one pass, the overlap introduced by the labeling step.
  • Embargo: an additional step after purging. Some serial-correlation information can still remain, so a fixed interval of training data immediately after the test block (in chronological order) is conservatively dropped > a more refined approach could measure how far the leakage actually extends and size the embargo accordingly.

Purging the Training Set

  • The goal is to eliminate any intersection between the information sets I (train) and J (test).
  • With Triple-Barrier labels (see the Triple Barrier page), a training label can overlap a test label in three ways: it starts inside the test interval, it ends inside the test interval, or it envelops the test interval. getTrainTimes below removes training observations in all three cases.
def getTrainTimes(t1, testTimes):
    # t1: pd.Series, index = label start time, value = label end time
    # testTimes: pd.Series of the test observations' label intervals
    trn = t1.copy(deep=True)
    for i, j in testTimes.items():  # .iteritems() was removed in pandas 2.0
        df0 = trn[(i <= trn.index) & (trn.index <= j)].index   # train label starts within test
        df1 = trn[(i <= trn) & (trn <= j)].index               # train label ends within test
        df2 = trn[(trn.index <= i) & (j <= trn)].index         # train label envelops test
        trn = trn.drop(df0.union(df1).union(df2))
    return trn
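  • A minimal usage sketch of getTrainTimes with made-up timestamps (not from the repository): t1 maps each label's start to its end, and testTimes holds the test observations' label intervals.
import pandas as pd

idx = pd.date_range('2020-01-01', periods=20, freq='D')
t1 = pd.Series(idx + pd.Timedelta(days=5), index=idx)   # each label spans 5 days
testTimes = t1.iloc[8:11]                                # test observations start on days 8-10
trainTimes = getTrainTimes(t1, testTimes)
# every observation whose 5-day label overlaps a test label is purged,
# leaving label starts 2020-01-01..01-03 and 2020-01-17..01-20
print(trainTimes.index)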

Embargo

  • Embargo is used where purging alone cannot prevent the leak, and it removes only the part where information can still flow forward in time: in a Train - (1) - Test - (2) - Train layout it is applied only to segment (2), the training data immediately after the test block. The embargo length is chosen somewhat arbitrarily; a small value of roughly 0.01T is said to be more than enough.
import pandas as pd


def getEmbargoTimes(times, pctEmbargo):
    # map each bar to the bar one embargo period (pctEmbargo of the sample) later
    step = int(times.shape[0] * pctEmbargo)
    if step == 0:
        mbrg = pd.Series(times, index=times)
    else:
        mbrg = pd.Series(times[step:], index=times[:-step])
        # Series.append was removed in pandas 2.0; concatenate the tail instead
        mbrg = pd.concat([mbrg, pd.Series(times[-1], index=times[-step:])])
    return mbrg
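  • A minimal usage sketch of getEmbargoTimes with made-up timestamps: with a 1% embargo over 100 daily bars, each bar maps to the bar one step later, i.e. training after a test block that ends at bar t may only resume at mbrg[t].
import pandas as pd

times = pd.date_range('2020-01-01', periods=100, freq='D')
mbrg = getEmbargoTimes(times, pctEmbargo=0.01)           # step = 1 bar
print(mbrg.loc[pd.Timestamp('2020-02-01')])              # -> 2020-02-02, the first admissible training bar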

The Purged K-Fold Class

import numpy as np
import pandas as pd
from sklearn.model_selection._split import _BaseKFold


class PurgedKFold(_BaseKFold):
    def __init__(self, cv_config, tl=None):
        self._cv_config = cv_config
        n_splits = self._cv_config.get('n_splits', 3)
        pctEmbargo = self._cv_config.get('pctEmbargo', 0.)
        if not isinstance(tl, pd.Series):
            raise ValueError('Label Through Dates must be a pd.Series')
        super().__init__(n_splits=n_splits, shuffle=False, random_state=None)
        self.tl = tl                    # index: label start time, value: label end time
        self.pctEmbargo = pctEmbargo

    def split(self, X, y=None, groups=None):
        if not X.index.equals(self.tl.index):
            raise ValueError('X and ThruDateValues must have the same index')
        indices = np.arange(X.shape[0])
        mbrg = int(self.pctEmbargo * X.shape[0])   # embargo length in bars
        test_starts = [(i[0], i[-1] + 1) for i in np.array_split(indices, self.n_splits)]
        for i, j in test_starts:
            t0 = self.tl.index[i]                  # start of the test block
            test_indices = indices[i:j]
            # first index whose label starts after every test label has ended
            maxTlIdx = self.tl.index.searchsorted(self.tl.iloc[test_indices].max())
            # left train block: labels that end at or before the start of the test block
            train_indices = self.tl.index.searchsorted(self.tl[self.tl <= t0].index)
            if maxTlIdx < X.shape[0]:
                # right train block: purged up to maxTlIdx, then embargoed by mbrg bars
                train_indices = np.concatenate([train_indices, indices[maxTlIdx + mbrg:]])
            yield train_indices, test_indices
  • self.tl์€ pd.Series๋กœ index,value๋Š” ๊ฐ๊ฐ Triple Barrier์˜ ์‹œ์ž‘์ ๊ณผ ๋์ ์„ ๋‚˜ํƒ€๋ƒ„.
  • PurgedKFold ํด๋ž˜์Šค์˜ split method ๋ถ„์„
    • mbrg๋Š” ์— ๋ฐ”๊ณ ๋ฅผ ์ ์šฉํ•˜๊ธฐ ์œ„ํ•œ ๊ธธ์ด๋ฅผ ๋ฏธ๋ฆฌ percentage๋ฅผ ์ด์šฉํ•ด ๊ธธ์ด ์‚ฐ์ถœ
    • maxTlIdx๋Š” purging์œผ๋กœ **train - (1) - test - (2) - train์—์„œ (2)**ํŒŒํŠธ์˜ purging ๋ถ€๋ถ„ idx๋ฅผ ๊ณ„์‚ฐ
    • train_indices๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ณผ์ •์—์„œ ์œ„์˜ (1) ํŒŒํŠธ์˜ purging์„ ํ•ด๋ฒ„๋ฆฐ train_indices๋ฅผ ์‚ฐ์ถœ
    • np.concatenate([train_indices, indices[maxTlIdx+mbrg:]]) ์„ ํ†ตํ•ด (1),(2)์˜ purging, embargo๋ฅผ ์ ์šฉํ•œ ์ตœ์ข… train indices๋ฅผ ์‚ฐ์ถœ
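  • A minimal usage sketch of PurgedKFold on synthetic data (index, sizes and label horizon are illustrative, not from the repository):
import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', periods=100, freq='D')
X = pd.DataFrame(np.random.randn(100, 3), index=idx)
tl = pd.Series(idx + pd.Timedelta(days=5), index=idx)    # 5-day label horizon

pkf = PurgedKFold(cv_config={'n_splits': 5, 'pctEmbargo': 0.01}, tl=tl)
for train_idx, test_idx in pkf.split(X):
    # training observations whose labels overlap the test block have been purged,
    # and an extra 1% embargo is applied after the block
    print(len(train_idx), len(test_idx))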

sklearn bug

import numpy as np
from sklearn.metrics import log_loss, accuracy_score


def cvScore(clf, X, y, sample_weight=None, scoring='neg_log_loss', t1=None, cv=None, cvGen=None, pctEmbargo=None):
    if scoring not in ['neg_log_loss', 'accuracy']:
        raise Exception('Wrong scoring method')
    if cvGen is None:
        # build a purged CV generator, matching the PurgedKFold signature above
        cvGen = PurgedKFold(cv_config={'n_splits': cv, 'pctEmbargo': pctEmbargo}, tl=t1)
    if sample_weight is None:
        sample_weight = np.ones(len(X))

    score = []
    for train, test in cvGen.split(X=X):
        fit = clf.fit(
            X=X.iloc[train, :], y=y.iloc[train],
            sample_weight=sample_weight[train]
        )
        if scoring == 'neg_log_loss':
            prob = fit.predict_proba(X.iloc[test, :])
            score_ = -log_loss(y.iloc[test], prob, sample_weight=sample_weight[test], labels=clf.classes_)
        else:
            pred = fit.predict(X.iloc[test, :])
            score_ = accuracy_score(y.iloc[test], pred, sample_weight=sample_weight[test])
        score.append(score_)
    return np.array(score)
  • ์ฝ”๋“œ ์„ค๋ช…
    • clf.fit์„ ํ†ตํ•ด ๋”ฐ๋กœ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  scoring input์— ๋”ฐ๋ผ predict/predict_proba์™€ metric function์ธ log_loss, accuracy_score๋ฅผ ํ™œ์šฉํ•˜์—ฌ score๋ฅผ ๋”ฐ๋กœ ๊ณ„์‚ฐ
    • model์„ ๊ฒ€์ฆํ•  ๋•Œ ์ฃผ๋กœ ์“ฐ๋ฏ€๋กœ feature importance ์ชฝ ๋ชจ๋“ˆ๊ณผ ์—ฐ๊ณ„