Sample Weights - jaeaehkim/trading_system_beta GitHub Wiki

Motivation

  • Much of the machine learning literature assumes that data are IID, but for the data we actually use this assumption is unrealistic.
  • Data Structures and Labeling were our first attempts to address this problem.

issue

  • Using the Triple Barrier method proposed in Labeling gives rise to the overlapping outcomes problem.

  • Assume two intervals i and j, each with its own features (X) and label (y).

    • Overlap condition: $t_{i,1} > t_{j,0}$ with $i < j$
    • Common return: $r_{t_{j,0},\, \min\{t_{i,1},\, t_{j,1}\}}$
    • Under these conditions, y_i and y_j both depend on the return above. With labels y_i for i = 1, ..., I, overlaps can occur persistently across consecutive outcomes, which moves the data away from IID.
      • Strictly speaking, a single overlap is enough to make the data non-IID.

solution

  • sol(1) : restrict the betting horizon
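    • Condition: $t_{i,1} \le t_{i+1,0}$, so that every outcome is determined before the next observation starts and no two labels overlap.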
    • Dropping part of the training data just to avoid overlap is simple, but it badly undermines the careful preprocessing already done on the data, so it is not a good approach.
  • sol(2) : use average uniqueness
    • Instead of feeding every row of the training data to the model, compute the average uniqueness and sample only that fraction of rows, which reduces the effect of overlap.
  • sol(3) : use the sequential bootstrap
    • Sample rows of the training data sequentially, and after each draw update the draw probability distribution over the rows before making the next draw.

Number of Concurrent Labels

Concept

  • Definition of concurrent: two labels are concurrent when they share a common return. As explained above, their time intervals do not have to coincide exactly; any overlap makes them concurrent.
    • [images: 4 3_1, 4 3_2]
  • Mathematical expression of concurrency
    • [images: 4 3_3, 4 3_4, 4 3_5, 4 3_6]
      • t denotes a time point and i numbers the labels (rows of the training data). It is important to keep t and i distinct.
        • Each i holds the [start time, end time] of a triple barrier; t is an element of the set obtained by flattening all of those times into one list and sorting it.
      • Build the binary matrix {1_(t,i)} (T x I). For label i, the vector 1_(t,i) (length T) is 1 where [t_(i,0), t_(i,1)] (the time span of the triple barrier) overlaps [t-1, t], and 0 where it does not. When filling one vector, i is fixed and t is checked from 1 to T.
        • Matrix {1_(t,i)} : [image: 4 3_11]
      • This produces I vectors, from which the number of concurrent labels is defined as follows (the row sum of the matrix).
      • $c_t = \sum_{i=1}^{I} 1_{t,i}$
        • For a single time t, this is the number of triple barriers that coexist at that time, for example C_(t=10) = 4. Over t = 1, ..., T the values form the vector C_t. (compute c_t for t in price.index)
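
As a rough illustration of the definition above, the sketch below builds the indicator matrix and its row sums for a toy case. The six bars and the three label spans are hypothetical values chosen only for illustration, and integer bar positions stand in for timestamps.

```python
import numpy as np

# Hypothetical toy data: 6 price bars (t = 0..5) and three labels,
# each given as an inclusive (start_bar, end_bar) life span.
n_bars = 6
spans = [(0, 2), (2, 3), (4, 5)]

# Indicator matrix 1_{t,i}: rows are bars t, columns are labels i.
ind = np.zeros((n_bars, len(spans)), dtype=int)
for i, (t0, t1) in enumerate(spans):
    ind[t0:t1 + 1, i] = 1

# c_t is the row sum: how many labels are alive at bar t.
c = ind.sum(axis=1)
print(c.tolist())  # [1, 1, 2, 1, 1, 1]
```

Only bar 2 is shared by two labels, so it is the only bar with c_t = 2.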

Code

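import numpy as np
import pandas as pd

def mpNumCoEvents(priceIdx, tl, molecule):
    """
    Count concurrent labels per bar (refer to the explanation on Snippet 4.1).
    :param priceIdx: index of the price bars (price.index)
    :param tl: time limit, a pd.Series of label end times indexed by label start time
    :param molecule: events.index if numThreads=1 or subset otherwise
    :return: compute c_t for t in price.index
    """
    # 1) keep the events that span the period covered by molecule
    tl = tl.fillna(priceIdx[-1])  # events still open run to the last bar
    tl = tl[tl >= molecule[0]]  # events that end at or after molecule[0]
    tl = tl.loc[:tl[molecule].max()]  # events that start at or before the last end time

    # 2) count the events spanning each bar
    iloc = priceIdx.searchsorted(np.array([tl.index[0], tl.max()]))
    count = pd.Series(0, index=priceIdx[iloc[0]:iloc[1]+1])
    for tIn, tOut in tl.items():  # Series.iteritems() was removed in pandas 2.0
        count.loc[tIn:tOut] += 1

    return count.loc[molecule[0]:tl[molecule].max()]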

Average Uniqueness of a Label

Concept

  • We have defined concurrency and quantified it: the overlap present in the training data is expressed as the number of concurrent labels. From it, the uniqueness of label i at time t can be defined as follows.

  • $u_{t,i} = 1_{t,i} / c_t$

    • For a fixed t and i, u_(t,i) and 1_(t,i) are scalars, and c_t is the t-th entry of the concurrency vector.
  • The span of a triple barrier contains several u_(t,i), so they are summarised as the average uniqueness: $\bar{u}_i = \left(\sum_{t=1}^{T} u_{t,i}\right) / \left(\sum_{t=1}^{T} 1_{t,i}\right)$

  • issue

    • Computing the average uniqueness uses future information that is not yet known at the time of the observation.
      • Why this is not a problem: 1) it is used only on the training set, not on the test set. 2) The average uniqueness is not used directly to predict the label.
      • Is there really no leakage at all? Personally I do not think so. However, the data have been made as close to IID as possible, and purging and embargoing are applied later in backtesting, so I read the effect as small enough to ignore.
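
A minimal sketch of the uniqueness and average uniqueness definitions, assuming a hypothetical toy layout of six bars and three labels (the spans below are illustrative, not taken from real events):

```python
import numpy as np

# Hypothetical toy data: 6 bars, labels spanning bars (0-2), (2-3) and (4-5).
spans = [(0, 2), (2, 3), (4, 5)]
ind = np.zeros((6, len(spans)))
for i, (t0, t1) in enumerate(spans):
    ind[t0:t1 + 1, i] = 1.0

c = ind.sum(axis=1)  # c_t, number of concurrent labels per bar

# u_{t,i} = 1_{t,i} / c_t, guarding against bars where no label is alive.
u = np.divide(ind, c[:, None], out=np.zeros_like(ind), where=c[:, None] > 0)

# Average uniqueness: mean of u_{t,i} over the bars where label i is alive.
avg_u = u.sum(axis=0) / ind.sum(axis=0)
print(np.round(avg_u, 4).tolist())  # [0.8333, 0.75, 1.0]
```

The first label shares one of its three bars, giving (1 + 1 + 0.5) / 3; the second shares one of its two bars, giving (0.5 + 1) / 2; the third overlaps nothing.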

Code

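import pandas as pd

def mpSampleTW(tl, numCoEvents, molecule):
    """
    :param tl: events['tl']
    :param numCoEvents: pd.Series of c_t, indexed by price.index
    :param molecule: events.index if numThreads=1 or subset otherwise
    :return: compute u_i_bar for i in events.index
    """

    wght = pd.Series(index=molecule, dtype=float)  # explicit dtype avoids an object-dtype Series

    for tIn, tOut in tl.loc[wght.index].items():  # Series.iteritems() was removed in pandas 2.0
        # average of 1/c_t over the life span of the label
        wght.loc[tIn] = (1./numCoEvents.loc[tIn:tOut]).mean()

    return wght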
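  • Compare : price.index vs event.index(=molecule) vs tl (triple barrier : [[t1,t2],[t3,t4]....])
  • numCoEvents.loc[tIn:tOut]
    • Selects c_t over the life span [tIn, tOut] of the label.
  • .mean()
    • Averages 1/c_t over that span, which gives the average uniqueness of the label.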

Bagging Classifiers and Uniqueness

Application of average uniqueness

  • Bagging algorithm : assume a training set 'A' of size n. Draw m bootstrap samples with replacement, by default under a uniform probability distribution, producing m training sets A_i of size n' (n' < n). Fit one model per training set and combine the m models by averaging (regression case) or voting (classification case) to produce the final output.
    • Issue
      • This description of bagging assumes, as is usual in ML, that the data are IID. The models differ, but they are all fitted on bootstrap samples drawn from the same source A, so redundancy arises between them.
      • When the source is IID, redundancy is not severe even with 100 models. The further the source is from IID, the sooner redundancy becomes severe, even with a handful of models, and the extra computation no longer improves the quality of the final output. Getting the most out of the available training set therefore comes down to making it as close to IID as possible.
    • Solution
      • A first-order fix based on average uniqueness: quantify the mean average uniqueness of the training set, then restrict each bootstrap sample to that fraction of the data.
      • For example, pass the average uniqueness as the value of max_samples (a fraction) in sklearn's BaggingClassifier. See the sklearn.ensemble.BaggingClassifier documentation.

Code

self._model = self.model_generator.model(
    type=self._config['model_type'],
    n_estimators=self._config['n estimators'],
    max_samples=self.model_generator.sampleTw.mean(),  # mean average uniqueness
    oob_score=True,
    random_state=self._config['bagg random state'],
    base_random_state=self._config['base random state'],
    base_n_estimators=self._config['base n estimators'],
)
  • model can be any bagging-like algorithm.
  • max_samples receives the mean average uniqueness of the source through self.model_generator.sampleTw.mean().
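
The same idea can be sketched with scikit-learn directly. Everything in this block is an assumption made for illustration: the synthetic data, the decision tree base estimator, and the value 0.25 standing in for sampleTw.mean(). It is not the wiki's model_generator.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical training data: 200 rows, 4 features, binary labels.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Assumed value standing in for sampleTw.mean(), the mean average uniqueness.
avg_uniqueness = 0.25

clf = BaggingClassifier(
    DecisionTreeClassifier(max_depth=3),  # base estimator, passed positionally
    n_estimators=50,
    max_samples=avg_uniqueness,  # a float is read as a fraction of the rows
    oob_score=True,
    random_state=0,
)
clf.fit(X, y)
print(round(clf.oob_score_, 3))
```

With max_samples given as a float, each estimator is fitted on that fraction of the rows, so heavily overlapping data yields smaller bootstrap samples.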

Sequential Bootstrap

  • Concept
    • The key is to make the training set as close to IID as possible and to minimise the redundancy among the bootstrap samples drawn from it. Which model to use is a secondary question. The solution above keeps the uniform probability distribution and only limits max_samples, whereas the sequential bootstrap changes the probability distribution itself.
  • Sequential Bootstrap Process
    • The first bootstrap draw must start from a uniform distribution: each i is drawn with probability 1/I.
    • After i is drawn, the aim is to lower the probability of drawing any X_j that overlaps heavily with it. (The highest overlap is of course j = i.)
    • Define phi to record the sampling sequence.
    • For every j, compute the uniqueness and average uniqueness, taking into account the i already drawn. (See the sections above for the definitions.)
    • Update the probability distribution from the average uniqueness of every j and rescale it.
    • Make the second draw from the updated probability distribution.
    • Repeat until the bootstrap sample contains I draws.
      • issue : the sequential bootstrap can still draw overlapping samples, but overlap becomes less likely, so in principle it is the better method.
        • A Monte Carlo experiment can compare the uniqueness achieved by solutions 2 and 3.
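
The probability update can be followed by hand on a small case. The sketch below assumes a hypothetical indicator matrix with six bars and three labels; draw_probabilities is an illustrative helper name, not a function from the wiki's code base.

```python
import numpy as np

# Hypothetical indicator matrix: 6 bars x 3 labels with spans (0-2), (2-3), (4-5).
ind = np.zeros((6, 3))
for i, (t0, t1) in enumerate([(0, 2), (2, 3), (4, 5)]):
    ind[t0:t1 + 1, i] = 1.0

def draw_probabilities(ind, phi):
    """Draw probability of every label, given the labels already drawn (phi)."""
    avg_u = np.empty(ind.shape[1])
    for j in range(ind.shape[1]):
        cols = ind[:, phi + [j]]   # labels drawn so far plus candidate j
        c = cols.sum(axis=1)       # concurrency within that reduced set
        alive = ind[:, j] > 0      # bars where candidate j is alive
        avg_u[j] = (1.0 / c[alive]).mean()
    return avg_u / avg_u.sum()

print(draw_probabilities(ind, []).round(4).tolist())   # [0.3333, 0.3333, 0.3333]
print(draw_probabilities(ind, [1]).round(4).tolist())  # [0.3571, 0.2143, 0.4286]
```

After label 1 is drawn, its own probability falls to 3/14, the partly overlapping label 0 gets 5/14, and the non-overlapping label 2 rises to 6/14.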

Code

import numpy as np
import pandas as pd

def getIndMatrix(barIx, tl):
    """
    refer to chapter 4.5.3 a numerical example
    :param barIx: price.index
    :param tl: events['tl']
    :return: indicator matrix (rows: bars, columns: labels), 1 where the label is alive
    """
    indM = pd.DataFrame(0, index=barIx, columns=range(tl.shape[0]))
    for i, (t0, t1) in enumerate(tl.items()):  # Series.iteritems() was removed in pandas 2.0
        indM.loc[t0:t1, i] = 1

    return indM

def getAvgUniqueness(indM):
    c = indM.sum(axis=1)  # concurrency per bar
    u = indM.div(c, axis=0)  # uniqueness per bar and label
    avgU = u[u>0].mean()  # average uniqueness per label
    return avgU

def seqBootstrap(indM, sLength=None):
    """
    refer to Chapter 4.5.3 numerical example!
    """
    if sLength is None:
        sLength = indM.shape[1]
    phi = []
    while len(phi) < sLength:
        avgU = pd.Series(dtype=float)
        for i in indM:
            indM_ = indM[phi+[i]]
            # the average uniqueness of the newly added column i is the last value
            # of the Series returned by getAvgUniqueness
            avgU.loc[i] = getAvgUniqueness(indM_).iloc[-1]
        prob = avgU / avgU.sum()  # draw probability
        phi += [np.random.choice(indM.columns, p=prob)]
    return phi
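
The Monte Carlo comparison mentioned in the concept notes can be sketched as follows. This is an illustrative NumPy-only version under stated assumptions: random_indicator generates hypothetical labels with random start bars and holding periods, and the helper names, sizes and seed are arbitrary choices rather than the wiki's experiment.

```python
import numpy as np

def avg_uniqueness(ind):
    """Average uniqueness of each column of a bars x labels 0/1 matrix."""
    c = ind.sum(axis=1, keepdims=True)
    u = np.divide(ind, c, out=np.zeros_like(ind), where=c > 0)
    return u.sum(axis=0) / ind.sum(axis=0)

def seq_bootstrap(ind, rng, s_length=None):
    """Sequential bootstrap: draw probabilities follow each candidate's uniqueness."""
    n_labels = ind.shape[1]
    s_length = n_labels if s_length is None else s_length
    phi = []
    while len(phi) < s_length:
        avg_u = np.array(
            [avg_uniqueness(ind[:, phi + [j]])[-1] for j in range(n_labels)]
        )
        phi.append(int(rng.choice(n_labels, p=avg_u / avg_u.sum())))
    return phi

def random_indicator(rng, n_labels=10, n_bars=30, max_hold=8):
    """Hypothetical labels with random start bars and holding periods."""
    ind = np.zeros((n_bars, n_labels))
    for i in range(n_labels):
        t0 = rng.integers(0, n_bars - 1)
        t1 = min(n_bars - 1, t0 + rng.integers(1, max_hold + 1))
        ind[t0:t1 + 1, i] = 1.0
    return ind

rng = np.random.default_rng(42)
std_u, seq_u = [], []
for _ in range(200):
    ind = random_indicator(rng)
    # Standard bootstrap: uniform draws with replacement.
    std_phi = rng.integers(0, ind.shape[1], size=ind.shape[1]).tolist()
    std_u.append(avg_uniqueness(ind[:, std_phi]).mean())
    seq_u.append(avg_uniqueness(ind[:, seq_bootstrap(ind, rng)]).mean())

print(f"standard: {np.mean(std_u):.3f}, sequential: {np.mean(seq_u):.3f}")
```

The two printed means give a rough sense of how much uniqueness the sequential scheme gains over uniform draws on this kind of synthetic data.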

Return Attribution

  • Concept
    • Assuming the bootstrap sample is close to IID, it could be fed to the model as it is, but one more processing step can make training more effective. The hypothesis behind this step is that "not all samples are equally important".
    • Some label rows have a large absolute return and others a small one. In that case it is desirable for training to concentrate on the rows with large absolute returns.
    • This is like studying past exam papers (for example the Korean CSAT): it pays to focus on the questions that carry the most points. Return Attribution quantifies the size of the return and uses it to set the sample weights.
  • Mathematical Expression
    1. $\tilde{w}_i = \left| \sum_{t=t_{i,0}}^{t_{i,1}} \frac{r_{t-1,t}}{c_t} \right|$, where $r_{t-1,t}$ is the log return over bar t
    2. $w_i = \tilde{w}_i \, I \left( \sum_{j=1}^{I} \tilde{w}_j \right)^{-1}$, so that the weights sum to I
    • Labels come from the triple barrier method, so each label has a life span, defined as t_(i,0), t_(i,1). The weight w_i of a label is defined by combining its average uniqueness with a scaling step.
    • Most ML libraries use equal sample weights by default; passing this vector as a parameter makes the model take the weights into account.
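
A small worked example of the weights described above, assuming hypothetical prices on six bars and three labels given as bar positions. The rescaling step follows the description of w_i as the absolute attributed return scaled so that the weights sum to the number of labels.

```python
import numpy as np
import pandas as pd

# Hypothetical prices on 6 bars and three labels given as (start, end) bar positions.
price = pd.Series([100.0, 101.0, 103.0, 104.0, 103.0, 101.0])
spans = [(0, 2), (2, 3), (4, 5)]

ret = np.log(price).diff()  # log return per bar; NaN on the first bar

# Concurrency c_t: number of labels alive on each bar.
c = pd.Series(0, index=price.index)
for t0, t1 in spans:
    c.loc[t0:t1] += 1

# Absolute value of the concurrency-adjusted return summed over each life span.
w_tilde = pd.Series(
    [abs((ret.loc[t0:t1] / c.loc[t0:t1]).sum()) for t0, t1 in spans]
)

# Rescale so that the weights sum to the number of labels I.
w = w_tilde * len(w_tilde) / w_tilde.sum()
print(w.round(3).tolist())
```

The return on bar 2 is split between the two labels alive there, so neither label is credited with the full move.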

Code

import numpy as np
import pandas as pd

def mpSampleW(tl, numCoEvents, price, molecule):
    r"""
    :param tl: events['tl']
    :param numCoEvents: c_t for t in price.index
    :param price: volume / dollar bar price
    :param molecule: events.index if numThreads=1 or subset otherwise
    :return: compute \tilde{w_i} for i in events.index (refer to formula right before snippet 4.10)
    """
    ret = np.log(price).diff()  # log returns, so that they are additive
    wght = pd.Series(index=molecule, dtype=float)
    for tIn, tOut in tl.loc[wght.index].items():  # Series.iteritems() was removed in pandas 2.0
        wght.loc[tIn] = (ret.loc[tIn:tOut] / numCoEvents.loc[tIn:tOut]).sum()
    return wght.abs()

fit = clf.fit(train_input, train_label, sample_weight=train_weight) # train_weight = output of mpSampleW
  • This function adapts the one that returns only the average uniqueness, so that it produces a vector combining average uniqueness with return weighting.

Time Decay

  • Concept
    • Following the hypothesis of Adaptive Markets (Lo, A. 2017) that "older examples are less relevant than newer ones", the weights can be adjusted once more. A simple linear decay model is used.
  • Issue
    • The data are being made as close to IID as possible, so is it not a contradiction to apply time decay?
      • Truly IID data do not exist; preprocessing only makes the data as stationary, and therefore as IID-like, as possible. Given the definition of IID and the nature of financial data sources, truly IID data cannot be produced.
      • Preprocessing is done for efficient and effective training and for stable model output. Some part of the data will remain insufficiently processed, and that part is time-dependent. Time decay can improve on this remaining part.
    • In live application the weights keep changing as the number of samples grows, so reproducibility is hard to guarantee when the model is updated walk-forward. Making the monitoring results match the live results requires additional engineering work.
      • In practice the DB design needs a dedicated part for this, and a table of weights has to be accumulated at regular time intervals.

Code

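    def __getTimeDecay(self, tW, clfLastW=1.):
        """
        Apply linear time decay to the observed uniqueness tW.
        The newest observation gets weight 1 and clfLastW sets the weight towards the oldest one:
        clfLastW=1 means no decay, and a negative clfLastW gives zero weight to the oldest observations.
        """
        clfW = tW.sort_index().cumsum()
        slope = (1-clfLastW) / clfW.iloc[-1] if clfLastW >= 0 else 1./((clfLastW+1)*clfW.iloc[-1])
        const = 1 - slope*clfW.iloc[-1]
        clfW = const + slope*clfW
        clfW[clfW < 0] = 0  # weights cannot be negative
        print(f'constant: {const:.2f}, slope: {slope:.2f}, look back sample : {len(clfW[clfW!=0])} total : {len(clfW)}')
        return clfW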
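
To see what clfLastW does, the sketch below applies the same decay rule as a standalone function to ten hypothetical observations that each have uniqueness 1. The name linear_time_decay and the input series are assumptions made for illustration.

```python
import pandas as pd

def linear_time_decay(tW, clfLastW=1.0):
    """Standalone sketch of the decay rule: linear in cumulative uniqueness, floored at 0."""
    clfW = tW.sort_index().cumsum()
    if clfLastW >= 0:
        slope = (1.0 - clfLastW) / clfW.iloc[-1]
    else:
        slope = 1.0 / ((clfLastW + 1.0) * clfW.iloc[-1])
    const = 1.0 - slope * clfW.iloc[-1]
    clfW = const + slope * clfW
    clfW[clfW < 0] = 0.0  # weights cannot be negative
    return clfW

# Hypothetical input: ten observations, each with average uniqueness 1.
tW = pd.Series(1.0, index=pd.RangeIndex(10))

for last_w in (1.0, 0.5, 0.0, -0.5):
    print(last_w, linear_time_decay(tW, last_w).round(2).tolist())
```

With clfLastW = 1 every weight stays at 1. Values between 0 and 1 give a linear ramp up to 1 at the newest observation, and a negative value sets the oldest observations to zero weight. Note that clfLastW = -1 would divide by zero in this rule.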

Class Weights

  • Concept
    • Consider sample weights vs class weights. Class weights are useful when a particular class occurs only rarely.
    • In trading there are two classes, long (1) and short (-1), or three when neutral (0) is included. If the labels are only 1 and -1, there seems to be little need for class weights. In practice 0 shows up occasionally during labeling, and its class weight can be set lower or higher as needed.
    • How useful this is in practice remains doubtful.
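
One common way to set class weights is the "balanced" heuristic, n_samples / (n_classes * class count), which is what scikit-learn applies for class_weight='balanced'. The sketch below computes it with NumPy on a hypothetical label vector in which the neutral class is rare.

```python
import numpy as np

# Hypothetical label vector: long (1) and short (-1) are common, neutral (0) is rare.
y = np.array([1] * 45 + [-1] * 45 + [0] * 10)

classes, counts = np.unique(y, return_counts=True)

# Balanced heuristic: n_samples / (n_classes * count of the class).
weights = len(y) / (len(classes) * counts)
class_weight = {int(k): float(v) for k, v in zip(classes, weights)}
print({k: round(v, 3) for k, v in class_weight.items()})
# {-1: 0.741, 0: 3.333, 1: 0.741}
```

A dictionary of this form can be passed as the class_weight argument of scikit-learn classifiers that accept it, and the value for 0 can then be raised or lowered by hand.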