Data Structures - jaeaehkim/trading_system_beta GitHub Wiki

Motivation

  • ML Model์„ ํ•™์Šต์‹œํ‚ค๊ณ  Real World์—์„œ ์ข‹์€ ํผํฌ๋จผ์Šค๋ฅผ ๋ณด์ด๊ธฐ ์œ„ํ•ด์„  ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฐ€์žฅ ์ค‘์š”ํ•˜๋‹ค. ์ด๊ฒƒ์€ ์ธ๊ฐ„์ด ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹๊ณผ ๋‹ค๋ฅด์ง€ ์•Š๋‹ค. ์ผ๋‹จ, ์ข‹์€ ๊ต์žฌ๊ฐ€ ์žˆ์–ด์•ผ ํ•œ๋‹ค.
  • ๊ฐ€์žฅ Rawํ•œ '๋น„๊ตฌ์กฐํ™”๋œ ๊ธˆ์œต ๋ฐ์ดํ„ฐ'์—์„œ Bar(๊ตฌ์กฐํ™”๋œ ๋ฐ์ดํ„ฐ ํ˜•ํƒœ, Table์˜ row)๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ณผ์ •์„ ์†Œ๊ฐœํ•œ๋‹ค.
  • ํ•ต์‹ฌ์€ Bar์™€ ML model๊ฐ„์˜ ์—ฐ๊ฒฐ์ด๋‹ค. ๊ธˆ์œต์— ML์„ ์ ์šฉํ•˜๋ ค๋Š” ๋Œ€๋ถ€๋ถ„์˜ ์‚ฌ๋žŒ๋“ค์ด ๋†“์น˜๋Š” ๋ถ€๋ถ„์€ ML Model์˜ ๋Œ€์ „์ œ์ด๋‹ค.
  • ML Model์˜ ๋Œ€์ „์ œ๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ IID(Independent and identically distributed)ํ•œ ์ƒํƒœ์—ฌ์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.
  • Bar๋ฅผ IIDํ™” ์‹œํ‚ค๋ฉด์„œ๋„ ๋ฐ์ดํ„ฐ์— ์žˆ๋Š” ์‹œ๊ทธ๋„์„ ์œ ์ง€์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์— ๋Œ€ํ•ด ์†Œ๊ฐœํ•œ๋‹ค.

Essential Types of Financial Data

Financial raw data

  • ๊ธฐ์ดˆ๋ฐ์ดํ„ฐ : ๊ฐ๋… ๊ธฐ๊ด€ ์ œ์ถœ ๋ฐ์ดํ„ฐ, ๋Œ€๋ถ€๋ถ„ ์žฌ๋ฌด์ œํ‘œ. ๋ฐฑํ•„๋ง(backfilled) & ์ˆ˜์ •๊ฐ’(reinstated value)์˜ ์‹ฌ๊ฐํ•œ ์˜ค๋ฅ˜๊ฐ€ ๋งŽ์•„์„œ point in time ์ฒ˜๋ฆฌ๋ฅผ ํ•ด์ค˜์•ผ ํ•จ.
  • ์‹œ์žฅ๋ฐ์ดํ„ฐ : ๊ฑฐ๋ž˜์†Œ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ๋ชจ๋“  ๋ฐ์ดํ„ฐ. FIX ๋ฉ”์„ธ์ง€๋ฅผ ํ†ตํ•ด ๋ชจ๋“  ๊ฑฐ๋ž˜ ์‚ฌํ•ญ ์žฌ๊ตฌ์„ฑ ๊ฐ€๋Šฅ, ํ•˜๋ฃจ์— 10TB์”ฉ ์Œ“์ž„
  • ๋ถ„์„๋ฐ์ดํ„ฐ : ๊ธฐ์ดˆ, ์‹œ์žฅ, ๋Œ€์ฒด ๋˜๋Š” ๋‹ค๋ฅธ ์ข…ํ•ฉ ๋ฐ์ดํ„ฐ๋กœ ๋ถ€ํ„ฐ ํŒŒ์ƒ๋œ ๋ฐ์ดํ„ฐ ex) ์ˆ˜์ต ์˜ˆ์ธก, ๋‰ด์Šค ๋ถ„์„, ์‹ ์šฉ ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ
  • ๋Œ€์ฒด๋ฐ์ดํ„ฐ : ๋น„์ฆˆ๋‹ˆ์Šค ํ”„๋กœ์„ธ์Šค, ์„ผ์„œ, ์œ„์„ฑ ๋ฐ์ดํ„ฐ ๋“ฑ. ์ฐจ๋ณ„์  : ์ตœ์ดˆ ์ •๋ณด. ๋ถ„๋ช… ์‹œ๊ทธ๋„์„ ์ฐพ์•„๋‚ด๋ฉด ๋ฉ”๋ฆฌํŠธ๋Š” ์žˆ์œผ๋‚˜ ๋น„์šฉ์ด ํด ๊ฒƒ์ด๋ผ๋Š” ์ ์—์„œ Trade-off

Bars

  • ๊ตฌ์กฐํ™” ๋˜์ง€ ์•Š์€ ์œ„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ์กฐํ™” ์‹œํ‚จ ์ƒํƒœ๋ฅผ ๋งํ•œ๋‹ค. ๊ตฌ์กฐํ™”๋Š” 'Table'๊ณผ ๊ฐ™์€ ์ƒํƒœ๋ผ๊ณ  ๋ณด๊ณ  Bar๋Š” Table์˜ row๋ฅผ ์˜๋ฏธ.
  • ํ”ํžˆ, API๋ฅผ ํ†ตํ•ด์„œ ์ฃผ๋Š” ๋ชจ๋“  ๊ฐ€๊ฒฉ ๋ฐ์ดํ„ฐ๋“ค์ด Bar ํ˜•ํƒœ๋ฅผ ๋„๊ณ  ์žˆ์Œ.

Standard Bars

Time Bars

  • Time Bar์˜ ๊ฒฝ์šฐ ๊ณ ์ •๋œ ์‹œ๊ฐ„ ๊ฐ„๊ฒฉ์œผ๋กœ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๊ณ , ๋ณดํŽธ์ ์œผ๋กœ ์‚ฌ์šฉ๋˜์–ด์„œ ๊ฐ„ํŽธํ•จ
  • ๋ฐ์ดํ„ฐ : timestamp, ohlcv, vwap
  • ๋ฌธ์ œ์  : ์ค‘์š”ํ•œ ์ •๋ณด๊ฐ€ ์žˆ๋Š” ๊ตฌ๊ฐ„์—์„  ๋งŽ์€ ๊ฑฐ๋ž˜๊ฐ€ ์ด๋ค„์ง„๋‹ค. ํ•˜์ง€๋งŒ Time Bar๋Š” ์ด๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ  ๋ชจ๋“  ์‹œ๊ฐ„ ๋‹จ์œ„์— ๋”ฐ๋ผ ์ผ์ •ํ•œ ์ค‘์š”๋„๋ฅผ ๋ถ€์—ฌํ•ด signal์ด ์™œ๊ณก๋œ๋‹ค.

Tick Bars

  • ๊ฑฐ๋ž˜ ์ฒด๊ฒฐ์ผ ๋ฐœ์ƒํ•˜๋Š” ์‹œ์ ๋งˆ๋‹ค ์ •๋ณด๋ฅผ ์ถ”์ถœํ•จ. ์ฒด๊ฒฐ ๋‹จ์œ„๋กœ๋Š” ์›์†Œ ๋ฐ์ดํ„ฐ.
  • ๋ฌธ์ œ์  : ๋งค์šฐ ํฐ ์ด์ƒ์น˜๊ฐ€ ์กด์žฌํ•จ. ์˜ˆ๋ฅผ๋“ค์–ด, ๋™์‹œํ˜ธ๊ฐ€์˜ ๊ฒฝ์šฐ ์š”์ฒญ๋งŒ ์Œ“์ด๊ณ  ๊ฑฐ๋ž˜๋Š” ์ผ์–ด๋‚˜์ง€ ์•Š์œผ๋ฉฐ ์ตœ์ข…์ ์œผ๋กœ๋Š” ํ•˜๋‚˜์˜ bar๋กœ ์ฒ˜๋ฆฌ๋จ.

Volume Bars

  • Tick Bars๋ฅผ ๊ธฐ์ค€์œผ๋กœ Mandelbrot & Taylor (1967)์€ ํ‘œ๋ณธ ์ถ”์ถœ์„ ๊ฑฐ๋ž˜ ๊ฑด์ˆ˜์˜ ํ•จ์ˆ˜๋กœ ์ˆ˜ํ–‰ํ•˜๋ฉด IIDํ•œ ๋ฐ์ดํ„ฐ ํ˜•ํƒœ๋กœ ๋ฐ”๊ฟ€ ์ˆ˜ ์žˆ์Œ์„ ์ฃผ์žฅํ•˜.
  • ์œ„์˜ ์ด๋ก ์„ ๋ฐ”ํƒ•์œผ๋กœ ์˜ˆ๋ฅผ ๋“ค์–ด๋ณด๋ฉด, ๊ฑฐ๋ž˜ ๊ฑด์ˆ˜ 1๋งŒ๊ฐœ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” ์ˆœ๊ฐ„๋งˆ๋‹ค ์ถ”์ถœ์„ ํ•˜์—ฌ Bars๋ฅผ ๊ตฌ์„ฑํ•จ. ๊ทธ๋ ‡๊ฒŒ ๋˜๋ฉด ์˜†์— Timestamp๋Š” ๋น„์ฃผ๊ธฐ์ ์ธ ํ˜•ํƒœ๊ฐ€ ๋จ. Volume Bars๋Š” Time/Tick Bars์— ๋น„ํ•ด ๋ฐ์ดํ„ฐ๊ฐ€ IIDํ™” ๋˜๋ฉฐ ML Model์— ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ƒํƒœ๊ฐ€ ๋จ.

Dollar Bars

  • ์–ด๋–ค ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด์„œ ์ถ”์ถœ์„ ํ•˜๋ฉด IIDํ•œ ํ˜•ํƒœ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Ž. ๋ฌผ๋ก , ํ€€ํŠธ ๋ชจ๋ธ๊ณผ ๊ฐ™์ด ์ด๋ก ์  ๊ทผ๊ฑฐ๊ฐ€ ์žˆ๋Š” ์ถ”์ถœ ๋ฐฉ๋ฒ•์„ ์จ์•ผํ•จ.
  • ๊ทผ๊ฑฐ
    1. ์‹ค์งˆ์ ์œผ๋กœ ์ฃผ์‹์„ ๊ฑฐ๋ž˜ํ•˜๋Š” ์‚ฌ๋žŒ ์ž…์žฅ์—์„  '๊ฑฐ๋ž˜ ๊ฐœ์ˆ˜'๊ฐ€ ์ค‘์š”ํ•œ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ '๊ฑฐ๋ž˜ ๊ธˆ์•ก'์ด๊ธฐ ๋•Œ๋ฌธ์— '๊ฑฐ๋ž˜ ๊ธˆ์•ก' ๊ธฐ๋ฐ˜ ์ถ”์ถœ์ด ๋” ์งˆ ์ข‹์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“ค์–ด ๋ƒ„
    2. ์ฃผ์‹์˜ ๊ฒฝ์šฐ ํšŒ์‚ฌ ์‚ฌ์ •์— ๋”ฐ๋ผ ์•ก๋ฉด๋ถ„ํ• , ๋ฐœํ–‰์ฃผ์‹์ˆ˜ ๋ณ€๊ฒฝ์ด ๋œ๋‹ค. ์ฝ”์ธ์˜ ๊ฒฝ์šฐ๋„ ๋” ์ฐ์–ด๋‚ผ ์ˆ˜ ์žˆ๋‹ค. ํ•ด๋‹น volume์„ constantํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒฝ์šฐ ์™œ๊ณก ๊ฐ€๋Šฅ.
  • ์ด๋Ÿฐ ๊ทผ๊ฑฐ์— ๋”ฐ๋ผ ์‹ค์ œ๋กœ IID๋ฅผ ์ฒดํฌํ•ด๋ณด๋ฉด Volume Bars ๋ณด๋‹ค ์ข‹๊ฒŒ ๋‚˜์˜ด
  • ์ดํ•ด๋ฅผ ๋•๊ธฐ ์œ„ํ•ด ์œ„์˜ ๋‚ด์šฉ์„ ์ฝ”๋“œ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
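import numpy as np
from datetime import timedelta


def get_bar_index(df, mode, unit=None):
    # df: indexed by a 'dates' timestamp, with 'price', 'volume' and 'dollar' columns.
    # Returns the row positions that mark bar boundaries and (threshold, std of bar sizes).
    assert mode in ['time', 'tick', 'volume', 'dollar']
    df0 = df.reset_index().rename(columns={'dates': 'time'})
    # Whole days spanned by the data; the default threshold targets about one bar per day.
    num_days = (df0.time.values[-1] - df0.time.values[0]).astype('timedelta64[D]').astype(int)
    num_days = max(num_days, 1)  # avoid dividing by zero when the data spans under a day
    t, ts = df0['price' if mode == 'tick' else mode], 0
    idx, diff = [], []
    if mode == 'time':
        # Daily time bars: a new bar starts whenever the calendar date changes.
        assert unit in [None, '1d']
        t, m = t.dt.date, '1d'
        idx.append(0)
        for i, (before, after) in enumerate(zip(t.values[:-1], t.values[1:]), 1):
            if after - before >= timedelta(days=1):
                idx.append(i)
        diff.append(0)
    elif mode == 'tick':
        # Tick bars: close a bar after every m ticks.
        m = len(df0) // num_days if unit is None else unit
        for i, _ in enumerate(t):
            ts += 1
            if ts == m:
                ts = 0
                idx.append(i)
        diff.append(0)
    else:
        # Volume / dollar bars: close a bar once the running sum reaches m.
        m = t.values.sum() // num_days if unit is None else unit
        for i, x in enumerate(t):
            ts += x
            if ts >= m:
                idx.append(i)
                diff.append(ts)  # actual bar size, including the overshoot past m
                ts = 0
    return idx, (m, np.std(diff))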
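One way to "check IID-ness" in practice is to form bars and look at the serial correlation and the shape of the return distribution. The following is only a minimal sketch on synthetic random-walk ticks; the helper names, the synthetic data, and the choice of statistics (lag-1 autocorrelation, skewness, excess kurtosis) are assumptions for illustration, not the procedure used by the original author.

```python
import numpy as np
import pandas as pd


def bar_end_positions(values, threshold):
    """Positions where the running sum of `values` first reaches `threshold`."""
    ends, running = [], 0.0
    for i, x in enumerate(values):
        running += x
        if running >= threshold:
            ends.append(i)
            running = 0.0
    return ends


def iid_diagnostics(close: pd.Series) -> dict:
    """Simple diagnostics on log returns of bar closes."""
    rets = np.log(close).diff().dropna()
    return {
        "lag1_autocorr": rets.autocorr(lag=1),  # near 0 suggests little serial dependence
        "skew": rets.skew(),
        "excess_kurtosis": rets.kurt(),         # near 0 suggests near-Gaussian tails
    }


# Synthetic ticks (assumption): geometric random walk with random trade sizes.
rng = np.random.default_rng(0)
n = 5000
price = 100 * np.exp(np.cumsum(rng.normal(0, 1e-3, n)))
volume = rng.integers(1, 100, n).astype(float)
dollar = price * volume

# Target roughly 50 dollar bars over the sample.
ends = bar_end_positions(dollar, threshold=dollar.sum() / 50)
stats = iid_diagnostics(pd.Series(price[ends]))
print(len(ends), stats)
```

Running the same diagnostics on time, tick, volume, and dollar bars built from the same ticks gives a like-for-like comparison; on real data a formal test such as Jarque-Bera could replace the raw moments.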

Information-Driven Bars

  • ์ด์ „์—” ๊ฑฐ๋ž˜๋Ÿ‰, ๊ฑฐ๋ž˜๊ธˆ์•ก์œผ๋กœ ์ถ”์ถœํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ๊ตฌ์„ฑํ•˜์˜€๋‹ค๋ฉด ์ด๋ฒˆ์—” ์ •๋ณด๋ฅผ ์ข€ ๋” ์ž…์ฒด์ ์œผ๋กœ ํŒŒ์•…ํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ํ‘œ๋ณธ์„ ์ถ”์ถœํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์„ธ๊ณ„๊ด€์„ ํ™•์žฅ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค.
  • ์ •๋ณด์˜ ์–‘์„ ์–ด๋–ป๊ฒŒ ์ •๋Ÿ‰ํ™” ์‹œํ‚ฌ ๊ฒƒ์ธ์ง€๊ฐ€ ํ•ต์‹ฌ์ด๊ณ , ์ด์ „์—” ๊ฑฐ๋ž˜๋Ÿ‰,๊ฑฐ๋ž˜๊ธˆ์•ก์ด Constant์˜€๋‹ค๋ฉด ์กฐ๊ธˆ ๋” Dynamicํ•˜๊ฒŒ ๋ฐ”๋€Œ์—ˆ๋‹ค๊ณ  ์ดํ•ดํ•˜๋ฉด ๋จ.
  • ๋Œ€ํ‘œ์ ์œผ๋กœ Tick/Volume/Dollar Imbalance Bars(TIB,VIB,DIB)์™€ Tick/Volume/Dollar Runs Bars (TRB,VRB,DRB)๊ฐ€ ์กด์žฌํ•จ.
  • Imbalance Bar์™€ Runs Bar์˜ ์ฐจ์ด๋Š” ๋Œ€๊ทœ๋ชจ ๊ฑฐ๋ž˜์ž์˜ ํ”์ ์„ ๋‚จ๊ธด ๋งค์ˆ˜ ์ฃผ๋ฌธ(Iceberg orders)์— ๋Œ€ํ•œ ์•„์ด๋””์–ด๋ฅผ ์ถ”๊ฐ€๋กœ ํฌํ•จํ–ˆ๋А๋ƒ ์—ฌ๋ถ€์ด๋‹ค.

Tick Imbalance Bars (TIB)

  [Equations 1-6: images 2 3 2 1_1 through 2 3 2 1_9]
  • tick์˜ sequence๊ฐ€ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •, p_t ๋Š” tick t(์‹œ๊ฐ„)์— ์—ฐ๊ณ„๋œ ๊ฐ€๊ฒฉ, v_t ๋Š” tick t(์‹œ๊ฐ„)์— ์—ฐ๊ณ„๋œ ๊ฑฐ๋ž˜๋Ÿ‰
  • b_t๋Š” p_t & v_t๋ฅผ ์ด์šฉํ•ด ์ •์˜ํ•œ ๊ทœ์น™. ์ผ์ข…์˜ ํ‹ฑ ๋‹จ์œ„์˜ ๋ณ€ํ™”๋Ÿ‰(์ถ”์„ธ)๋ฅผ ํ‘œํ˜„ํ•œ ๊ฒƒ์ด๊ณ , b_t๊ฐ€ ์ผ์ •ํ•œ ๋ฐฉํ–ฅ์œผ๋กœ ๊ฐ€๋Š” ์ผ€์ด์Šค๊ฐ€ ๋‚˜์˜ค๋ฉด ์ด๋•Œ tick์˜ imbalanceํ•œ ํ˜„์ƒ์ด ๋‚˜์˜จ ๊ฒƒ์ด๊ณ  ๊ทธ๋•Œ๋ฅผ ์ถ”์ถœํ•œ๋‹ค๋Š” ๊ฒƒ์ด ํ•ต์‹ฌ์ด๋‹ค. ์ด๋ฅผ ์œ„ํ•ด theta_t๋ฅผ ์ •์˜.
  • theta_t๋ฅผ ์—ฐ์†์ (ํ™•๋ฅ ์ )์œผ๋กœ ๋ฐ”๊พผ ์‹์„ ํ†ตํ•ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๊ณ , P(b_t=1) + P(b_t=-1) = 1 ์ด๋ผ๋Š” ์ œํ•œ ์กฐ๊ฑด์œผ๋กœ ์‹์„ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Œ
  • E0[T] ๋Š” previous bars์˜ ์ง€์ˆ˜๊ฐ€์ค‘์ด๋™ํ‰๊ท (ewma)๋กœ ๊ณ„์‚ฐ ๊ฐ€๋Šฅํ•˜๊ณ , 2P[b_t=1]-1 ์€ previous bars์˜ ์ง€์ˆ˜๊ฐ€์ค‘์ด๋™ํ‰๊ท (ewma)๋กœ ๊ณ„์‚ฐ ๊ฐ€๋Šฅํ•˜๋‹ค.
  • TIB๋Š” ๋‹ค์Œ ์กฐ๊ฑด์„ ๋งŒ์กฑํ•˜๋Š” ๋ถ€๋ถ„ ์ง‘ํ•ฉ์„ T*๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜ ๊ฐ€๋Šฅํ•˜๋‹ค.

Volume/Dollar Imbalance Bars

  [Equations 1-6: images 2 3 2 2_1 through 2 3 2 2_6]
  • VIB, DIB๋Š” TIB์˜ ์„ธ๊ณ„๊ด€์„ ์กฐ๊ธˆ ๋” ํ™•์žฅ์‹œํ‚จ ๊ฒƒ์ด๋‹ค. p_t๋ฅผ ํ†ตํ•ด์„œ๋งŒ information์˜ imbalance๋ฅผ ์ฐพ์•„๋‚ผ ๊ฒƒ์ด๋ƒ v_t๋ž€ ์ •๋ณด๋ฅผ ์ถ”๊ฐ€๋กœ ํ™œ์šฉํ•ด์„œ ์ฐพ์•„๋‚ผ ๊ฒƒ์ด๋ƒ์˜ ๋ฌธ์ œ์ž„.
  • v_t๋ฅผ ์ถ”๊ฐ€ํ•˜๋Š”๊ฒŒ ์™œ ์ข‹์„์ง€๋Š” Researcher์˜ ๋…ผ๋ฆฌ๋ฅผ ํ†ตํ•ด์„œ ์ด๋ก ์  ๊ทผ๊ฑฐ๋ฅผ ํ™•๋ณดํ•ด์•ผ ํ•œ๋‹ค. ์ด ๊ทผ๊ฑฐ๋Š” volume/dollar bars๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๊ฒฝ์šฐ์™€ ๋™์ผํ•จ.
  • theat_t๋ฅผ p_t * v_t๋กœ ํ™•์žฅํ•˜์˜€๊ณ , ์ด์— ๋”ฐ๋ผ ์‹์ด ์ผ๋ถ€ ๋ณ€๊ฒฝ๋˜์—ˆ๋‹ค. ๊ทผ๋ณธ์ ์ธ ๋…ผ๋ฆฌ๋Š” TIB์™€ ๋™์ผํ•˜๋‹ค.
  • ์œ„ ์ˆ˜์‹์˜ ๋…ผ๋ฆฌ๋ฅผ ์ข€ ๋” ๋ช…์ง•ํ•˜๊ฒŒ ํ‘œํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ์•„๋ž˜ pseudo code๋ฅผ ์ฒจ๋ถ€ํ•˜์˜€๋‹ค.
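import pandas as pd


def ewma(values, window):
    # Last value of an exponentially weighted mean over the trailing `window` items.
    window = max(int(window), 1)
    return pd.Series(values[-window:]).ewm(span=window).mean().iloc[-1]


def volume_imbalance_bars(data, expected_num_ticks_init=100000, num_prev_bars=3):
    # data: DataFrame with one row per tick and 'price' and 'volume' columns.
    expected_num_ticks = expected_num_ticks_init
    expected_imbalance = 0.0
    cum_theta, num_ticks = 0.0, 0
    prev_price, tick_rule = None, 0
    imbalance_array, imbalance_bars, bar_length_array = [], [], []
    prices, bar_volume = [], 0.0
    for price, volume in zip(data['price'].to_numpy(), data['volume'].to_numpy()):
        # Track open/high/low/close and volume for the bar being built.
        num_ticks += 1
        prices.append(price)
        bar_volume += volume
        # Tick rule: sign of the price change, carried forward when the price is unchanged.
        if prev_price is not None and price != prev_price:
            tick_rule = 1 if price > prev_price else -1
        prev_price = price
        volume_imbalance = tick_rule * volume
        imbalance_array.append(volume_imbalance)
        cum_theta += volume_imbalance
        if len(imbalance_bars) == 0:
            if len(imbalance_array) < expected_num_ticks_init:
                continue  # warm-up: not enough ticks yet to estimate the expected imbalance
            expected_imbalance = ewma(imbalance_array, window=expected_num_ticks_init)

        if abs(cum_theta) >= expected_num_ticks * abs(expected_imbalance):
            imbalance_bars.append({'open': prices[0], 'high': max(prices), 'low': min(prices),
                                   'close': prices[-1], 'volume': bar_volume, 'ticks': num_ticks})
            bar_length_array.append(num_ticks)
            cum_theta, num_ticks, prices, bar_volume = 0.0, 0, [], 0.0
            # Update both expectations from the most recent bars.
            expected_num_ticks = ewma(bar_length_array, window=num_prev_bars)
            expected_imbalance = ewma(imbalance_array, window=num_prev_bars * expected_num_ticks)
    return pd.DataFrame(imbalance_bars)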

Tick Runs Bars

  [Equations 1-3: images 2 3 2 3_1 through 2 3 2 3_3]
  • TIB์™€ ๊ทผ๋ณธ์ ์ธ ์›๋ฆฌ๋Š” ๋™์ผํ•˜๋‹ค. ๋Œ€๋Ÿ‰ ๊ฑฐ๋ž˜์ž๋“ค์˜ ๋ถ„ํ•  ์ฃผ๋ฌธ์˜ ํ”์ ์„ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ theta_t์˜ ๋ชจ๋ธ๋ง์„ ๋‹ค๋ฅด๊ฒŒ ํ•œ ๊ฒƒ์ด ์ฃผ์š”ํ•œ ์ฐจ์ด์ ์ด๋‹ค. ์ด๋ฅผ ์ˆ˜์‹์ ์œผ๋ก  max๋ฅผ ํ™œ์šฉํ•˜์—ฌ b_t=1,-1 ๊ฐ๊ฐ์˜ ๊ฒฝ์šฐ๋ฅผ ๋‚˜๋ˆ ์„œ ๋ณธ๋‹ค๋Š” ์ ์ด ์ด ๋ถ€๋ถ„์„ ๋ฐ˜์˜ํ•œ ๊ฒƒ์ด๋‹ค.
  • theta_t์˜ ๋ชจ๋ธ๋ง ์™ธ์— ๋‹ค๋ฅธ ๊ฒƒ๋“ค์˜ ์›๋ฆฌ๋Š” ๋™์ผ.
  • ์ฆ‰, T*๋ฅผ ๋ณด๋ฉด Iceberg orders๊ฐ€ ๋งŽ์œผ๋ฉด ๋‚ฎ์€ T๊ฐ’์—์„œ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์ถ”์ถœ๋  ๊ฒƒ์ด๊ณ  ์ฃผ์š”ํ•œ ๋ถ€๋ถ„์—์„œ ML model์ด ํ•™์Šตํ•  ๊ธฐํšŒ๋ฅผ ์ œ๊ณตํ•˜๋Š” ์ข‹์€ ๋ฐ์ดํ„ฐ๊ฐ€ ๋˜๋Š” ์ฒ˜๋ฆฌ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Œ.
  • ์—ฌ๊ธฐ์— ๋”ํ•ด Iceberg orders๊ฐ€ ์ ์œผ๋ฉด TIB์— ๋น„ํ•ด sequence๊ฐ€ sparseํ•˜๊ฒŒ ๋  ๊ฐ€๋Šฅ์„ฑ์ด ์กด์žฌํ•จ.

Volume/Dollar Runs Bars

  [Equations 1-3: images 2 3 2 4_1 through 2 3 2 4_3]
  • VIB/DIB์™€์˜ ํ•ต์‹ฌ ์ฐจ์ด๋Š” ์œ„์—์„œ ์„œ์ˆ ํ•œ TIB vs TRB์™€ ๋™์ผํ•จ.
  • TRB์™€ VRB/DRB์˜ ์ฐจ์ด๋Š” theta_t์˜ ๋ชจ๋ธ๋ง์— v_t๋ฅผ ์ถ”๊ฐ€ํ–ˆ๋‹ค๋Š” ์ ์ด๋ฉฐ ์ด๋ฅผ ์ถ”๊ฐ€ํ•œ ๊ทผ๊ฑฐ๋Š” TIB์—์„œ VIB/DIB๋กœ ํ™•์žฅํ•˜๋Š” ๊ทผ๊ฑฐ์™€ ๋™์ผํ•จ

Sampling Features

  • ์ฒ˜์Œ ์‹œ์ž‘ํ–ˆ๋˜ Financial raw data -> Bars(Time/Volume/Dollar Bars -> TIB/DIB/VIB -> TRB/DRB/VRB) ๊นŒ์ง€ ํ™•์žฅํ–ˆ๋‹ค.
  • ์ข‹์€ Bars๋ฅผ ๋งŒ๋“œ๋Š” ๋ชฉ์ ์€ ๋ฐ์ดํ„ฐ๋ฅผ ์‹œ๊ทธ๋„์„ ์ตœ๋Œ€ํ•œ ์œ ์ง€ํ•˜๋ฉด์„œ IIDํ•˜๊ฒŒ ๋งŒ๋“ค๊ธฐ ์œ„ํ•จ์ด๊ณ , ์ด๋Š” ML model์˜ ํ•™์Šต์„ ๊ทน๋Œ€ํ™” ์‹œํ‚ค๋Š” ๊ฒƒ์— ์žˆ๋‹ค.
  • ์ด๋Ÿฐ Bars๋ฅผ ๋งŒ๋“  ์ƒํƒœ์—์„œ ์ถ”๊ฐ€๋กœ ํ•œ ๋ฒˆ ๋” Filtering(Sampling Features)์„ ๊ฑฐ์ณ์•ผ ํ•˜๋Š” ์ด์œ ๊ฐ€ ์žˆ๋‹ค.
    1. ๋ช‡๋ช‡ ML ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํ‘œ๋ณธ ํฌ๊ธฐ์— ํ•™์Šต ์†๋„๊ฐ€ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค. (ex. SVM)
    2. ML ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ์‚ฌ๋žŒ์ด ํ•™์Šตํ•˜๋Š” ๊ณผ์ •์„ ์ƒ๊ฐํ•ด๋ณด๋ฉด ๋ชจ๋“  ๋ฌธ์ œ๋ฅผ ๋‹ค ํ•™์Šตํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค๋Š” ์‹œํ—˜์— ์ž์ฃผ ๋‚˜์˜ค๊ณ  ์ค‘์š”ํ•œ ๋ฌธ์ œ๋ฅผ ํ•™์Šตํ•˜๋Š”๊ฒŒ ์ค‘์š”ํ•˜๋‹ค. ์ด๋ฅผ ์œ„ํ•ด 'ํ•™์Šต ์œ ๊ด€ ๋ฐ์ดํ„ฐ'๋ฅผ ์„ ๋ณ„ํ•ด์ฃผ๋Š” ๊ฒƒ์ด ์ •๊ตํ•œ ์˜ˆ์ธก์„ ํ•˜๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด ๋‚ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์ง„๋‹ค.
  • Down Sampling ๊ธฐ๋ฒ• 2๊ฐ€์ง€, Sampling for Reduction & Event-Based Sampling์„ ์†Œ๊ฐœํ•œ๋‹ค. ํ›„์ž๊ฐ€ ํ•ต์‹ฌ.

Sampling for Reduction

  • ์•„์ฃผ ๊ฐ„๋‹จํ•œ ๊ธฐ๋ฒ•์œผ๋กœ ์ •๋ง ์†๋„๋งŒ์„ ์œ„ํ•ด ํ•˜๋Š” ๋ฐฉ์‹์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” B2C ์„œ๋น„์Šค ๋“ฑ ์†๋„๊ฐ€ ์ค‘์š”ํ•˜๊ณ  ์„ฑ๋Šฅ์ด ์กฐ๊ธˆ์€ ๋‚ฎ์•„๋„ ๋˜๋Š” ๊ฒฝ์šฐ์— ๊ฐ€์„ฑ๋น„ ์ข‹๊ฒŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.
    1. Linespace Sampling
      • ์ˆœ์ฐจ์ ์œผ๋กœ ํ‘œ๋ณธ์„ ์ถ”์ถœํ•œ๋‹ค. ๋‹ค๋งŒ, ์ดˆ๊ธฐ๊ฐ’์œผ๋กœ ์–ด๋–ค ๊ฐ„๊ฒฉ(seed)์— ์˜ํ•ด ํ•  ๊ฒƒ์ธ์ง€์— ๋ฏผ๊ฐํ•˜๋‹ค๋Š” ๋‹จ์ ์ด ์กด์žฌํ•จ.
    2. Uniform Sampling
      • Linespace Sampling์„ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ „์ฒด Bar์—์„œ ๊ท ์ผํ•˜๊ฒŒ ํ‘œ๋ณธ์„ Sampling ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ตœ์ข… bar์˜ ๊ฐœ์ˆ˜๊ฐ€ ์ผ์ •ํ•ด์ง„๋‹ค.

Event-Based Sampling

  • ML Model์˜ ํ•ต์‹ฌ์€ ๋˜‘๋˜‘ํ•œ ํˆฌ์ž์ž๋ฅผ ์—ฌ๋Ÿฌ ๋ช… ๋งŒ๋“ค์–ด ๊ทธ๊ฒƒ์„ nettingํ•˜์—ฌ Meta Strategy๋ฅผ ๋งŒ๋“œ๋Š” ๊ณผ์ •์˜ ๋งŽ์€ ๋ถ€๋ถ„์„ ์ž๋™ํ™” ํ•˜๋Š” ๊ฒƒ์— ๊ฐ•์ ์ด ์žˆ๋‹ค.
  • ๋˜‘๋˜‘ํ•œ ํˆฌ์ž์ž ํ•œ ๋ช…์˜ ๊ด€์ ์—์„œ ํˆฌ์ž ๋ฐฉ์‹์„ ๊ณ ๋ฏผํ•ด๋ณด๋ฉด ๋งŽ์€ ๊ธฐํšŒ๋Š” ๋ณ€๋™์„ฑ์ด ํ„ฐ์ง€๋Š” ์ˆœ๊ฐ„์— ๋ฐœ์ƒํ•œ๋‹ค๋Š” ์ ์„ ๋ชฉ๊ฒฉํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ ๊ตฌ๊ฐ„์—์„œ ๋งŽ์€ Alpha๊ฐ€ ์กด์žฌํ•œ๋‹ค.
  • ๋‹ค๋ฅธ ์—ฌ๋Ÿฌ ๊ตฌ๊ฐ„์—์„œ ์ž˜ ๋งž์ถฐ๋„ Alpha๊ฐ€ ๋งŽ์ด ์กด์žฌํ•˜๋Š” ๊ตฌ๊ฐ„์—์„œ ์ž˜ ๋งž์ถ”์ง€ ๋ชปํ•œ๋‹ค๋ฉด ์ด๋Š” ์ข‹์€ ๋ชจ๋ธ์ด๋ผ ํ•  ์ˆ˜ ์—†์„ ๊ฒƒ์ด๋‹ค.
  • ํ•ด๋‹น ๊ตฌ๊ฐ„์— ์ข€ ๋” ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ์–ด์„œ ํ•™์Šตํ•˜๊ฒŒ ํ•ด์ฃผ๋Š” ๊ฑธ ๋„์™€์ฃผ๋Š” ๋ฐฉ์‹์ด Event-Based Sampling์ด๋ผ๊ณ  ๋ณด๋ฉด ๋œ๋‹ค.
  • ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•์ด ์กด์žฌํ•˜์ง€๋งŒ, ์‚ฐ์—…๊ณตํ•™์—์„œ ํ’ˆ์งˆ ๊ด€๋ฆฌ๋ฅผ ์œ„ํ•ด ์‚ฌ์šฉํ•˜๋А CUSUM filter์— ๋Œ€ํ•ด ์†Œ๊ฐœํ•œ๋‹ค.

The CUSUM Filter

  [Equations 1-4: images 2 5 2 1_1 through 2 5 2 1_4]
  • CUSUM Filter์˜ ํ•ต์‹ฌ์€ ์ธก์ • ๊ฐ’์ด ๋ชฉํ‘œ ๊ฐ’์˜ ํ‰๊ท ์œผ๋กœ๋ถ€ํ„ฐ ์–ผ๋งˆ๋‚˜ ๋ฒ—์–ด๋‚ฌ๋Š”์ง€๋ฅผ ์ฐพ๋„๋ก ์„ค๊ณ„๋˜์–ด ์žˆ๋‹ค.
  • IIDํ•œ ๊ด€์ธก๊ฐ’ y_t๋ฅผ ์ •์˜ํ•˜๊ณ , ๋‹ค์Œ y_t์˜ ๋ˆ„์  ํ•ฉ๊ณ„๋ฅผ S_t๋กœ ์ •์˜ํ•˜์ž. E_(t-1)[y_t]๋Š” previous t๊นŒ์ง€์˜ ์ง€์ˆ˜์ด๋™ํ‰๊ท  ๋˜๋Š” y_(t-1)๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.
  • S_t์˜ ๊ฐ’์„ ํ†ตํ•ด ํ˜„์žฌ ๊ฐ’์ด ํ‰๊ท ์œผ๋กœ ์–ผ๋งˆ๋‚˜ ๋ฒ—์–ด๋‚ฌ๋Š”์ง€ ์ธก์ •ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋•Œ, ์ž„๊ณ„๊ฐ’ h๋ฅผ ์„ค์ •ํ•œ๋‹ค. h๋ฅผ ๋„˜์–ด๊ฐ€๋Š” ์ˆœ๊ฐ„๋งˆ๋‹ค Sampling ํ•  ์ˆ˜ ์žˆ๋‹ค.
  • 4.์˜ ์‹์—์„  ์ƒํ–ฅ/ํ•˜ํ–ฅ์˜ ๊ฐœ๋…์œผ๋กœ ํ™•์žฅํ•ด์„œ ์ƒํ–ฅ/ํ•˜ํ–ฅ S_t์™€ ์ƒํ–ฅ/ํ•˜ํ–ฅ h๋ฅผ ์„ค์ •ํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋“ค์˜ ์ ˆ๋Œ€๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ์ƒ˜ํ”Œ๋งํ•˜์—ฌ ์กฐ๊ธˆ ๋” ์ •๊ตํ•˜๊ฒŒ ์ถ”์ถœ์ด ๊ฐ€๋Šฅํ•จ.
  • ์ž„๊ณ„๊ฐ’์„ ํ˜„์žฌ๋Š” constant๋กœ ๊ฐ€์ •ํ•˜์ง€๋งŒ ์—ฌ๋Ÿฌ ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ ํ™•์žฅํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ก ์ ์œผ๋กœ Structural Breaks, Entropy Features, Microstructural Features๋“ฑ ๋‹ค์–‘ํ•˜๊ฒŒ ํ™•์žฅ ๊ฐ€๋Šฅํ•˜๋‹ค.
  • ์ƒํ–ฅ/ํ•˜ํ–ฅ ๊ฐœ๋…์œผ๋กœ ํ™•์žฅ๋œ CUSUM Filter๋ฅผ ์ฝ”๋“œ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
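def getTEvents(gRaw, upper, lower=None):
    # gRaw: pandas Series of the raw series to filter (e.g. prices).
    # upper / lower: thresholds for upward and downward cumulative moves.
    if lower is None:
        lower = - upper
    assert (upper >=0 and lower <= 0)
    tEvents, sPos, sNeg = [], 0, 0
    diff = gRaw.diff()
    # diff.values[0] is NaN, so start at position 1 to keep i aligned with gRaw.index.
    for i, change in enumerate(diff.values[1:], start=1):
        # Upward sum is floored at 0 and downward sum is capped at 0.
        sPos, sNeg = max(0, sPos + change), min(0, sNeg + change)

        if sNeg < lower:
            sNeg = 0
            tEvents.append(i)

        if sPos > upper:
            sPos = 0
            tEvents.append(i)

    return gRaw.index[tEvents]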
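For reference, the recursion that the function above follows can be written as below. This is the standard symmetric CUSUM formulation, reconstructed from the prose; it is an assumption that the lost equation images matched it term for term.

```latex
% One-sided CUSUM with boundary condition S_0 = 0
S_t = \max\bigl\{0,\; S_{t-1} + y_t - E_{t-1}[y_t]\bigr\}, \qquad S_0 = 0

% Sample at the first t for which the threshold is reached
S_t \ge h

% Symmetric extension with separate upward and downward sums
S_t^{+} = \max\bigl\{0,\; S_{t-1}^{+} + y_t - E_{t-1}[y_t]\bigr\}, \qquad S_0^{+} = 0
S_t^{-} = \min\bigl\{0,\; S_{t-1}^{-} + y_t - E_{t-1}[y_t]\bigr\}, \qquad S_0^{-} = 0
S_t = \max\bigl\{S_t^{+},\; -S_t^{-}\bigr\}
```

With E_(t-1)[y_t] = y_(t-1), the increment y_t - E_(t-1)[y_t] is just the first difference of the series, which is what `gRaw.diff()` computes. `sPos` and `sNeg` play the roles of the upward and downward sums, and `upper` and `lower` are the thresholds h and -h.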