Google BERT - yarak001/machine_learning_common GitHub Wiki

Roadmap: getting started with BERT (the transformer, understanding BERT, using BERT) -> BERT derivative models (variants with different architectures and training schemes & variants based on knowledge distillation) -> applying BERT (text summarization, other languages, Sentence-BERT, domain-BERT, VideoBERT, BART)

1. Transformer

  • RNNs and LSTMs are widely used for sequential tasks such as next-word prediction, machine translation, and text generation
  • They suffer from the long-term dependency problem
  • To overcome the limitations of RNNs, the paper "Attention Is All You Need" proposed the transformer
  • The transformer drops the recurrence used in RNNs and relies purely on attention; it uses a special form of attention called self-attention

2. Understanding BERT

  • BERT (Bidirectional Encoder Representations from Transformers) is an embedding model published by Google
  • BERT is a context-based embedding model (whereas Word2Vec is a context-free embedding model)
    • Sentence A: He got bit by Python
    • Sentence B: Python is my favorite programming language
    • Because Word2Vec is a context-free model, it always produces the same embedding for the word 'python' regardless of context
    • Because BERT is a context-based model, it first understands the context of the sentence and then generates each word's embedding according to that context
    • Relationship between 'Python' and every other word in each sentence
  • BERT is based on the transformer model; unlike the transformer, which has both an encoder and a decoder, BERT uses only the encoder
  • The encoder uses multi-head attention to understand the context of each word in the sentence and returns a contextual representation of each word as output
    • Output: a representation of each word of sentence A fed into BERT
  • BERT comes in several configurations, such as BERT-base and BERT-large
  • BERT pre-training
    • Because the model has already been trained on a large dataset, instead of training a new model from scratch for a new task, the common approach is to take the pre-trained model and adjust its weights for the new task (fine-tuning)
    • BERT is pre-trained with MLM (Masked Language Modeling) and NSP (Next Sentence Prediction)
  • BERT input
    • Sentence A: Paris is a beautiful city
    • Sentence B: I love Paris
    • The final input representation is the sum of three embeddings:
      • Token embedding
        • The [CLS] token is added at the beginning of the first sentence and is used for classification tasks; a [SEP] token is added at the end of every sentence and marks sentence boundaries
        • The token embedding parameters are learned as pre-training progresses
      • Segment embedding
        • Used to distinguish the two sentences
      • Position embedding
        • Because the transformer uses no recurrence mechanism and processes all words in parallel, information about word order must be supplied; position embeddings provide it
    • WordPiece tokenizer
      • The subword tokenizer used by BERT
      • Effective at handling OOV (out-of-vocabulary) words, as the sketch below shows
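A minimal sketch of WordPiece tokenization with the Hugging Face transformers library (the bert-base-uncased checkpoint is assumed); a word missing from the vocabulary is broken into known subwords instead of falling back to [UNK]:

```python
from transformers import BertTokenizer

# Load the WordPiece vocabulary shipped with bert-base-uncased
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A rare word is split into subwords marked with the ## continuation prefix
print(tokenizer.tokenize("He got bit by pythonista"))
# e.g. ['he', 'got', 'bit', 'by', 'python', '##ista']
```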
  • BERT pre-training strategies
    • MLM (Masked Language Modeling)
      • Also called the cloze task (fill in the blanks)
      • Language modeling: in general, given an arbitrary sentence, the model is trained to read the words in order and predict the next word
        • ์ž๋™ ํšŒ๊ท€ ์–ธ์–ด ๋ชจ๋ธ๋ง(Auto-regressive Language Modeling)
          • ์ „๋ฐฉ ์˜ˆ์ธก(Forward(left to right) prediction)
          • ํ›„๋ฐฉ ์˜ˆ์ธก(Backwrad(right to left) prediction)
        • ์ž๋™ ์ธ์ฝ”๋”ฉ ์–ธ์–ด ๋ชจ๋ธ๋ง(Auto-Encoding Language Modeling)
          • ์–‘๋ฐฉํ–ฅ ์—์ธก์„ ๋ชจ๋‘ ํ™œ์šฉํ•˜์—ฌ ๋ฌธ์žฅ ์ดํ•ด ์ธก๋ฉด์—์„œ ๋” ๋ช…ํ™•ํ•ด์ง€๋ฏ€๋กœ ๋” ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณต
      • 15% of all words in the given sentence are masked at random, and the model is trained to predict the masked words
        • (Because the [MASK] token is used only during pre-training and never appears in fine-tuning inputs,) a mismatch arises between pre-training and fine-tuning. To mitigate this, the 80-10-10 rule is applied (see the sketch below)
          • 80% of the selected 15% of tokens (actual words) are replaced with the [MASK] token
          • 10% of the selected 15% are replaced with a random token (a random word)
          • The remaining 10% of the selected 15% are left unchanged
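A minimal sketch of the 80-10-10 rule over a list of token IDs (mask_id and vocab_size are assumed parameters; -100 follows the common convention for positions ignored by the loss):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Apply BERT's 80-10-10 masking rule to a list of token IDs."""
    labels = [-100] * len(token_ids)      # -100 = position ignored by the loss
    masked = list(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:   # select 15% of the tokens
            labels[i] = tok               # the model must predict the original
            r = random.random()
            if r < 0.8:                   # 80%: replace with [MASK]
                masked[i] = mask_id
            elif r < 0.9:                 # 10%: replace with a random token
                masked[i] = random.randrange(vocab_size)
            # remaining 10%: keep the token unchanged
    return masked, labels
```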
      • mask๋œ token ์˜ˆ์ธก
        • Mask๋œ token์„ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด BERT์—์„œ ๋ฐ˜ํš๋œ mask๋œ Token R[mask]์˜ ํ‘œํ˜„์„ softmax ํ™œ์„ฑํ™”๋ฅผ ํ†ตํ•ด FFNN์— ์ „๋‹ฌ ํ›„ ํ™•๋ฅ  ์ถœ๋ ฅ
    • NSP (Next Sentence Prediction)
      • A binary classification task: two sentences are fed into BERT, and the model predicts whether the second sentence follows the first
      • It teaches the model the relationship between two sentences, which is useful for downstream tasks such as question answering and similar-sentence detection
      • The prediction is made from the [CLS] token: since it holds an aggregate representation of all tokens, it carries a representation of the whole input
  • Pre-training procedure
    • BERT pre-training uses the Toronto BookCorpus and Wikipedia datasets
    1. Sample two sentences from the corpus (for 50% of the pairs, sentence B is the actual successor of sentence A; for the other 50%, sentence B is not the successor of sentence A)
    • Sentence A: We enjoyed the game
    • Sentence B: Turn the radio on
    2. Tokenize the sentences with the WordPiece tokenizer; add the [CLS] token at the start of the first sentence and a [SEP] token at the end of every sentence
    • tokens = [ [CLS], we, enjoyed, the, game, [SEP], turn, the, radio, on, [SEP] ]
    3. Randomly mask 15% of the tokens following the 80-10-10 rule
    • tokens = [ [CLS], we, enjoyed, the, [MASK], [SEP], turn, the, radio, on, [SEP] ]
    4. Feed the tokens into the BERT model
    5. Train the model to predict the masked tokens and, at the same time, to classify whether sentence B follows sentence A; that is, BERT is trained with the MLM and NSP tasks simultaneously
  • Subword tokenization algorithms
    • Byte pair encoding (BPE)
    • Byte-level byte pair encoding (BBPE)
    • WordPiece

3. Using BERT

  • ์‚ฌ์ „ ํ•™์Šต๋œ BERT model ํƒ์ƒ‰
    • BERT๋ฅผ ์ฒ˜์Œ๋ถ€ํ„ฐ ์‚ฌ์ „ ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์€ ๊ณ„์‚ฐ ๋น„์šฉ์ด ๋งŽ์ด ๋“ค๋ฏ€๋กœ, ์‚ฌ์ „ ํ•™์Šต๋œ ๊ณต๊ฐœ BERT model์„ downloadํ•ด์„œ ์‚ฌ์šฉํ•˜๋Š”๊ฒŒ ํšจ๊ณผ์ 
    • Available configurations (L = number of encoder layers, H = hidden size):

| | H=128 | H=256 | H=512 | H=768 |
|------|------|------|------|------|
| L=2 | 2/128 (BERT-Tiny) | 2/256 | 2/512 | 2/768 |
| L=4 | 4/128 | 4/256 (BERT-Mini) | 4/512 (BERT-Small) | 4/768 |
| L=6 | 6/128 | 6/256 | 6/512 | 6/768 |
| L=8 | 8/128 | 8/256 | 8/512 (BERT-Medium) | 8/768 |
| L=10 | 10/128 | 10/256 | 10/512 | 10/768 |
| L=12 | 12/128 | 12/256 | 12/512 | 12/768 (BERT-Base) |
  • Ways to use a pre-trained model
    • Extract embeddings and use BERT as a feature extractor
    • Fine-tune the pre-trained BERT model for downstream tasks such as text classification or question answering
  • ์‚ฌ์ „ ํ•™์Šต๋œ BERT์—์„œ Embedding์„ ์ถ”์ถœํ•˜๋Š” ๋ฐฉ๋ฒ•
    • ๋ฌธ์žฅ: I love Paris
    1. WordPiece tokenizer๋ฅผ ์‚ฌ์šฉํ•ด ๋ฌธ์žฅ์„ tokenํ™”
    • tokens = [I, love, Paris]
    1. token list ์‹œ์ž‘ ๋ถ€๋ถ„์— [CLS], ๋์— [SEP] token ์ถ”๊ฐ€
    • tokens = [ [CLS], I, love, Paris, [SEP] ]
    1. ๋™์ผ ๊ธธ์ด๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•ด์„œ [PAD] token ์ถ”๊ฐ€(๊ธธ์ด๊ฐ€ 7์ด๋ผ ๊ฐ€์ •)
    • tokens = [ [CLS], I, love, Paris, [SEP], [PAD], [PAD] ]
    1. [PAD] token ๊ธธ์ด๋ฅผ ๋งž์ถ”๊ธฐ ์œ„ํ•œ token์ด๋ฉฐ ์‹ค์ œ token์˜ ์ผ๋ถ€๊ฐ€ ์•„๋‹ˆ๋ž€ ๊ฒƒ์„ model์—๊ฒŒ ์ดํ•ด์‹œํ‚ค๊ธฐ ์œ„ํ•ด attention mask ์ƒ์„ฑ
    • attention_mask = [1,1,1,1,1,0,0 ]
    1. ๋ชจ๋“  token์„ ๊ณ ์œ  token ID๋กœ mapping
    • token_ids = [101, 1045, 2293, 3000, 102, 0, 0 ]
    1. ์‚ฌ์ „ ํ•™์Šต๋œ BERT model์— ๋Œ€ํ•œ ์ž…๋ ฅ์œผ๋กœ attention mask์™€ token_ids๋ฅผ ๊ณต๊ธ‰ํ•˜๊ณ  ๊ฐ token์— ๋Œ€ํ•œ embedding์„ ์–ป์Œ
    • ์ „์ฒด ๋ฌธ์žฅ์˜ ํ‘œํ˜„์€ [CLS] token์— ๋ณด์œ ํ•จ. [CLS] token ํ‘œํ˜„์„ ๋ฌธ์žฅ ํ‘œํ˜„์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ํ•ญ์ƒ ์ข‹์€ ์ƒ๊ฐ์€ ์•„๋‹˜. ๋ฌธ์žฅ์˜ ํ‘œํ˜„์„ ์–ป๋Š” ํšจ์ธŒ์ ์ธ ๋ฐฉ๋ฒ•์€ ๋ชจ๋“  token์˜ ํ‘œํ˜„์„ ํ‰๊ท ํ™”ํ•˜๊ฑฐ๋‚˜ poolingํ•˜๋Š” ๊ฒƒ์ž„
  • Fine-tuning BERT for downstream tasks
    • Text classification
      • When fine-tuning a pre-trained BERT model, its weights are updated together with the classifier; when the pre-trained BERT model is used as a feature extractor, only the classifier's weights are updated
      • During fine-tuning, the weights can be adjusted in one of two ways (see the sketch below):
        • Update the weights of the pre-trained BERT model together with the classification layer
        • Update only the weights of the classification layer and leave the pre-trained BERT model untouched; this is equivalent to using pre-trained BERT as a feature extractor
    • ์ž์—ฐ์–ด ์ถ”๋ก (NLI)
      • ์ž์—ฐ์–ด ์ถ”๋ก (NLI, Natural Language Inference)์€ model์ด ๊ฐ€์ •์ด ์ฃผ์–ด์ง„ ์ „์ €์— ๋Œ€ํ•ด์„œ ์ฐธ์ธ์ง€ ๊ฑฐ์ง“์ธ์ง€ ์ค‘๋ฆฝ์ธ์ง€ ์—ฌ๋ถ€๋ฅผ ๋ถ„๋ฅ˜ํ•˜๋Š” task
    • ์งˆ๋ฌธ-์‘๋‹ต
      • ์งˆ๋ฌธ์— ๋Œ€ํ•œ ์‘๋‹ต์ด ํฌํ•จ๋œ ๋‹จ๋ฝ๊ณผ ํ•จ๊ป˜ ์งˆ๋ฌธ์ด ์ œ๊ณต๋˜๋ฉด model์€ ์ฃผ์–ด์ง„ ์งˆ๋ฌธ์— ๋Œ€ํ•œ ๋‹ต์„ ๋‹จ๋ฝ์—์„œ ์ถ”์ถœํ•จ
      • ์ž…๋ ฅ๊ฐ’์€ ์งˆ๋ฌธ-๋‹จ๋ฝ์Œ, ์ถœ๋ ฅ์€ ์‘๋‹ต์— ํ•ด๋‹นํ•˜๋Š” text์˜ ๋ฒ”์œ„
      • ๋‹จ๋ฝ ๋‚ด ๋‹ต์˜ ์‹œ์ž‘๊ณผ ๋ token(๋‹จ์–ด)์˜ ํ™•๋ฅ  ๊ณ„์‚ฐ
    • Named entity recognition (NER)
      • NER is the task of classifying named entities into predefined categories

4. BERT derivative models I

  • ALBERT
    • One of BERT's main problems is its size: with over a hundred million parameters, the model is hard to train and slow at inference. Growing the model improves performance but runs into compute resource limits
    • ALBERT was introduced to address this; it reduces the parameter count in two ways, cutting both training time and inference time
      • Cross-layer parameter sharing
        • One way to reduce the number of BERT parameters: instead of learning parameters for every encoder layer, only the first encoder layer's parameters are learned and then shared with all the other encoder layers
          • All-shared: share all parameters of the first encoder layer's sublayers with the remaining encoders
          • Shared feed-forward network: share only the first encoder layer's feed-forward network parameters with the feed-forward networks of the other encoder layers
          • Shared attention: share only the first encoder layer's multi-head attention parameters with the other encoder layers
      • Factorized embedding parameterization
        • Decomposes the embedding matrix into two smaller matrices (see the sketch below)
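A minimal sketch of the idea (illustrative sizes; V = vocabulary size, E = embedding size, H = hidden size): instead of one V×H matrix, use a V×E lookup followed by an E×H projection, which pays off when E << H:

```python
import torch.nn as nn

V, E, H = 30000, 128, 768   # vocab, embedding, hidden sizes (illustrative)

# BERT-style: one big V x H embedding matrix -> 30000 * 768 = 23.0M params
bert_embedding = nn.Embedding(V, H)

# ALBERT-style: V x E lookup plus E x H projection
#   30000 * 128 + 128 * 768 ≈ 3.9M params
albert_embedding = nn.Sequential(
    nn.Embedding(V, E),
    nn.Linear(E, H, bias=False),
)
```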
    • Training ALBERT
      • ALBERT keeps MLM but replaces the NSP task with sentence order prediction (SOP)
        • Using NSP for pre-training turned out not to be genuinely useful; it is not difficult enough compared to MLM
        • NSP combines topic prediction and coherence prediction into a single task
        • SOP drops topic prediction and considers only the coherence between sentences
      • Sentence order prediction
        • A binary classification task: given a pair of sentences, decide whether their order has been swapped (positive/negative). (NSP instead trains on predicting isNext/notNext for a sentence pair)
    • As with BERT, a pre-trained ALBERT model can be fine-tuned; it is a good alternative to BERT
  • RoBERTa
    • Its authors found that BERT was under-trained and changed the BERT pre-training recipe in several ways:
      • Dynamic masking instead of static masking in the MLM task
      • The NSP task is removed; pre-training uses the MLM task only
      • More training data
        • In addition to the Toronto BookCorpus and English Wikipedia used by BERT, CC-News, OpenWebText, and Stories are added (BERT: 16GB, RoBERTa: 160GB)
      • Training with a larger batch size
        • BERT: pre-trained for 1M steps with a batch size of 256; RoBERTa: pre-trained for 300K or 500K steps with a batch size of 8,000
        • Increasing the batch size speeds up training and improves model performance
      • Uses the BBPE tokenizer
  • ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately)
    • Trained with the replaced token detection task instead of MLM and NSP
    • Replaced token detection
      • Tokens that would otherwise be masked are replaced with other tokens, and the model classifies whether each token is the actual token or a replaced one
      • One problem with the MLM task is that the [MASK] token appears during pre-training but never during fine-tuning, creating a token mismatch between pre-training and fine-tuning; replaced token detection avoids it
        1. Randomly mask tokens and feed them to the generator
        2. Replace the masked input tokens with tokens produced by the generator and feed the result to the discriminator
        3. The discriminator classifies whether each given token is original or replaced
        4. After training, the generator is discarded and the discriminator is used as the ELECTRA model
    • In BERT's MLM task only 15% of the tokens are masked, so training focuses on predicting that 15%; ELECTRA instead classifies every token as original or replaced, so the training signal comes from all tokens
  • SpanBERT
    • Mainly used for tasks, such as question answering, that predict a span of text
    • Instead of masking individual tokens at random, a contiguous span of tokens is masked at random
    • Trained with MLM and the span boundary objective (SBO)
      • To predict a masked token, SBO does not use that token's own representation; it uses only the representations of the tokens at the span boundary, together with the position embedding of the masked token, which encodes the masked token's relative position within the span
      • MLM predicts a masked token using only that token's representation; SBO predicts it using the span boundary token representations plus the masked token's position embedding

5. BERT derivative models II: knowledge distillation

  • ์‚ฌ์ „ ํ•™์Šต๋œ BERT๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ ๋”ฐ๋ฅธ ๋ฌธ์ œ๋Š” ๊ณ„์‚ฐ ๋น„์šฉ์ด ๋งŽ์ด ๋“ค๊ณ  ์ œํ•œ๋œ resource๋กœ model์„ ์‹คํ–‰ํ•˜๊ธฐ๊ธฐ ๋งค์šฐ ์–ด๋ ค์›€.
  • ์‚ฌ์ „ ํ•™์Šต๋œ BERT๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ๋งŽ๊ณ  ์ถ”๋ก ์— ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ ค ํœด๋Œ€ํฐ๊ณผ ๊ฐ™์€ edge device์—์„œ ์‚ฌ์šฉ์ด ๋” ์–ด๋ ค์›€
  • ์ด๋ฅผ ์œ„ํ•ด ์‚ฌ์ „ ํ•™์Šต๋œ ๋Œ€ํ˜• BERT์—์„œ ์†Œํ˜• BERT๋กœ ์ง€์‹์„ ์ด์ „ํ•˜๋Š” ์ง€์‹ ์ฆ๋ฅ˜ ์‚ฌ์šฉ
  • ์ง€์‹ ์ฆ๋ฅ˜(Knowledge distillation)
    • ์‚ฌ์ „ ํ•™์Šต๋œ ๋Œ€ํ˜• model์˜ ๋™์ž‘์„ ์žฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ์†Œํ˜• model์„ ํ•™์Šต์‹œํ‚ค๋Š” model ์••์ถ• ๊ธฐ์ˆ 
    • ๊ต์‚ฌ-ํ•™์ƒ ํ•™์Šต(teacher-student learning)์ด๋ผ๋„ ํ•จ.(teacher: ์‚ฌ์ „ํ•™์Šต๋œ ๋Œ€ํ˜• model, student: ์†Œํ˜• model)
    • ๊ต์‚ฌnetwork
    • ์•”ํ‘ ์ง€์‹(dark knowledge): ํ™•๋ฅ ์ด ๋†’์€ ๋‹จ์–ด๋ฅผ ์„ ํƒํ•˜๋Š” ๊ฒƒ ์™ธ์—๋„ network๊ฐ€ ๋ฐ˜ํ™˜ํ•œ ํ™•๋ฅ  ๋ถ„ํฌ์—์„œ ์ถ”์ถœํ•œ ๋‹ค๋ฅธ ์œ ์šฉํ•œ ์ •๋ณด
    • ๊ต์‚ฌnetwork
    • Softmax temperature
      • Applying a softmax temperature at the output layer smooths the probability distribution: p_i = exp(z_i / T) / Σ_j exp(z_j / T)
      • T is the temperature; with T = 1 this is the ordinary softmax. Increasing T makes the probability distribution softer and exposes more information about the other classes (see the sketch below)
      • As a result, softmax temperature reveals the dark knowledge: the teacher network is pre-trained with softmax temperature, and distillation then transfers the dark knowledge from the teacher to the student
    • Training the student network
    • Teacher-student network
      • The student network is not pre-trained; only the teacher network is pre-trained, with softmax temperature
      • Soft target: the output of the teacher network; soft prediction: the prediction made by the student network
      • Distillation loss: the cross-entropy loss between the soft target and the soft prediction
      • Hard target: the label, with 1 for the correct class and 0 for everything else
      • Hard prediction: the probability distribution predicted by the student network with softmax temperature T = 1
      • Student loss: the cross-entropy loss between the hard target and the hard prediction
      • The final loss is a weighted sum of the student loss and the distillation loss (see the sketch below)
  • DistilBERT: the knowledge-distilled version of BERT
    • The student BERT is trained on the same dataset used to pre-train the teacher BERT (BERT-base)
    • Trained with the MLM task only
    • Uses dynamic masking
    • Trained with large batch sizes
    • Uses the distillation loss, the student loss, and a cosine embedding loss
      • Cosine embedding loss: measures the distance between the vectors output by the teacher and the student BERT. Minimizing it makes the student's embeddings more accurate while aligning them with the teacher's embeddings (see the sketch below)
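A minimal sketch of the cosine alignment term with torch.nn.CosineEmbeddingLoss (illustrative tensors; DistilBERT keeps the teacher's hidden size, so no projection is needed here):

```python
import torch
import torch.nn as nn

cos_loss = nn.CosineEmbeddingLoss()

teacher_h = torch.randn(8, 768)   # teacher hidden states (illustrative)
student_h = torch.randn(8, 768)   # student hidden states (same dimension)

# target = 1 -> pull each student vector toward its teacher counterpart
target = torch.ones(teacher_h.size(0))
loss = cos_loss(student_h, teacher_h, target)
```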
  • TinyBERT
    • Besides transferring knowledge from the teacher's output layer (prediction layer) to the student, TinyBERT also transfers knowledge from the embedding layer and the encoder layers, so the student acquires more information
    • TinyBERT knowledge distillation
      • Transformer layer (encoder layer)
        • Attention-based distillation
          • Transfers knowledge of the attention matrices from the teacher BERT to the student BERT. Attention matrices contain useful information, such as syntax and coreference, that helps with language understanding
          • The student network is trained by minimizing the mean squared error between the student's and the teacher's attention matrices
        • Hidden-state-based distillation
          • Performed by minimizing the mean squared error between the teacher's hidden states and the student's hidden states
      • Embedding layer (input layer)
        • Knowledge is transferred from the teacher's embedding layer to the student's embedding layer
        • Embedding-layer distillation minimizes the mean squared error between the student's and the teacher's embeddings
      • Prediction layer (output layer)
        • The logits produced by the teacher BERT's final output layer are transferred to the student BERT, similar to DistilBERT's distillation loss
        • Prediction-layer distillation minimizes the cross-entropy loss between the soft target and the soft prediction; the encoder-side MSE terms are sketched below
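A minimal sketch of the two encoder-side MSE terms (illustrative tensor shapes; the projection W_h maps student hidden states into the teacher's space, here with assumed sizes 312 and 768):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_student, d_teacher = 312, 768          # e.g. TinyBERT-4 vs BERT-base
W_h = nn.Linear(d_student, d_teacher)    # learnable projection for the student

attn_s = torch.rand(12, 64, 64)          # student attention matrices (illustrative)
attn_t = torch.rand(12, 64, 64)          # teacher attention matrices
h_s = torch.rand(64, d_student)          # student hidden states
h_t = torch.rand(64, d_teacher)          # teacher hidden states

attention_loss = F.mse_loss(attn_s, attn_t)   # attention-based distillation
hidden_loss = F.mse_loss(W_h(h_s), h_t)       # hidden-state-based distillation
```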
    • TinyBERT training
      • General distillation

        • A large pre-trained BERT (BERT-base) is used as the teacher, and distillation transfers its knowledge to a small student BERT (TinyBERT)
        • After distillation, the student BERT contains the teacher's knowledge; this pre-trained student BERT is called the general TinyBERT
      • Task-specific distillation

        • The general TinyBERT (the pre-trained TinyBERT) is fine-tuned for a specific task
        • Unlike DistilBERT, TinyBERT applies distillation not only at the pre-training stage but also at the fine-tuning stage
        • A pre-trained BERT-base model is fine-tuned for the specific task and then used as the teacher; the general TinyBERT is the student
        • After distillation, the general TinyBERT contains the teacher's task-specific knowledge (from the fine-tuned BERT-base); this general TinyBERT fine-tuned for the specific task is called the fine-tuned TinyBERT
        ์ผ๋ฐ˜ ์ฆ๋ฅ˜(์‚ฌ์ „ํ•™์Šต) Task ํŠนํ™” ์ฆ๋ฅ˜(Fine tuning)
        ๊ต์‚ฌ ์‚ฌ์ „ ํ•™์Šต๋œ BERT-base Fine tuning๋œ BERT-base
        ํ•™์ƒ ์ž‘์€ BERT ์ผ๋ฐ˜ TinyBERT(์‚ฌ์ „ ํ•™์Šต๋œ TinyBERT)
        ๊ฒฐ๊ณผ ์ฆ๋ฅ˜ ํ›„ ํ•™์ƒ BERT๋Š” ๊ต์‚ฌ๋กœ๋ถ€ํ„ฐ ์ง€์‹์„ ์ „์ˆ˜ ๋ฐ›์Œ. ์ด๋Š” ๊ณง ์‚ฌ์ „ ํ•™์Šต๋œ ํ•™์ƒ BERT. ์ฆ‰ ์ผ๋ฐ˜ TinyBERT๋ผ ํ•จ ์ฆ๋ฅ˜ ํ›„ ์ผ๋ฐ˜ TinyBERT๋Š” ๊ต์‚ฌ๋กœ๋ถ€ํ„ฐ Task ํŠนํ™” ์ง€์‹์„ ์ „์ˆ˜ ๋ฐ›์Œ. ์ด๋Š” task ํŠนํ™” ์ง€์‹์œผ๋กœ fine tunning๋œ TinyBERT๋ผ ํ•จ
      • Performing distillation at the fine-tuning stage generally requires more task-specific data, so data augmentation is used to obtain the dataset

  • BERT์—์„œ ์‹ ๊ฒฝ๋ง์œผ๋กœ ์ง€์‹ ์ „๋‹ฌ

6. Exploring BERTSUM for text summarization

  • Text summarization
    • The process of condensing a long text document into a short summary
      • Extractive summarization
        • Builds the summary by extracting only the important sentences from the given text; from a long document with many sentences, only the sentences carrying the document's essential meaning are selected
      • Abstractive summarization
        • Builds the summary by paraphrasing the given text; paraphrasing means re-expressing the text with different words to convey its meaning more clearly
        • Expresses the given text as new sentences, using different words that preserve only its meaning
  • Fine-tuning BERT for text summarization
    • Extractive summarization with BERT
      • If a [CLS] token is added at the beginning of every sentence, the representation of each [CLS] token can be used as the representation of that sentence
      • The model obtained by changing BERT's input format this way is called BERTSUM
      • Instead of training BERT from scratch, a pre-trained BERT model is used with the modified input format; every [CLS] token's representation then serves as the representation of its sentence
      • BERTSUM with a classifier: a classifier on top of each [CLS] representation decides whether to include the corresponding sentence in the summary
    • BERT๋ฅผ ์‚ฌ์šฉํ•œ ์ƒ์„ฑ ์š”์•ฝ
      • ์ƒ์„ฑ ์š”์•ฝ์„ ์ˆ˜ํ–‰ํ•˜๋Š”๋ฐ๋Š” transformer์˜ encoder-decoder architecture ์‚ฌ์šฉ
        • ์‚ฌ์ „ ํ•™์Šต๋œ BERTSUM์„ encoder๋กœ ํ™œ์šฉ
        • trasnformer model์€ encoder๊ฐ€ ์„œ์ „ ํ•™์Šต๋œ BERTSUM model์ด์ง€๋งŒ decoder๋Š” ๋ฌด์ž‘์œ„๋กœ ์ดˆ๊ธฐํ™”๋˜์–ด fine tuning์ค‘์— ๋ถˆ์ผ์น˜๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๊ณ , encoder๊ฐ€ ์‚ฌ์ „ํ•™์Šต ๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ณผ์ ํ•ฉ๋ ์ˆ˜ ์žˆ๊ณ , decoder๋Š” ๊ณผ์†Œ์ ํ•ฉ ๋ ์ˆ˜ ์žˆ์Œ
        • ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด Adam optimizer๋ฅผ encoder์™€ decoder์— ๊ฐ๊ฐ ์‚ฌ์šฉ. encoder์—๋Š” ํ•™์Šต๋ฅ ์„ ์ค„์ด๊ณ  ์ข€๋” ๋ถ€๋“œ๋Ÿฝ๊ฒŒ ๊ฐ์‡ ํ•˜๋„๋ก ์„ค์ •
  • ROUGE ํ‰๊ฐ€ ์ง€ํ‘œ ์ดํ•ดํ•˜๊ธฐ
    • ROUGE(Recall-Oriented Understudy for Gisting Evaluation): Text ์š”์•ฝ task์˜ ํ‰๊ฐ€ ์ง€ํ‘œ
    • ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, ROUGE-SU
    • ROUGE-N
      • ROUGE-N is the n-gram recall between the candidate summary (predicted summary) and the reference summary (ground-truth summary)
      • Recall = (number of overlapping n-grams) / (total number of n-grams in the reference summary)
        • Candidate summary: Machine learning is seen as a subset of artificial intelligence.
        • Reference summary: Machine learning is a subset of artificial intelligence.
      • ROUGE-1: the unigram recall between the candidate summary and the reference summary
        • Candidate summary unigrams: machine, learning, is, seen, as, a, subset, of, artificial, intelligence
        • Reference summary unigrams: machine, learning, is, a, subset, of, artificial, intelligence
        • ROUGE-1 = 8/8 = 1
      • ROUGE-2: the bigram recall between the candidate summary and the reference summary
        • Candidate summary bigrams: (machine, learning), (learning, is), (is, seen), (seen, as), (as, a), (a, subset), (subset, of), (of, artificial), (artificial, intelligence)
        • Reference summary bigrams: (machine, learning), (learning, is), (is, a), (a, subset), (subset, of), (of, artificial), (artificial, intelligence)
        • ROUGE-2 = 6/7 ≈ 0.857 (both scores are reproduced in the sketch below)
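A minimal sketch that reproduces the two scores above (plain whitespace tokenization; real ROUGE implementations add stemming and other normalization):

```python
from collections import Counter

def ngrams(text, n):
    """Count the n-grams of a sentence after simple normalization."""
    words = text.lower().replace(".", "").split()
    return Counter(zip(*(words[i:] for i in range(n))))

def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())   # clipped count of overlapping n-grams
    return overlap / sum(ref.values())     # recall: divide by reference count

candidate = "Machine learning is seen as a subset of artificial intelligence."
reference = "Machine learning is a subset of artificial intelligence."
print(rouge_n(candidate, reference, 1))    # 1.0
print(rouge_n(candidate, reference, 2))    # 0.857...
```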
    • ROUGE-L
      • Based on the longest common subsequence (LCS); the LCS of two sequences is their common subsequence of maximum length
      • A common subsequence shared by the candidate and reference summaries indicates that the candidate agrees with the reference
      • ROUGE-L is measured with the F-measure

7. Applying BERT to other languages

  • Understanding M-BERT
    • M-BERT can produce representations for many languages besides English
    • M-BERT is trained with MLM and NSP on the Wikipedias of 104 languages, not only English Wikipedia (because the amount of text differs per language, a sampling scheme is used: high-resource languages are under-sampled and low-resource languages are over-sampled)
    • M-BERT understands context across languages without any language-pair or aligned training data => crucially, M-BERT is trained without any cross-lingual objective!!
    • Properties
      • M-BERT's generalization ability does not depend on vocabulary overlap
      • M-BERT's generalization ability does depend on typological and linguistic similarity
      • M-BERT can handle code-switched text, but not transliterated text
  • XLM (Cross-lingual Language Model)
    • A BERT trained with cross-lingual objectives; XLM outperforms M-BERT at learning cross-lingual representations
    • Uses monolingual datasets as well as parallel datasets (cross-lingual datasets)
    • Training methods
      • Causal language modeling (CLM)
        • Predicts the probability of the current word given the set of preceding words
      • Masked language modeling (MLM)
        • Masks 15% of the tokens (applying the 80-10-10 rule) and predicts the masked tokens
      • Translation language modeling (TLM)
        • Trained on parallel cross-lingual data consisting of the same text in two different languages
        • Language embeddings are used to indicate the two different languages, and the two sentences use separate position embeddings
    • XLM pre-training
      • Using CLM
      • Using MLM
      • Using MLM combined with TLM
      • When training XLM with CLM or MLM, a monolingual dataset is used
      • TLM requires a parallel dataset
      • When MLM and TLM are used together, the objective alternates between MLM and TLM
    • A pre-trained XLM can be used directly or fine-tuned on downstream tasks, just like BERT
  • Understanding XLM-R
    • XLM-RoBERTa, a state-of-the-art technique for learning cross-lingual representations
    • An extended version of XLM with several improvements for better performance:
      • Trained with MLM only; TLM is not used, so only monolingual datasets are needed
      • Uses 2.5TB of CommonCrawl data
  • Language-specific BERT models

8. Sentence-BERT and domain-BERT

  • Learning sentence representations with Sentence-BERT
    • A pre-trained BERT (or derivative) used to obtain fixed-length sentence representations
    • Vanilla BERT can also produce sentence representations, but its inference time is high; Sentence-BERT was designed to improve on this
    • Widely used for sentence-pair classification, computing the similarity of two sentences, and so on
    • [CLS] token
      • The problem with using the [CLS] token's representation as the sentence representation is that it is not accurate, especially when the pre-trained BERT is used directly without fine-tuning
    • Pooling
      • Compute the sentence representation by pooling the representations of all tokens (see the sketch below)
        • Mean pooling: captures the meaning of all words (tokens)
        • Max pooling: captures the meaning of the important words (tokens)
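A minimal sketch of both pooling strategies over BERT's token outputs (hidden is a (batch, seq, dim) tensor, mask the matching attention mask):

```python
import torch

def mean_pool(hidden, mask):
    """Average the token vectors, ignoring [PAD] positions."""
    m = mask.unsqueeze(-1).float()
    return (hidden * m).sum(1) / m.sum(1)

def max_pool(hidden, mask):
    """Element-wise max over token vectors, ignoring [PAD] positions."""
    h = hidden.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
    return h.max(dim=1).values
```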
  • sentence-BERT ์ดํ•ดํ•˜๊ธฐ
    • ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šต์‹œํ‚ค์ง€ ์•Š๊ณ  ์‚ฌ์ „ ํ•™์Šต๋œ BERT(๋˜๋Š” ํŒŒ์ƒmodel)์„ ์„ ํƒํ•ด ๋ฌธ์žฅ ํ‘œํ˜„์„ ์–ป๋„๋ก fine tuningํ•จ
    • ์ฆ‰, sentence-BERT๋Š” ๋ฌธ์žฅ ํ‘œํ˜„์„ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด fine tuning๋œ BERT model
    • ์ƒด ๋„คํŠธ์›Œํฌ
      • ๋ฌธ์žฅ ์Œ ๋ถ„๋ฅ˜ task: ๋‘ ๋ฌธ์žฅ์ด ์œ ์‚ฌํ•œ์ง€ ์•„๋‹Œ์ง€๋ฅผ ๋ถ„๋ฅ˜
      • sentence-BERT
      • ๋ฌธ์žฅ ์Œ ํšŒ๊ท€ task: ๋‘ ๋ฌธ์žฅ ์‚ฌ์ด์˜ ์˜๋ฏธ ์œ ์‚ฌ๋„ ์˜ˆ์ธก
      • sentence-BERT
    • ํŠธ๋ฆฌํ”Œ๋ › ๋„คํŠธ์›Œํฌ
      • ๊ธฐ์ค€๋ฌธ๊ณผ ๊ธ์ •๋ฌธ ์‚ฌ์ด์˜ ์œ ์‚ฌ๋„๊ฐ€ ๋†’์•„์•ผ ํ•˜๊ณ  ๊ธฐ์ค€๋ฌธ๊ณผ ๋ถ€์ •๋ฌธ ์‚ฌ์ด์˜ ์œ ์‚ฌ๋„๊ฐ€ ๋‚ฎ์•„์•ผ ํ•˜๋Š” ํ‘œํ˜„ ๊ณ„์‚ฐ
      • sentence-BERT
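A minimal usage sketch with the sentence-transformers library (the all-MiniLM-L6-v2 checkpoint is an assumed, commonly available choice):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode two sentences into fixed-length vectors and compare them
emb = model.encode(["I love Paris", "Paris is my favorite city"])
print(util.cos_sim(emb[0], emb[1]))   # cosine similarity, e.g. roughly 0.7
```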
  • domain-BERT
    • A BERT trained on a corpus from a specific domain

9. VideoBERT, BART

  • Learning language and video representations with VideoBERT
  • VideoBERT learns video representations alongside language representations; it is used for image caption generation, video captioning, predicting the next frames of a video, and more
  • Pre-training
    • Pre-training uses instructional videos; language tokens and visual tokens are extracted from the videos
    • Cloze task
    • Linguistic-visual alignment
      • Predicts whether the language and visual tokens are temporally aligned, i.e., whether the text (language tokens) matches the video (visual tokens)
      • The [CLS] token's representation is used to predict whether the language and visual tokens are aligned with each other
  • VideoBERT ์‘์šฉ
    • VideoBERT
    • ๋‹ค์Œ ์‹œ๊ฐ token ์˜ˆ์ธก
      • ์‹œ๊ฐ token์„ ์ž…๋ ฅํ•ด ์ƒ์œ„ 3๊ฐœ์˜ ๋‹ค์Œ ์‹œ๊ฐ token ์˜ˆ์ธก
    • Text-Video ์ƒ์„ฑ
      • Text๊ฐ€ ์ฃผ์–ด์ง€๋ฉด ํ•ด๋‹นํ•˜๋Š” ์‹œ๊ฐ token ์ƒ์„ฑ
    • video ์ž๋ง‰
      • Video๋ฅผ ์ž…๋ ฅํ•˜๋ฉด ์ž๋ง‰ ์ƒ์„ฑ
  • BART ์ดํ•ดํ•˜๊ธฐ
    • Facebook AI์—์„œ ๋„์ž…ํ•œ transformer architecture๊ธฐ๋ฐ˜์˜ noise ์ œ๊ฑฐ autoencoder
    • ์†์ƒ๋œ text๋ฅผ ์žฌ๊ตฌ์„ฑํ•ด ํ•™์Šต
    • ์‚ฌ์ „ ํ•™์Šต๋œ BART๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์—ฌ๋Ÿฌ downstream task์— fine tuning ๊ฐ€๋Šฅ.
    • Text ์ƒ์„ฑ์— ๊ฐ€์žฅ ์ ํ•จ
    • RoBERTa์™€ ๋น„์Šทํ•œ ์„ฑ๋Šฅ
    • Architecture
      • A transformer model with both an encoder and a decoder
      • Corrupted text is fed to the encoder -> the encoder learns a representation of the given text and passes it to the decoder -> the decoder takes the representation produced by the encoder and reconstructs the original, uncorrupted text
      • The encoder is bidirectional; the decoder is unidirectional
      • Reconstruction loss: training minimizes the cross-entropy loss between the original text and the text generated by the decoder
    • Noising techniques (two of them are sketched below)
      • Token masking: randomly mask a few tokens
      • Token deletion: randomly delete some tokens
      • Text infilling: mask a contiguous set of tokens with a single [MASK] token
      • Sentence permutation: randomly shuffle the order of the sentences
      • Document rotation: randomly pick a token that will become the start of the document, then move all tokens before it to the end of the document
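A minimal sketch of two of these corruptions over a token list (illustrative only; BART applies such noise to token IDs during pre-training):

```python
import random

def token_deletion(tokens, p=0.15):
    """Randomly drop tokens with probability p."""
    return [t for t in tokens if random.random() > p]

def document_rotation(tokens):
    """Pick a random start token and rotate the document to begin there."""
    k = random.randrange(len(tokens))
    return tokens[k:] + tokens[:k]

tokens = "the sun rises in the east".split()
print(token_deletion(tokens))     # e.g. ['the', 'rises', 'in', 'the', 'east']
print(document_rotation(tokens))  # e.g. ['in', 'the', 'east', 'the', 'sun', 'rises']
```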

10. Korean language models: KoBERT, KoGPT2, KoBART

  • KoBERT
  • KoGPT2
  • KoBART