
2021-09-24 ํ”ผ์–ด์„ธ์…˜ ํšŒ์˜๋ก

Electra vs BERT์—์„œ, ํŒŒ๋ผ๋ฉ”ํ„ฐ ์ˆ˜์— ์žˆ์–ด์„œ Electra๊ฐ€ ๋” ๊ฐ€๋ณ๋‹ค๊ณ  ํ•œ๋‹คโ€ฆ ์™œ?

ELECTRA is a new pretraining approach which trains two transformer models: the generator and the discriminator. The generatorโ€™s role is to replace tokens in a sequence, and is therefore trained as a masked language model. The discriminator, which is the model weโ€™re interested in, tries to identify which tokens were replaced by the generator in the sequence.

Masked language modeling (MLM) pretraining methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pretraining task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pretraining task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out.
  • Generator๋„ train์‹œํ‚ค๋Š” ๊ฑด๊ฐ€ ์•„๋‹ˆ๋ฉด discriminator๋งŒ train์‹œํ‚ค๋Š” ๊ฑด๊ฐ€? ์ด๊ฑด ๋ฉ˜ํ† ๋‹˜ํ•œํ…Œ ์—ฌ์ญค๋ณด์ž.

CoLA ๋ชจ๋ธ Train์‹œํ‚ฌ ๋•Œ

  • Train์„ ํ•  ๋•Œ evaluation loss๊ฐ€ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์—ˆ๋‹คโ€ฆ
  • Matthewโ€™s correlation์ด ์ค‘๊ฐ„์— ํŠ€๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์—ˆ๋‹ค?

Positional Embedding์„ ๋”ฐ๋กœ ์ž…๋ ฅํ•ด์•ผ ํ•˜๋‚˜?

  • WPE๊ฐ€ ์ด๋ฏธ ๋ชจ๋ธ์—์„œ ์žˆ๋Š” ๊ฑธ ๋ณด๊ณ  ๋”ฐ๋กœ ์ž…๋ ฅํ•  ํ•„์š” ์—†๋‹ค๊ณ  ์ƒ๊ฐํ–ˆ์Œ.

Black Formatter

  • 76์ž ์ด์ƒ ๋„˜์–ด๊ฐ€๋ฉด ๋‹ค์Œ ์ค„๋กœ ๋„˜์–ด๊ฐ„๋‹ค๊ณ  ์•Œ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

GPT-2 vs. GPT-3? What is the baseline model?

  • ์ˆ˜์—…์—์„œ๋Š” self attention block, batch size๋ฅผ ๋” ์Œ“์•˜๋‹ค๊ณ  ํ•จ.
  • ๊ทผ๋ฐ SKT config.json์—์„œ๋Š” GPT-2๋ผ๊ณ  ํ•˜๋Š”๋ฐโ€ฆ ๊ตฌ์กฐ๊ฐ€ ๋น„์Šทํ•ด์„œ ์ƒ๊ด€ ์—†๋Š” ๊ฑด๊ฐ€?

GPT-3 baseline ๋ชจ๋ธ ์‚ฌ์šฉ ์—ฌ๋ถ€?

  • ๊ทผ๋ฐ GPT-3 Baseline์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒŒ ๋งž๋‚˜ ์‹ถ์Œ. ๋„ˆ๋ฌด ์˜ค๋ž˜ ๊ฑธ๋ ค. ํ•œ epoch์— 3์‹œ๊ฐ„์ด ๊ฑธ๋ฆฌ๋”๋ผ๊ณ .
  • Baseline ๋ชจ๋ธ์€ Decoder์—ฌ์„œ ๋‹ค์Œ ๋‹จ์–ด ํ•™์Šตํ•˜๋Š” ๊ฑฐ๋ผ์„œ ์ ์ ˆํ•˜์ง€ ์•Š๋‹ค๊ณ  ์ƒ๊ฐํ–ˆ์Œ. WiC Task๋ž‘ ํฌ๊ฒŒ ๊ด€๋ จ ์—†๋Š” ๊ฒƒ ๊ฐ™์•„์„œ. ์ €๊ฑธ hidden state๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ์—๋„ ์• ๋งคํ•˜๊ณ .
  • ์ธ๊ณผ์ถ”๋ก ์—๋Š” ๊ดœ์ฐฎ์„ ์ˆ˜๋„ ์žˆ๋Š”๋ฐ, ๋™ํ˜•์ด์˜์–ด ํŒ๋‹จ์—๋Š” ๋ณ„๋กœ์ธ ๋“ฏ.

๊ทผ๋ฐ ์„ฑ๋Šฅ์€ ๋ชจ๋“  ๋ฉด์—์„œ outperformํ•˜๋Š” ๊ฒƒ ๊ฐ™๊ธฐ๋„ ํ•˜๊ณ โ€ฆ (BoolQ, CoPA, WiC)

๋ชจ๋ธ์— ์ž…๋ ฅํ•  ๋•Œ Padding์„ ์ž…๋ ฅ์‹œ์ผœ์„œ ํ•™์Šต์„ ์‹œ์ผœ์•ผ ํ•˜๋‚˜์š”?

Batch size ๋ผ๋ฆฌ๋Š” Padding ๋„ฃ์–ด์„œ ๊ธธ์ด ๋งž์ถ”๋Š” ๊ฒŒ ๋งž๋Š” ๊ฒƒ ๊ฐ™์•„์š”. Tokenizer์— paddingํ•˜๋Š” ๊ฑด ๋ฌธ์žฅ์—์„œ ์ œ์ผ ๊ธด ๊ฑธ ๊ธฐ์ค€์œผ๋กœ ์ฑ„์šฐ๋Š” ๊ฒƒ ๊ฐ™์€๋ฐ. Train ์‹œํ‚ฌ ๋•Œ batch ๋‹จ๊ณ„์—์„œ ๊ธธ์ด๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์•„์š”.

  • ๋ฌธ์žฅ ๊ธธ์ด๊ฐ€ ๊ณ„์† ๋‹ฌ๋ผ์ง€์ž–์•„์š”. Input feature์„ ์–ด๋–ป๊ฒŒ ์žก์•„์•ผ ํ•  ์ง€ ๋ชจ๋ฅด๊ฒ ๋”๋ผ๊ณ ์š”. Hidden state๊ฐ€ token ๋ณ„๋กœ ์—ฌ๋Ÿฌ ๊ฐœ๊ฐ€ ์žˆ์„ํ…๋ฐ. ๋ฌธ์žฅ ๋ณ„๋กœ ๊ธธ์ด๊ฐ€ ๋‹ฌ์•„์งˆํ…๋ฐ.

-> Electra๋กœ ํ•˜๋ฉด ๋งจ์•ž์— ๋ถ€๋ถ„์ด class token์œผ๋กœ tokenize๊ฐ€ ๋ผ์š”. ๊ทธ ๋‹ค์Œ์— ๋ฌธ์žฅ์ด ์ด์–ด์ง€๊ฑฐ๋“ ์š”? Classification head๋กœ ๊ฐ€๋ณด๋‹ˆ๊นŒ. 0๋ฒˆ์งธ๊ฐ€ ๊ฐ’๋งŒ ๊ฐ€์ง€๊ณ  ๊ทธ๋ ‡๊ฒŒ ์ถœ๋ ฅ์ด ๋˜๋”๋ผ๊ณ ์š”.

์ฃผ๋ง ๋™์•ˆ์— ์–ด๋–ป๊ฒŒ ํ•˜๋Š” ๊ฒŒ ์ข‹์„๊นŒ?: CoLA๋งŒ ํ•ฉ์‹œ๋‹ค.

  • ์˜์ง„
    • HanBERT vs KoBART vs KoElectra
    • CoLA์—์„œ Fine tuningํ•˜๋Š” ๋ฐฉ์‹์„ ์—ฌ๋Ÿฌ ๊ฐœ ํ•ด๋ณผ๊ฒŒ์š”
    • train์ด๋ž‘ dev๋ฅผ ํ•ฉ์ณ์„œ ํ•œ๋‹ค๋“ ์ง€
  • ์ค€ํ™: CoLA์—์„œ ๋ชจ๋ธ ๋‹ค์–‘ํ•˜๊ฒŒ ์‚ฌ์šฉํ•ด๋ณด๊ณ  ์‹ถ์–ด์„œโ€ฆ ๊ทผ๋ฐ freeze ์‹œํ‚ค๋Š” ํšจ์šฉ์„ฑ์ด ์žˆ์„์ง€ ์—†์„์ง€ ์ž˜ ๋ชจ๋ฅด๊ฒ ์Œ. Classifier๋งŒ train์‹œ์ผœ์•ผ ํ•˜๋‚˜? NLP์— ์ต์ˆ™ํ•ด์ง€๋Š” ๊ฒŒ ์ตœ์ข… ๋ชฉํ‘œ.
  • ์„ฑ์šฑ: CoLA ๋ง๊ณ  ๋‹ค๋ฅธ Task๋“ค์€ ๋งˆ์ง€๋ง‰ layer์„ ์ˆ˜์ •์„ ํ•ด์•ผ ํ•˜๋Š” ๊ฑด๊ฐ€์š”? ๋งˆ์ง€๋ง‰ Layer ๊ต์ฒดํ•˜๋Š” ๊ฒƒ ๊ฐ™๊ณ  ์‹คํ—˜์„ ํ•ด๋ณผ ๊ฒƒ ๊ฐ™์Œ.
    • ์žฌ์˜: Automodel์—์„œ head๋ฅผ ๋ถ™์ด๋Š” ๊ฑธ ๋ดค๊ฑฐ๋“ ์š”. Automodel class์—์„œ sequence classification ๋“ฑ ๋‹ค๋ฅธ head๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ๊ฐ€ ๋ถ™์–ด์žˆ๋Š”๋ฐ ๊ทธ๊ฑธ ์กฐ์ ˆํ•ด์„œ output์— ๋งˆ์ง€๋ง‰์— ์ˆ˜์ •ํ•ด์„œ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋‹ค.
  • ์—ฐ์ฃผ: ์ „๋ฐ˜์ ์œผ๋กœ NLP ๊ณต๋ถ€ํ•˜๊ณ  CoLA์— ํ•˜๋‚˜์”ฉ ์ ์šฉ
  • ํ˜„์ˆ˜: CoLA๋ฅผ ๋Œ๋ฆด ๋•Œ ์—ฌ๋Ÿฌ ๋ชจ๋ธ ๋Œ๋ฆฌ๋ฉด์„œ ์‹คํ—˜ํ•˜๋Š” ์ค‘
  • ์„ธํ˜„: ์›๋ž˜๋Š” WiC ์ข€ ๋” ๋ณด๋ ค๊ณ  ํ–ˆ๋Š”๋ฐ ์ผ๋‹จ ํ†ต์ผํ•˜์ž๊ณ  ํ•˜์…จ์œผ๋‹ˆโ€ฆ ๋ชจ๋ธ์„ ์–ด๋–ค ๊ฑฐ ์จ์•ผํ•  ์ง€ ์ž˜ ๋ชจ๋ฅด๊ฒ ๋„ค์š”. ์‹คํ—˜์„ ํ•ด์•ผ ํ•  ๊ฒƒ ๊ฐ™์•„์š”. Task๋ฅผ ๋”ฑ ์ •ํ•˜๊ณ  ํ•  ์ˆ˜ ์žˆ๋Š” stage๊ฐ€ ์•„๋‹ ๊ฒƒ ๊ฐ™๋‹ค๋Š” ๋Š๋‚Œ์ด ๋“œ๋„ค์š”. CoLA ๋ฐ์ดํ„ฐ๋ฅผ ์•„์ง ์•ˆ ๋‹ค๋ค„๋ด์„œ ์‹คํ—˜ํ•˜๋Š” ๋‹จ๊ณ„๋กœ ใ…Žใ…Ž
  • ์žฌ์˜: ์ €๋„ ๋ง‰ ์˜จ๊ฐ– ๋ชจ๋ธ ์ ์šฉํ•ด๋ณด๋ ค๊ณ ์š”.

๋ชจ๋ธ ๋ฆฌ์ŠคํŠธ๋Š”?

์ฐธ๊ณ ๋กœ Electra์™ธ์— BERT ๊ณ„์—ด ๋ชจ๋ธ ์„ฑ๋Šฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค

| Model | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | Korean-Hate-Speech (Dev) (F1) |
|---|---|---|---|---|---|---|---|
| KoBERT | 89.59 | 87.92 | 81.25 | 79.62 | 81.59 | 94.85 | 66.21 |
| HanBERT | 90.06 | 87.70 | 82.95 | 80.32 | 82.73 | 94.72 | 68.32 |
| kcbert-base | 89.87 | 85.00 | 67.40 | 75.57 | 75.94 | 93.93 | 68.78 |
| KoELECTRA-Base-v3 | 90.63 | 88.11 | 84.45 | 82.24 | 85.53 | 95.25 | 67.61 |
| albert-kor-base | 89.45 | 82.66 | 81.20 | 79.42 | 81.76 | 94.59 | 65.44 |
| bert-kor-base | 90.87 | 87.27 | 82.80 | 82.32 | 84.31 | 95.25 | 68.45 |
| electra-kor-base | 91.29 | 87.20 | 85.50 | 83.11 | 85.46 | 95.78 | 66.03 |
| funnel-kor-base | 91.36 | 88.02 | 83.90 | 84.52 | | 95.51 | 68.18 |
โš ๏ธ **GitHub.com Fallback** โš ๏ธ