Week14 Day1 - ai-esg/our-history GitHub Wiki

Team NLP Group 11 Week14 Day1

Date

  • 2021.11.1 (Mon)

Team members

  • 문석암_T2075
  • 박마루찬_T2078
  • 박아멘_T2090
  • 우원진_T2137
  • 윤영훈_T2142
  • 장동건_T2185
  • 홍현승_T2250

Weekly schedule

Retrieval performance

  • Sparse : top5 ~ 58
  • Elastic : top5(84.8), top10(89)
  • Dense : top5 ~ 37
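
The top-k figures above are retrieval accuracies. As a rough sketch of how such a number can be computed, the function below counts the share of queries whose gold passage appears among the first k retrieved passages. The function name, the use of passage IDs, and the toy data are assumptions for illustration, not the team's evaluation code or results.

```python
from typing import Sequence


def top_k_accuracy(
    retrieved: Sequence[Sequence[str]], gold: Sequence[str], k: int
) -> float:
    """Percentage of queries whose gold passage is in the first k retrieved passages."""
    if len(retrieved) != len(gold):
        raise ValueError("retrieved and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for docs, answer in zip(retrieved, gold) if answer in docs[:k])
    return 100.0 * hits / len(gold)


# Toy data, purely illustrative: one ranked list of passage IDs per query.
retrieved = [["d1", "d7", "d3"], ["d4", "d2", "d9"], ["d5", "d6", "d8"]]
gold = ["d3", "d2", "d0"]

print(top_k_accuracy(retrieved, gold, k=1))
print(top_k_accuracy(retrieved, gold, k=3))
```

If passages are compared by text rather than by ID, the same preprocessing would have to be applied to both sides before the membership test.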

Progress

  • Elastic
    • Done. Changing index_config may yield further gains.
  • Dense
    • When training for one epoch or more, redo the sampling.
    • Overfitting starts once training passes 1 epoch (although overfitting here does not mean training acc reaches ~90).
  • BM25
    • It runs, but performance is extremely poor.
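
Since the BM25 retriever runs but scores poorly, a small reference implementation can help when sanity-checking the ported code. The sketch below is a minimal Okapi BM25 scorer under a few assumptions: documents are already tokenized, `k1=1.5` and `b=0.75` are common defaults rather than the team's settings, and the IDF uses the non-negative Lucene-style variant. The class name and toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import List


class BM25:
    """Minimal Okapi BM25 over pre-tokenized documents."""

    def __init__(self, corpus: List[List[str]], k1: float = 1.5, b: float = 0.75):
        if not corpus:
            raise ValueError("corpus must not be empty")
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(doc) for doc in corpus]
        self.doc_lens = [len(doc) for doc in corpus]
        self.avgdl = sum(self.doc_lens) / len(corpus)
        n = len(corpus)
        # Document frequency: number of documents containing each term.
        df = Counter(term for doc in corpus for term in set(doc))
        # Lucene-style IDF, which stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: List[str], idx: int) -> float:
        tf = self.term_freqs[idx]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[idx] / self.avgdl)
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in query
            if t in tf
        )

    def top_k(self, query: List[str], k: int) -> List[int]:
        scored = [(self.score(query, i), i) for i in range(len(self.term_freqs))]
        return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))[:k]]


# Toy corpus with whitespace-style tokens, purely illustrative.
corpus = [
    ["seoul", "capital"],
    ["busan", "port", "city"],
    ["seoul", "population", "city"],
]
bm25 = BM25(corpus)
print(bm25.top_k(["seoul", "capital"], k=2))
```

For Korean text the tokenizer matters a great deal; whitespace splitting leaves particles attached to nouns, which is one plausible cause of weak BM25 scores.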

To do

  • preprocessing

    • Applying preprocessing to the wiki and the context improves retrieval performance by about 0.8%. It would be worth applying it when training the reader as well.
  • post processing: remove particles (josa)

    • Trailing ones only. Keep particles in the middle.
  • BM25 (장동건, 박아멘)

    • Study an existing, complete implementation and port it carefully.
  • Dense

    • Train with the top-k results from sparse, Elastic, and dense retrieval
    • Check performance at under 1 epoch
  • https://www.sbert.net/docs/training/overview.html

    • Top-30 performance is only about 48%. Very poor.
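
The particle-removal item above (strip only a trailing particle, keep those in the middle) could be sketched as a simple suffix check. The particle list here is a hypothetical, non-exhaustive assumption, and the function name is made up for illustration. Naive suffix stripping can also clip answers that legitimately end in one of these syllables, so a morphological analyzer would likely be safer in practice.

```python
# Hypothetical, non-exhaustive list of Korean particles (josa).
JOSA = ["에서", "으로", "에게", "은", "는", "이", "가", "을", "를", "의", "에", "로", "와", "과", "도"]

# Try longer particles first so that "에서" is matched before "에".
TRAILING_JOSA = sorted(JOSA, key=len, reverse=True)


def strip_trailing_josa(answer: str) -> str:
    """Remove at most one particle from the end of the answer; leave the middle untouched."""
    text = answer.strip()
    for josa in TRAILING_JOSA:
        # Never strip the whole answer away.
        if text.endswith(josa) and len(text) > len(josa):
            return text[: -len(josa)]
    return text


print(strip_trailing_josa("세종대왕이"))
print(strip_trailing_josa("대한민국의 수도는"))
```

In the second call the particle inside the phrase is preserved and only the final one is dropped.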

Ensemble

  • Feeding the reader a different number of input docs (n) could also be counted as a separate model.

  • For the reader, we could pick about 4 of klue/roberta-large, klue/roberta-base, KoELECTRA, xlm-roberta-large, and XLNet.

  • What needs to be prepared:

    • Reader model

      • klue/roberta-large (done)
      • klue/roberta-base
      • KoELECTRA
      • xlm-roberta-large
      • XLNet
    • Retrieval model (top-5 should exceed 60)

      • Elastic (done)
      • Dense + elastic(top 100)
      • Sparse + BM25
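
The notes do not say how the reader outputs would be combined, so the following is only one possible approach: soft voting, where each reader (a different model, or the same model fed a different n docs) contributes candidate answers with probabilities, and the answer with the highest summed probability wins. The function name, the data layout, and the toy predictions are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One candidate answer from one reader: (answer text, probability).
Prediction = Tuple[str, float]


def soft_vote(predictions: List[List[Prediction]]) -> str:
    """Sum probabilities per answer string across readers and return the best answer."""
    totals: Dict[str, float] = defaultdict(float)
    for candidates in predictions:
        for text, prob in candidates:
            totals[text] += prob
    if not totals:
        raise ValueError("no predictions to combine")
    # On ties, max keeps the answer that was seen first.
    return max(totals, key=totals.get)


# Toy predictions for a single question, purely illustrative.
reader_outputs = [
    [("Seoul", 0.6), ("Busan", 0.3)],  # e.g. one reader model
    [("Busan", 0.5), ("Seoul", 0.4)],  # e.g. another reader model
    [("Seoul", 0.7)],                  # e.g. the first model with a different n docs
]
print(soft_vote(reader_outputs))
```

Answers are matched by exact string here, so normalizing them first (for example, stripping trailing particles) would let near-identical answers pool their votes.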

Let's start cleaning up the code.

Peer session