Week15 Day2 - ai-esg/our-history GitHub Wiki

Team NLP Group 11 Week15 Day2

Table of contents

Date

  • 2021.11.09 (Tue)

Team members

  • 문석암_T2075
  • 박마루찬_T2078
  • 박아멘_T2090
  • 우원진_T2137
  • 윤영훈_T2142
  • 장동건_T2185
  • 홍현승_T2250

Weekly schedule

Peer session

Final project

  • Web-based service (not finalized)

  • Come up with ideas (Tue)

    • Sentiment classification?
    • Summarization?
    • Automatic question-answering bot?
    • Think it through as far as API serving? (BE)

Goals for the week

Tue

Wed

  • Let's do the tagging!
  • Pick 10 Relations. - For each Relation, write down the obj and sbj entities.
To do before meeting
  • Bring Relation candidates. 4 each? Include the rationale.

Thu? Fri?

  • Write the guidelines. Submit on Friday.
  • Submit the Relation map.
Discussion points
  • Where do we start tagging from?
  1. Sentence splitting

0-1. entity extraction

석암
  • sentence segmentation

    • Crawled the data again so that the sentences themselves contain less noise
    • KSS split
          use_heuristic=False,
          use_quotes_brackets_processing=True,
          everything else left at default
      
  • NER

    • pororo ner

      • Used the defaults
      • Configured quantity
    • Removal can be turned on for the following cases

      • When a sentence has no tags at all
      • When a sentence has fewer than 2 tags
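
The removal rule above (drop a sentence with no tags, or with fewer than 2 tags) can be sketched as below. This is a minimal illustration, not the code the team ran: it assumes the tagger returns a list of (text, tag) pairs with "O" marking non-entity spans, and the function names and sample tags are hypothetical.

```python
from typing import List, Tuple

# Assumed tagger output shape: (surface text, tag), where "O" means "not an entity".
NerOutput = List[Tuple[str, str]]


def count_entities(tagged: NerOutput, outside_tag: str = "O") -> int:
    """Count the spans whose tag is not the 'outside' tag."""
    return sum(1 for _, tag in tagged if tag != outside_tag)


def keep_sentence(tagged: NerOutput, min_entities: int = 2) -> bool:
    """Keep a sentence only if it has at least `min_entities` tagged entities."""
    return count_entities(tagged) >= min_entities


# Hypothetical tagger output for two sentences (tag names are placeholders).
tagged_ok = [("그래픽 카드", "TERM"), ("의 역사는 ", "O"), ("1960년", "DATE"), ("대로 거슬러 올라간다.", "O")]
tagged_bad = [("수식은 그대로 남김.", "O")]

print(keep_sentence(tagged_ok))   # True: two entities
print(keep_sentence(tagged_bad))  # False: no entities
```

Setting `min_entities=1` gives the looser "no tags at all" rule; relation extraction needs a subject and an object, which is presumably why 2 is the stricter threshold.
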
마루찬
  • sentence segmentation
    • lines = f.readlines(), for line in lines, split_sentences(line) (default settings)
    • Only sentences containing a period are kept, via if '.' in sentence:
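
A rough sketch of the line-by-line approach above, under stated assumptions: `split_sentences` in the notes is presumably the kss function, but here a simple regex splitter (`naive_split`, hypothetical) stands in so the example runs without extra dependencies. The real splitter can be passed in through the `split` argument.

```python
import re
from typing import Callable, Iterable, List


def naive_split(line: str) -> List[str]:
    """Stand-in splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", line.strip())
    return [p for p in parts if p]


def segment_lines(lines: Iterable[str],
                  split: Callable[[str], List[str]] = naive_split) -> List[str]:
    """Split each line into sentences and keep only those containing a period."""
    kept = []
    for line in lines:
        for sentence in split(line):
            if "." in sentence:
                kept.append(sentence)
    return kept


# Two lines as f.readlines() might return them (sample text is illustrative).
sample = ["그래픽 카드의 역사는 1960년대로 거슬러 올라간다. 같이 보기\n", "목차\n"]
print(segment_lines(sample))
```

The period check is a cheap way to discard headings and word lists, at the cost of also dropping sentences that end in "?" or "!".
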
박아멘
  • sentence segmentation
    • Split on periods and \n, then cleaned up by hand
    • Formulas were left as they are.
    • Afterwards, sentences that belong together were rejoined. (by hand)
  • NER tagging
    • Used pororo
    • Sentences with no entity were removed. (A lot of meaningful data is lost this way too)
원진
  • sentence segmentation
    • Bare word lists were deleted by hand
    • "\n" -> " ", then kss
  • pororo ner tagging
영훈
  • sentence segmentation
    • "\n" split -> mecab punctuation split
  • sentence filtering (a sentence is dropped in any of these cases)
    • It contains no sentence-final ending
    • pororo ner tagging finds fewer than 2 entities in it
    • It contains no verb or adjective
동건
  • sentence segmentation
    • Kkma.sentences from konlpy.tag
현승
  • sentence segmentation -> split first, then kss
  • From the split sentences, extracted the ones that can be regarded as real sentences -> used a function called morph_filter, which runs mecab internally (on the premise that a Korean sentence must contain a noun, a verb, and an adjective to be complete, any sentence missing one of the three is deleted)
  • pororo ner tagging

-> sentence_entity.csv

10 labels - sub <-> obj

  • Should we extract multiple relations from a single sentence?
  1. While annotating, also repair sentences whose entities are defective
  • en
  • re
    • 10 labels
      • sub - obj

Data

Special mission recommended TimeLine. (Topic: computer science)

1. ~ Week 1 Wednesday
  • Campers choose for themselves the "Relation"s worth considering when doing relation extraction on the documents of each topic, along with the corresponding "Entity type"s.

  • ex) Something like building relations between the words in a file?

    • 그래픽 카드.txt
      • <그래픽 카드>의 역사는 <1960년>대로 거슬러 올라간다. (The history of the <graphics card> goes back to the <1960>s.)
      • 비디오 어댑터(video adapter), 디스플레이 카드(display card), 그래픽 보드(graphics board), 디스플레이 어댑터(display adapter), 그래픽 어댑터(graphics adapter)라고도 부른다. (It is also called a video adapter, display card, graphics board, display adapter, or graphics adapter.) ->
        • <그래픽 카드> - <비디오 어댑터> -> name : alias
        • <그래픽 카드> - <디스플레이 카드> -> name : alias
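
One way to hold the pairs from this example is a small record per (subject, object, relation), so that a single sentence can yield several rows. This is only a sketch: the class, the function, and the "alias" label string are hypothetical placeholders, not the relation set the team settled on.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RelationExample:
    sentence: str
    subject: str
    obj: str
    relation: str  # placeholder label, e.g. "alias" for "name : alias"


def alias_examples(sentence: str, subject: str, aliases: List[str]) -> List[RelationExample]:
    """Build one record per alias that actually appears in the sentence."""
    return [RelationExample(sentence, subject, a, "alias") for a in aliases if a in sentence]


sentence = "비디오 어댑터(video adapter), 디스플레이 카드(display card)라고도 부른다."
rows = alias_examples(sentence, "그래픽 카드", ["비디오 어댑터", "디스플레이 카드"])
for row in rows:
    print(row.subject, "-", row.obj, "->", row.relation)
```

Note that the subject (그래픽 카드) does not occur in the example sentence itself, so an annotation scheme that requires both entities inside one sentence would need the sentence completed first.
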
2. ~ Week 1 Friday
  • Define the selected Relations and, based on that Relation set, write guidelines for picking out and annotating Entities in sentences. The KLUE Relation set and guidelines given below may be used as a reference.
  • Submit the guidelines and the Relation map. Feedback on them comes the following week.
3. ~ Week 2 Monday
  • The first round of data annotation is split up by member (no sharing of content at this stage). That is, the whole corpus is divided by sentence count and a share is assigned to each team member.

  • Once every camper has finished their own annotation, the completed sentence (with subj/obj entities)-label pairs are moved to a Google Spreadsheet.
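
Dividing the corpus by sentence count, as described in this step, could look like the sketch below. It assumes contiguous chunks whose sizes differ by at most one; the function name is hypothetical, and the member IDs are the ones listed at the top of the page.

```python
from typing import Dict, List


def assign_sentences(sentences: List[str], members: List[str]) -> Dict[str, List[str]]:
    """Split sentences into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(sentences), len(members))
    assignment = {}
    start = 0
    for i, member in enumerate(members):
        size = base + (1 if i < extra else 0)  # the first `extra` members get one more
        assignment[member] = sentences[start:start + size]
        start += size
    return assignment


members = ["T2075", "T2078", "T2090", "T2137", "T2142", "T2185", "T2250"]
sentences = [f"sentence {i}" for i in range(10)]  # stand-in corpus
shares = assign_sentences(sentences, members)
print([len(shares[m]) for m in members])  # [2, 2, 2, 1, 1, 1, 1]
```
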

4. ~ Week 2 Wednesday
  • Using the Dropdown in the Google Spreadsheet built earlier, every camper annotates 'the Relation extraction sentences they did not create themselves in tagtog'.

  • During this annotation, do not look at other people's tagging or at the creator's tags.

  • Pilot tagging, in which only part of the data (usually about 1/10 of the whole) is tagged, is used to find where team members' understanding differs; the guidelines and the definitions in the relation set are then updated through discussion. During pilot tagging, revising each other's annotations in meetings is also encouraged. During main tagging, annotation should as far as possible follow the guidelines finalized in pilot tagging.

5. ~ Week 2 Thursday
  • Use the RE training and evaluation code from the KLUE lectures to check performance on the data we built ourselves.

  • The quantitative metric for this assignment is the inter-annotator agreement (IAA) obtained in step 4, and the goal is to exceed 0.7 as measured by Fleiss' Kappa.
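
For reference, Fleiss' Kappa compares the observed agreement among annotators with the agreement expected by chance: kappa = (P_bar - P_e) / (1 - P_e). The sketch below is a plain-Python version of the standard formula, not the evaluation code from the lectures; the label names and sample ratings are made up, and every sentence is assumed to be labeled by the same number of annotators.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """ratings[i] holds the labels given to item i by each annotator."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / n_items  # mean observed agreement
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in category_totals.values())  # chance agreement
    if p_e == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


# Made-up example: 4 sentences, 3 annotators each.
ratings = [
    ["alias", "alias", "alias"],
    ["alias", "alias", "part_of"],
    ["part_of", "part_of", "part_of"],
    ["no_relation", "no_relation", "alias"],
]
print(round(fleiss_kappa(ratings), 3))  # 0.455, below the 0.7 target
```
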

Lectures

  • Watch at your own pace