Bonito - juunho/SKCC_LCL GitHub Wiki

1. Bonito

Bonito is an open-source model that converts unannotated text into task-specific training datasets for instruction tuning. With this model, instruction / input / output data for a variety of tasks can be generated from a raw corpus.

The paper and code that this material is based on come from the following sources.

image

As part of this LCL activity, the pipeline code was modified to run on Korean raw corpora (the output may need to be translated). Given a Korean raw corpus as input, datasets can be built for the following 16 tasks.

  1. exqa ("extractive question answering")
  2. mcqa ("multiple-choice question answering")
  3. qg ("question generation")
  4. qa ("question answering without choices")
  5. ynqa ("yes-no question answering")
  6. coref ("coreference resolution")
  7. paraphrase ("paraphrase generation")
  8. paraphrase_id ("paraphrase identification")
  9. sent_comp ("sentence completion")
  10. sentiment ("sentiment")
  11. summarization ("summarization")
  12. text_gen ("text generation")
  13. topic_class ("topic classification")
  14. wsd ("word sense disambiguation")
  15. te ("textual entailment")
  16. nli ("natural language inference")
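
The list above can be kept in code as a lookup table, so that a task name is checked before any generation is started. The following is a minimal sketch: the names `TASK_TYPES` and `resolve_task_type` are hypothetical helpers introduced for illustration and are not taken from the pipeline code; the entries mirror the list above, with spelling normalized.

```python
# Short task names mapped to their full task types, mirroring the list above.
# TASK_TYPES and resolve_task_type are illustrative names (assumptions),
# not identifiers from the pipeline code.
TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def resolve_task_type(name: str) -> str:
    """Return the full task type for a short name or an already-full name."""
    key = name.strip().lower()
    if key in TASK_TYPES:
        return TASK_TYPES[key]
    if key in TASK_TYPES.values():
        return key
    raise ValueError(
        f"unknown task type: {name!r}; expected one of {sorted(TASK_TYPES)}"
    )
```

Calling `resolve_task_type("nli")` returns the full name, and an unknown name raises `ValueError` before any model is loaded.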

The file bonito/testing.py contains examples that build several datasets using a pretrained model based on Mistral-7B. It can be used as a reference and adapted to build additional datasets.
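
As a rough sketch of what such an example might look like, the code below builds one synthetic dataset per task from a list of paragraphs. It follows the usage shown in the upstream Bonito README (`Bonito(...)` and `generate_tasks(...)`) and is not a copy of bonito/testing.py. The checkpoint name `BatsResearch/bonito-v1`, the sampling values, and the helpers `to_rows` and `generate_for_tasks` are assumptions for illustration; the Korean-adapted pipeline may differ.

```python
from typing import Dict, Iterable, List


def to_rows(paragraphs: Iterable[str], context_col: str = "input") -> List[Dict[str, str]]:
    """Strip whitespace and drop empty paragraphs; one row per paragraph."""
    return [{context_col: p.strip()} for p in paragraphs if p.strip()]


def generate_for_tasks(
    paragraphs: Iterable[str],
    task_types: Iterable[str],
    model_name: str = "BatsResearch/bonito-v1",  # assumed upstream checkpoint
    context_col: str = "input",
):
    """Return a dict mapping each task type to its generated dataset."""
    # Heavy dependencies are imported lazily because they need a GPU setup.
    from bonito import Bonito
    from datasets import Dataset
    from vllm import SamplingParams

    corpus = Dataset.from_list(to_rows(paragraphs, context_col))
    model = Bonito(model_name)
    # Sampling values are assumptions taken from the upstream README example.
    params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
    return {
        task: model.generate_tasks(
            corpus,
            context_col=context_col,
            task_type=task,
            sampling_params=params,
        )
        for task in task_types
    }
```

For example, `generate_for_tasks(paragraphs, ["nli", "exqa"])` would return one dataset per task. Only `to_rows` runs without a GPU; the generation step requires the bonito, vllm, and datasets packages to be installed.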