Bonito - juunho/SKCC_LCL GitHub Wiki

1. Bonito

Bonito is an open-source model that converts unannotated text into task-specific training datasets for instruction tuning. With it, you can generate instruction / input / output data for a variety of tasks from a raw corpus.

The paper and code this page draws on are listed below.

Paper: Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation
Model: bonito-v1
Dataset: ctga-v1
Code: nayak-arxiv24-code (for reproducing the experiments in the paper)

As part of this LCL activity, the pipeline code was modified to run on Korean raw corpora (the output may need to be translated). Given a Korean raw corpus as input, it can build datasets for the following 16 task types.

exqa ("extractive question answering")
mcqa ("multiple-choice question answering")
qg ("question generation")
qa ("question answering without choices")
ynqa ("yes-no question answering")
coref ("coreference resolution")
paraphrase ("paraphrase generation")
paraphrase_id ("paraphrase identification")
sent_comp ("sentence completion")
sentiment ("sentiment")
summarization ("summarization")
text_gen ("text generation")
topic_class ("topic classification")
wsd ("word sense disambiguation")
te ("textual entailment")
nli ("natural language inference")
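
When scripting dataset generation, it can help to validate a task name before starting a long run. The following is a minimal sketch that mirrors the list above as a lookup table; `TASK_TYPES` and `resolve_task_type` are hypothetical names introduced here for illustration and are not taken from the repository's code.

```python
# Short task names mapped to their full names, mirroring the list above.
# Both names in this block are illustrative, not part of the repository.
TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def resolve_task_type(name: str) -> str:
    """Return the full task name for a short name; accept full names unchanged."""
    key = name.strip().lower()
    if key in TASK_TYPES:
        return TASK_TYPES[key]
    if key in TASK_TYPES.values():
        return key
    raise ValueError(f"unknown task type: {name!r}")
```

For example, `resolve_task_type("nli")` returns the full name for natural language inference, while an unsupported name raises `ValueError` before any model is loaded.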

The file bonito/testing.py contains examples that build several datasets using an already-trained model based on Mistral-7B. It can be used as a reference and adapted to build further datasets.
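
The flow described above can be sketched roughly as follows. This is not the content of bonito/testing.py; it assumes the upstream Bonito package's API as shown in its README (`Bonito`, `generate_tasks`, and vLLM's `SamplingParams`), the Hugging Face model ID `BatsResearch/bonito-v1`, and a GPU environment. The modified pipeline in this repository may differ. `to_records` and `generate_dataset` are hypothetical helper names.

```python
def to_records(passages, context_col="input"):
    """Wrap raw passages as one-column records, skipping blank ones."""
    return [{context_col: p.strip()} for p in passages if p.strip()]


def generate_dataset(passages, task_type, model_name="BatsResearch/bonito-v1"):
    """Generate a synthetic instruction dataset for one task type.

    Assumes the upstream Bonito API; the model ID is an assumption.
    """
    # Imported lazily because these packages require a GPU environment.
    from bonito import Bonito
    from datasets import Dataset
    from vllm import SamplingParams

    bonito = Bonito(model_name)
    text_dataset = Dataset.from_list(to_records(passages))
    # Sampling values follow the upstream README example.
    sampling_params = SamplingParams(
        max_tokens=256, top_p=0.95, temperature=0.5, n=1
    )
    return bonito.generate_tasks(
        text_dataset,
        context_col="input",
        task_type=task_type,
        sampling_params=sampling_params,
    )
```

A call such as `generate_dataset(korean_passages, "nli")` would then return a dataset for the chosen task type; as noted earlier on this page, the output for Korean input may still need translation.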