
On the Generation of Medical Dialogs for COVID-19

[Motivation]

COVID-19 outbreak → shortage of doctors/professionals available for consultation.

[Contribution]

  • A medical dialog system that can provide COVID-19-related consultations
  • To alleviate overfitting, we develop a multi-task learning approach, which regularizes the data-deficient dialog generation task with a masked token prediction task

[Datasets]

(figure: statistics of the CovidDialog datasets)

[Method]

Process a set of pairs $\{(s_i, t_i)\}$:

  • target $t_i$ is a response from the doctor
  • source $s_i$ is the conversation history: the concatenation of all utterances (from both patient and doctor) before $t_i$
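As a rough illustration (not from the paper; the function and variable names are hypothetical), the following sketch builds such $(s_i, t_i)$ pairs from a toy multi-turn dialog:

```python
def build_pairs(dialog):
    """Build (source, target) pairs from a list of (speaker, utterance) turns.

    For every doctor utterance t_i, the source s_i is the concatenation of
    all utterances (from both patient and doctor) that precede it.
    """
    pairs, history = [], []
    for speaker, utterance in dialog:
        if speaker == "doctor" and history:
            source = " ".join(history)          # conversation history s_i
            pairs.append((source, utterance))   # doctor response t_i
        history.append(utterance)
    return pairs

# Toy example
dialog = [
    ("patient", "I have a dry cough and a fever."),
    ("doctor", "How long have you had these symptoms?"),
    ("patient", "About three days."),
    ("doctor", "Please get a COVID-19 test and isolate at home."),
]
for s, t in build_pairs(dialog):
    print(s, "->", t)
```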

๋ชจ๋ธ: BART Encoder( s ์ธ์ฝ”๋”ฉํ•˜๊ณ ) +decoder( t๋ฅผ ๋ฑ‰๋Š” ์• )

  • Input : s
  • Output : t
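A minimal finetuning sketch using the Hugging Face transformers BART implementation; the checkpoint name `facebook/bart-base` and all training details are illustrative assumptions, not the paper's exact setup:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# One (s, t) pair: conversation history (source) -> doctor response (target)
source = ("patient: I have a dry cough and a fever. "
          "doctor: How long have you had these symptoms? "
          "patient: About three days.")
target = "Please get a COVID-19 test and isolate at home."

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=128).input_ids

# With labels provided, the forward pass returns the token-level cross-entropy
# (generation) loss used for finetuning.
loss = model(**inputs, labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop
```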

Loss

  • the generation loss (g) + the MTP loss (p)
    • (MTP, a masked token prediction task, is added to prevent overfitting)

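A plausible way to write the combined objective, assuming a weighting hyperparameter $\lambda$ on the MTP term (this reconstruction and the notation are mine, not copied from the paper):

$$
\mathcal{L} = \mathcal{L}_g + \lambda \, \mathcal{L}_p, \qquad
\mathcal{L}_g = -\sum_i \log p(t_i \mid s_i), \qquad
\mathcal{L}_p = -\sum_i \sum_{m \in M_i} \log p\big(x_{i,m} \mid \hat{s}_i\big)
$$

where $\hat{s}_i$ is the conversation history $s_i$ with the tokens at positions $M_i$ masked out, $x_{i,m}$ is the original token at masked position $m$, and $\lambda$ balances the two tasks.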

[Experiment]

Transformer

  • The conversation history is fed to the encoder
  • The decoder generates the response

GPT-2

  • Training data: English Reddit dialogs, as described in the DialoGPT paper


Unregularized BART

  • initialized using pretrained BART
  • encoder and decoder are finetuned on CovidDialog
    • During finetuning, no self-supervised regularization is used.

Unregularized BERT-GPT

  • Encoder : initialized using pretrained BERT
  • Decoder : initialized using pretrained GPT-2
  • Finetuned on CovidDialog
    • During finetuning, no self-supervised regularization is used.
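A rough sketch of wiring a pretrained BERT encoder to a pretrained GPT-2 decoder with Hugging Face's `EncoderDecoderModel`; the checkpoint names and special-token handling are assumptions and may differ from the paper's BERT-GPT implementation:

```python
from transformers import EncoderDecoderModel, BertTokenizer, GPT2Tokenizer

# Encoder initialized from pretrained BERT, decoder from pretrained GPT-2;
# the decoder's cross-attention weights are newly initialized and learned
# during finetuning on CovidDialog.
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "gpt2")
enc_tok = BertTokenizer.from_pretrained("bert-base-uncased")
dec_tok = GPT2Tokenizer.from_pretrained("gpt2")

dec_tok.pad_token = dec_tok.eos_token              # GPT-2 has no pad token
model.config.decoder_start_token_id = dec_tok.bos_token_id
model.config.pad_token_id = dec_tok.pad_token_id

history = "patient: I have a dry cough and a fever. doctor: How long have you had them?"
inputs = enc_tok(history, return_tensors="pt")
labels = dec_tok("Please get a COVID-19 test.", return_tensors="pt").input_ids

# Plain generation loss only -- no self-supervised regularization.
loss = model(**inputs, labels=labels).loss
```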

Task adaptive pretraining (TAPT)

  • Starting from an encoder pretrained with BART/BERT on large-scale external corpora, the encoder is further pretrained by predicting masked tokens on the input conversation histories in the CovidDialog datasets (without using the output responses); see the sketch after this list
  • TAPT also performs masked token prediction (MTP) on conversation histories. The difference is that TAPT performs the MTP task and the generation task sequentially, while our method performs the two tasks jointly.
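A rough sketch of the TAPT-style masked token prediction step on conversation histories, here using `BertForMaskedLM` and `DataCollatorForLanguageModeling`; the 15% masking rate and all other details are assumptions for illustration:

```python
from transformers import BertTokenizer, BertForMaskedLM, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Conversation histories from the dialog data (toy examples here);
# the output responses are not used in this step.
histories = [
    "patient: I have a dry cough and a fever.",
    "patient: I returned from a trip last week and lost my sense of smell.",
]
features = [tokenizer(h, truncation=True, max_length=128) for h in histories]

# Randomly mask 15% of the tokens and build masked-LM labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator(features)

# Masked-token prediction loss; backprop and an optimizer step would follow,
# and the adapted encoder is afterwards finetuned on the generation task.
loss = model(**batch).loss
```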

Setting

For pretrained models, we finetune them on the CovidDialog-English dataset for 5 epochs, while for the un-pretrained Transformer, we train it for 50 epochs. We set a checkpoint at the end of every epoch and finally take the one with the lowest perplexity on the validation set as the final model. In response generation, for all models we use beam search with a beam width of 10 during decoding.
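A decoding sketch with beam search (beam width 10) via the Hugging Face `generate` API; the checkpoint and the other generation hyperparameters are placeholders, not the paper's exact values:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Placeholder checkpoint; in practice the model finetuned on CovidDialog would be loaded.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

history = ("patient: I have a dry cough and a fever. "
           "doctor: How long have you had them? patient: Three days.")
inputs = tokenizer(history, return_tensors="pt", truncation=True, max_length=512)

# Beam search with beam width 10, as in the paper's decoding setting.
output_ids = model.generate(**inputs, num_beams=10, max_length=128, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```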

[Result]

Evaluation metrics

  • human evaluation
    • Criteria: medical correctness, relevance to the dialog, amount of medical information, and how doctor-like the response sounds
    • Five medical students served as raters (hm, only five students were used..)
  • Perplexity
  • NIST-n, BLEU-n, METEOR: well suited to evaluating machine translation, but not reliable for evaluating dialogue systems
  • Entropy-n, Dist-n: measure the diversity of the generated responses (see the sketch after this list)
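A small illustrative implementation of Dist-n and Entropy-n (the exact tokenization and normalization used in the paper may differ):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n):
    """Dist-n: number of distinct n-grams divided by the total number of n-grams."""
    grams = [g for r in responses for g in ngrams(r.split(), n)]
    return len(set(grams)) / max(len(grams), 1)

def entropy_n(responses, n):
    """Entropy-n: entropy (in nats) of the empirical n-gram distribution."""
    counts = Counter(g for r in responses for g in ngrams(r.split(), n))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values()) if total else 0.0

responses = [
    "please get a covid test and isolate at home",
    "please stay at home and rest",
    "drink plenty of water and monitor your temperature",
]
print(dist_n(responses, 1), dist_n(responses, 2))
print(entropy_n(responses, 1))
```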