A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics and Benchmark Datasets
Notes on an MRC survey paper, which covers:
1. Analysis of MRC tasks and a proposed classification taxonomy
2. A summary of evaluation metrics
3. A discussion of open issues in MRC research and future research directions
4. Benchmark datasets
1. MRC Tasks
What is Machine Reading Comprehension (MRC)?
The task of understanding a given passage (context) and inferring the answer to a given query/question about it.
- Used in search engines and dialogue systems (chatbots):
- STEP 1. Retrieval: find the passages that contain information relevant to the query.
- STEP 2. Read: read those passages to find the answer.
1.1 Definition
Goal: learn a predictor $f$ such that $$a = f(p,q)$$
- Training examples: ${(p_i,q_i,a_i)}$
- $p$ : a passage of text
- $q$ : a question about passage $p$
- $a$ : the answer
1.2. MRC vs. QA (Question Answering)
- Most MRC tasks pose textual questions, so they look similar to QA.
- However, neither task strictly contains the other:
- MRC always provides a context and asks questions that must be answered from that context.
- Some QA tasks do not require reading a context at all: they assume knowledge that people commonly already have (common sense, mathematical facts, etc.), so the answer follows from rules or world knowledge rather than from a passage.
1.3. Classification of MRC Tasks
- 1.3.1. Type of corpus
- 1.3.2. Type of questions
- 1.3.3. Source of answers
- 1.3.4. Type of answers
Definition of each category
[Notations]
- $V$ : Pure textual vocabulary
- $M$ : Multi-modal dataset (consisting of images or other non-text information)
- $P$ = { $C_i,Q_i,A_i$ } $_{i=1}^{n}$ : corpus
- $C_i$ = { $c_0, c_1,..., c_{l_{ci}}$ } : i-th context
- $Q_i$ = { $q_0, q_1,..., q_{l_{qi}}$ }: i-th question
- $A_i$ = { $a_0, a_1,..., a_{l_{ai}}$ }: answer to question $Q_i$ according to context $C_i$
- $l_{ci}, l_{qi}, l_{ai}$ : the lengths of the i-th context $C_i$, question $Q_i$, and answer $A_i$
1.3.1. Type of Corpus
(1) Multi-modal corpus : entities in the corpus consist of both text and images
- $P \cap V \neq \emptyset$ and $P \cap M \neq \emptyset$
(2) Textual corpus
- $P \cap V \neq \emptyset$ and $P \cap M = \emptyset$
1.3.2. Type of Questions
(1) Cloze form
- A blank (placeholder) is inserted into the text, and the task is to fill it with the appropriate answer (an image, word, or phrase).
- The question is a declarative or imperative sentence.
- (Figure: a multi-modal cloze-style question on the left, a textual cloze question on the right)
- Given the context $C$ = { $c_0, c_1,...,c_j...c_{j+n},...,c_{lc}$ } ( $0 \le j \le j+n \le l_c$, $c_j \in V \cup M$ )
- $A$ = { $c_j...c_{j+n}$ } : a short span in context $C$
- Replacing the span $A$ in context $C$ with a placeholder $X$ forms the cloze question $Q$ for context $C$:
- $Q$ = { $c_0, c_1,...,X,...,c_{lc}$ }
- $A$ = { $c_j...c_{j+n}$ } : the answer to question $Q$
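A minimal sketch of this construction, with a token list standing in for $C$ (the function name and example are illustrative, not from the paper):

```python
def make_cloze(context: list[str], j: int, n: int, placeholder: str = "X"):
    """Form a cloze question by replacing the span context[j : j+n+1]
    with a placeholder, returning (question, answer)."""
    assert 0 <= j <= j + n < len(context)
    answer = context[j : j + n + 1]
    question = context[:j] + [placeholder] + context[j + n + 1 :]
    return question, answer

# Usage:
# q, a = make_cloze("Seoul is the capital of Korea".split(), j=5, n=0)
# q == ['Seoul', 'is', 'the', 'capital', 'of', 'X'];  a == ['Korea']
```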
(2) Natural form
- A complete sentence that follows natural-language grammar, with no placeholder.
- Mostly interrogative sentences.
- Exception, e.g., "please find the correct statement from the following options."
(3) Synthesis form
- A combination of words
- Not a complete sentence that fully conforms to natural-language grammar
1.3.3. Type of Answers
(1) Multiple-choice form
- Given the candidate answers $A$ = { $A_1,...A_j,..., A_n$ }
- $n$ denotes the number of candidate answers for each question
- The goal of the task is to find the right answer $A_j$ ( $1 \le j \le n$ ) from $A$
(2) Natural form (free-response)
- The answer is a natural word, phrase, sentence or image
1.3.4. Source of Answers
(1) Spans
- The answer is a span extracted from within the context.
(2) Free-form
- A free-form answer may be any word, phrase, sentence, or even image (not necessarily drawn from the context).
1.4. Statistics of MRC Tasks
"A fundamental characteristic of human language understanding is multimodality. At present, the proportion of multi-modal reading comprehension tasks is still small, about 10.53% ใ "
2. Evaluation Metrics
2.1. Accuracy
$$Accuracy = {M \over N}$$
- $N$ : the number of questions in the MRC task (each question corresponds to one correct answer)
- $M$ : the number of questions that the system answers correctly
2.2. Exact Match
- Used when the answer is a sentence or phrase -> applicable to span prediction tasks.
- A prediction is correct only when every word matches a gold answer exactly.
- Not used for multiple-choice tasks, because there is no situation where the answer includes only a portion of the correct answer.
- Note that registering many gold answers per question (e.g., accepting "bdml lab", "bdml labs", and "the bdml research lab" as equally valid) raises the score, since a prediction only needs to match one of them.

$$EM = {M \over N}$$
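A minimal sketch of EM over multiple gold answers (normalization details vary by benchmark; this follows the common lowercase/whitespace convention and is not the survey's code):

```python
def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True iff the normalized prediction equals any normalized gold answer."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) in {norm(g) for g in gold_answers}

def em_score(predictions: list[str], gold: list[list[str]]) -> float:
    # EM = M / N: the fraction of questions answered exactly right.
    return sum(exact_match(p, g) for p, g in zip(predictions, gold)) / len(gold)
```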
2.3. Precision ,Recall & F1-score
- Precision: of the items predicted positive, the fraction that are actually positive
$$Precision = {TP \over TP + FP}$$
- Recall: of the items that are actually positive, the fraction predicted to be positive
$$Recall = {TP \over TP + FN}$$
- F1 score: the harmonic mean of precision and recall
$$F1 = {2 \over {1 \over Precision} + {1 \over Recall}}$$
2.5.1. Token-level
- $TP_T$ : the tokens shared between the predicted answer and the correct answer
- $FP_T$ : the tokens that appear in the predicted answer but not in the correct answer
$$Precision_{T} = {TP_T \over TP_T + FP_T}$$
- e.g., gold answer: "monkey leg", predicted answer: "monkey shoulder" -> the prediction shares only the token "monkey" with the gold answer, so Precision = 1/2
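A minimal sketch of token-level precision/recall/F1 in the spirit of the SQuAD evaluation script (not the survey's own code):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = Counter(pred_tokens) & Counter(gold_tokens)  # shared tokens = TP
    tp = sum(overlap.values())
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)  # TP / (TP + FP)
    recall = tp / len(gold_tokens)     # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

# token_f1("monkey shoulder", "monkey leg") -> P = 1/2, R = 1/2, F1 = 0.5
```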
2.5.2. Question-level
- The question-level precision represents the average percentage of answer overlaps (not token overlap) between all the correct answers and all the predicted answers in a task
- $TP_Q$ : the answers shared between all predicted answers and all correct answers
$$Precision_{Q} = {TP_Q \over TP_Q + FP_Q}$$
- e.g., gold answers: "bdml lab", "bdml", "big data mining lab", "big data mining"; predicted answers: "bdml lab", "bdml", "mcc" -> 2 of the 3 predicted answers appear among the gold answers, so Precision = 2/3
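The same quantity as a simple set computation (a sketch; the answers are the illustrative ones above):

```python
def question_precision(predicted: set[str], gold: set[str]) -> float:
    # TP_Q = answers shared by both sets; FP_Q = predicted answers not in gold.
    return len(predicted & gold) / len(predicted)

# question_precision({"bdml lab", "bdml", "mcc"},
#                    {"bdml lab", "bdml", "big data mining lab", "big data mining"})
# -> 2/3
```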
2.6 ROUGE & BLEU & METEOR
- Metrics originally used to evaluate generation tasks such as summarization and machine translation; they are also used for MRC evaluation.
2.6.1. ROUGE
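ROUGE has several variants; ROUGE-L, based on the longest common subsequence (LCS), is a common choice. Its standard definition (added here for reference, not quoted from these notes), for a candidate $X$ of length $n$ and a reference $Y$ of length $m$:

$$R_{lcs} = {LCS(X,Y) \over m}, \quad P_{lcs} = {LCS(X,Y) \over n}, \quad F_{lcs} = {(1+\beta^2) R_{lcs} P_{lcs} \over R_{lcs} + \beta^2 P_{lcs}}$$

where $\beta$ controls how much recall is favored over precision.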
2.6.2. BLEU
- $P_n$ : modified n-gram precision
- $w_n$ : weight of n-gram -> summing to 1
- BP (brevity penalty) : compensates for overly short candidates. If the reference is, say, a 10-word sentence and the candidate is much shorter, n-gram precision alone would still score it highly, so a penalty is applied: for a too-short candidate the exponent is negative, giving $e^{(\text{negative})} < 1$.
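Putting the pieces together, the standard BLEU definition (with $r$ the reference length and $c$ the candidate length):

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right), \quad BP = \begin{cases} 1 & c > r \\ e^{1 - r/c} & c \le r \end{cases}$$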
2.6.3. Meteor
- Unlike BLEU, which uses only precision, METEOR also incorporates recall, compensating for BLEU's weakness.
- The F-mean is weighted (via the parameter α) so that recall counts much more than precision.
- Example penalty inputs: $ch = 3$ matched chunks, $m = 6$ matched unigrams.
- The parameters α, β, and γ are tuned to maximize correlation with human judgment.
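For reference, the original METEOR formulation weights recall nine times as much as precision and applies a fragmentation penalty computed from the matched chunks:

$$F_{mean} = {10 \cdot P \cdot R \over R + 9P}, \quad Penalty = 0.5 \cdot \left({ch \over m}\right)^{3}, \quad METEOR = F_{mean} \cdot (1 - Penalty)$$

Plugging in the example above: $ch = 3$ chunks over $m = 6$ matched unigrams gives $Penalty = 0.5 \cdot (3/6)^3 = 0.0625$. The later parameterized version replaces the constants with the tunable α, β, and γ.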
2.9 HEQ (Human Equivalence Score)
- Designed for tasks in which questions have multiple valid answers, where F1 may be misleading.
- $N$ : the total number of questions; $M$ : the number of questions on which the system's F1 reaches or exceeds human F1.

$$HEQ = {M \over N}$$
2.10 Statistics of Evaluation Metrics
3. Open Issues in MRC Research and Future Research Directions
4. Benchmark Dataset
This section summarizes properties of the datasets beyond the taxonomy defined above, for example:
- 4.1. Dataset size
- 4.4. Context type (paragraph, document, multi-paragraph, etc.)
4.9. Characteristics of Datasets
4.9.2. MRC with Unanswerable Questions
Recognizing that an unanswerable question cannot be answered, and saying so, is also an ability required of MRC models.
4.9.3. Multi-hop Reading Comprehension
Compared with single-passage MRC, these tasks require inferring the answer across multiple passages (answers that can only be reached by reasoning step by step over scattered clues).
- A deep cascade model for multi-document reading comprehension (AAAI 2019)
- Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification (ACL 2018)
- Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences (NAACL 2018)
4.9.4. Multi-modal Reading Comprehension
Some questions cannot be answered from text alone; for example, some require understanding both an image and the accompanying text.
Multi-modal machine reading comprehension is a dynamic interdisciplinary field that has great application potential.
4.9.5. Reading Comprehension Require Commonsense or World knowledge
Unlike traditional MRC, where the answer is found only within the given passage, datasets requiring commonsense have begun to appear. The survey does not summarize approaches for this area.
4.9.6. Complex Reasoning MRC
Datasets whose purpose is to verify that the model actually understands the context and performs reasoning.
4.9.7. Conversational Reading Comprehension ⭐️
Conversational machine reading comprehension (CMRC): according to the authors, it has recently emerged as a new research hotspot in NLP.
It is the task of extracting information from a series of conversations (something like twenty questions leading up to a final answer?).
It is worth checking whether the conversation here consists of QA pairs.
Just as a person keeps asking questions to obtain information for a specific purpose, the model must be able to extract the necessary information from a continuous conversation.
In this setting, the model should be able to:
- give an appropriate answer to a new utterance (ChatGPT?)
- ask meaningful new questions
Original text:
It is a natural way for human beings to exchange information through a series of conversations. In the typical MRC tasks, different question and answer pairs are usually independent of each other. However, in real human language communication, we often achieve an efficient understanding of complex information through a series of interrelated conversations. Similarly, in human communication scenarios, we often ask questions on our own initiative, to obtain key information that helps us understand the situation. In the process of conversation, we need to have a deep understanding of the previous conversations in order to answer each other's questions correctly or ask meaningful new questions. Therefore, in this process, historical conversation information also becomes a part of the context.
In recent years, conversational machine reading comprehension (CMRC) has become a new research hotspot in the NLP community, and there emerged many related datasets, such as CoQA [49], QuAC [68], DREAM [83] and ShARC [39].
4.9.8. Domain-specific Datasets
Datasets whose contexts come from a specific domain, for example science or clinical/medical reports.
- CliCR: a Dataset of Clinical Case Reports for Machine Reading Comprehension (NAACL 2018)
4.9.9. MRC with Paraphrased Paragraph
Datasets containing passages that express the same content rephrased in different words.
4.9.10. Large-scale MRC Dataset
Large-scale datasets make it easier to train deep learning models.
4.9.11. MRC dataset for Open-Domain QA
The open-domain question answering was originally defined as finding answers in collections of unstructured documents.