A Survey on Machine Reading Comprehension: Tasks, Evaluation Metrics and Benchmark Datasets

Notes on the MRC survey paper (link)

The survey makes four contributions:

1. Analysis of MRC tasks and a proposed classification taxonomy
2. A summary of evaluation metrics
3. A discussion of open issues in MRC research and future research directions
4. A review of benchmark datasets


1. MRC Tasks

๊ธฐ๊ณ„๋…ํ•ด(Machine Reading Comprehension)?

The task of understanding a given passage (Context) and inferring the answer (Answer) to a given query (Query/Question).

  • Used in search engines and dialogue systems (chatbots)

STEP 1. Retrieval: find the passage that contains information relevant to the query. STEP 2. Read: read that passage to find the answer.

1.1 Definition

Goal: learn a predictor $f$ such that $$a = f(p,q)$$

  • Training examples $\{(p_i, q_i, a_i)\}$
  • $p$ : a passage of text
  • $q$ : a question about passage $p$
  • $a$ : the answer
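
As an informal illustration of the predictor $a = f(p, q)$, here is a minimal sketch using the Hugging Face `transformers` question-answering pipeline (the passage and question strings are made-up examples, not from the survey):

```python
# A minimal sketch of a = f(p, q) for extractive MRC, assuming the
# Hugging Face `transformers` library (pip install transformers) is available.
from transformers import pipeline

f = pipeline("question-answering")  # loads a default SQuAD-finetuned model

p = "The lab studies machine reading comprehension and question answering."
q = "What does the lab study?"

a = f(question=q, context=p)        # returns a dict with the predicted span
print(a["answer"], a["score"])
```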

1.2. MRC vs. QA(Question Answering)

  • Most MRC tasks are textual question tasks, so they take a form similar to QA.
  • However, neither task fully contains the other.
  • MRC always has a given context, and the task is to answer questions related to it.
  • Some QA tasks do not require reading a context to obtain the answer (they assume information people commonly know, such as commonsense or everyday facts, so some of them solve rule-based or commonsense problems).

1.3. Classification of MRC Tasks

  • 1.3.1. Type of corpus
  • 1.3.2. Type of questions
  • 1.3.3. Source of answers
  • 1.3.4. Type of answers

Definition of each category

[Notations]

  • $V$ : pure textual vocabulary
  • $M$ : multi-modal data (images or other non-text information)
  • $P = \{C_i, Q_i, A_i\}_{i=1}^{n}$ : corpus
  • $C_i = \{c_0, c_1, \ldots, c_{l_{ci}}\}$ : $i$-th context
  • $Q_i = \{q_0, q_1, \ldots, q_{l_{qi}}\}$ : $i$-th question
  • $A_i = \{a_0, a_1, \ldots, a_{l_{ai}}\}$ : answer to question $Q_i$ according to context $C_i$
  • $l_{ci}, l_{qi}, l_{ai}$ : the lengths of the $i$-th context $C_i$, question $Q_i$, and answer $A_i$

1.3.1. Type of Corpus

(1) Multi-modal corpus : entities in the corpus consist of both text and images

  • $P \cap V \neq \emptyset$ and $P \cap M \neq \emptyset$

(2) Textual corpus : entities in the corpus are text only

  • $P \cap V \neq \emptyset$ and $P \cap M = \emptyset$
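
A toy sketch of these membership conditions (the sets and items below are hypothetical, chosen only to illustrate the definitions):

```python
# A toy check of the corpus-type conditions above, with hypothetical sets.
V = {"word", "phrase", "sentence"}          # textual vocabulary items
M = {"img_001.png", "diagram.svg"}          # multi-modal (non-text) items

def corpus_type(P: set) -> str:
    has_text = bool(P & V)                  # P ∩ V ≠ ∅
    has_multimodal = bool(P & M)            # P ∩ M ≠ ∅
    if has_text and has_multimodal:
        return "multi-modal"
    if has_text:
        return "textual"
    return "undefined"

print(corpus_type({"word", "img_001.png"}))  # multi-modal
print(corpus_type({"word", "phrase"}))       # textual
```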

1.3.2. Type of Questions

(1) Cloze form

  • A blank (placeholder) is made in the text, and the appropriate answer (an image, word, or phrase) must be found to fill it.
  • Declarative or imperative sentences
    • In the paper's figure: multi-modal cloze-style question (left) / textual cloze question (right)
  • Given the context $C = \{c_0, c_1, \ldots, c_j, \ldots, c_{j+n}, \ldots, c_{l_c}\}$ (where $0 \le j \le l_c$, $0 \le n \le l_c - 1$, $c_j \in V \cup M$)
  • $A = \{c_j, \ldots, c_{j+n}\}$ : a short span in context $C$
    • Replacing the span $A$ in context $C$ with a placeholder $X$ forms the cloze question $Q$ for context $C$.
  • $Q = \{c_0, c_1, \ldots, X, \ldots, c_{l_c}\}$
  • $A = \{c_j, \ldots, c_{j+n}\}$ : the answer to question $Q$
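
A toy sketch of this construction (the token list and indices are made up for illustration):

```python
# Forming a cloze question: replace the span A = c_j ... c_{j+n} in the
# context C with a placeholder X (indices here are hypothetical).
C = ["the", "lab", "studies", "machine", "reading", "comprehension"]
j, n = 3, 2

A = C[j : j + n + 1]                 # answer span
Q = C[:j] + ["X"] + C[j + n + 1 :]   # cloze question

print(" ".join(Q))   # "the lab studies X"
print(" ".join(A))   # "machine reading comprehension"
```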

(2) Natural form

  • A complete, grammatically well-formed sentence with no placeholder
  • Mostly interrogative sentences
    • Exception: e.g., "please find the correct statement from the following options."

(3) Synthesis form

A combination of words, not a complete sentence that fully conforms to natural-language grammar.

1.3.3. Type of Answers

(1) Multi-choices form (multiple choice)

  • Given the candidate answers $A = \{A_1, \ldots, A_j, \ldots, A_n\}$
    • $n$ denotes the number of candidate answers for each question
  • The goal of the task is to find the right answer $A_j$ $(1 \le j \le n)$ from $A$

(2) Natural form (open-ended, descriptive answers)

  • The answer is a natural word, phrase, sentence or image

1.3.4. Source of Answers

(1) Spans

  • ๋‹ต์„ Context ๋‚ด์—์„œ ์ถ”์ถœํ•˜๋ฉด span

(2) Free-form

  • A free-form answer may be any phrase, word, or even an image (not necessarily taken from the context).

1.4. Statistics of MRC Tasks

"A fundamental characteristic of human language understanding is multimodality. At present, the proportion of multi-modal reading comprehension tasks is still small, about 10.53% ใ…  "

2. Evaluation Metrics

2.1. Accuracy

$$Accuracy = {M \over N}$$

  • $N$ : the total number of questions in the MRC task (each question corresponds to one correct answer)
  • $M$ : the number of questions that the system answers correctly

2.2. Exact Match

  • Used when the answer is a sentence or phrase -> applicable to span prediction tasks
  • All words must match for a prediction to count as correct.
  • Not used in multi-choice tasks, because there is no situation where a prediction contains only a portion of the correct answer.
  • Preparing many valid reference answers for each question raises the score.
    • e.g., "bdml research lab", "bdml lab", "the bdml lab"

$$EM = {M \over N}$$
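
A minimal sketch of SQuAD-style Exact Match with answer normalization (the normalization steps below are the common lowercase/punctuation/article cleanup; the strings are the example above):

```python
import re
import string

# Normalize both strings (lowercase, drop punctuation and articles,
# collapse whitespace), then compare against every reference answer.
def normalize(s):
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, references):
    return float(any(normalize(prediction) == normalize(r) for r in references))

print(exact_match("The bdml lab.", ["bdml lab", "bdml laboratory"]))  # 1.0
```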

2.3. Precision, Recall & F1-score

  • Precision: among the items predicted as True, the fraction that are actually True $$Precision = {TP \over TP + FP}$$

  • Recall: among the items that are actually True, the fraction predicted as True $$Recall = {TP \over TP + FN}$$

  • F1 score: the harmonic mean of Precision and Recall $$F1 = {2 \over {1 \over Precision} + {1 \over Recall}}$$

2.5.1. Token-level

  • $TP_T$ : the number of tokens shared between the predicted answer and the correct answer
  • $FP_T$ : the number of tokens that appear in the predicted answer but not in the correct answer $$Precision_{T} = {TP_T \over TP_T + FP_T}$$
    • e.g., correct answer: "Wooseok's leg", predicted answer: "Wooseok's shoulder" -> Precision = 1/2 (one of the two predicted tokens, "Wooseok's", appears in the correct answer)
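
A minimal sketch of token-level precision/recall/F1 for a single QA pair, assuming simple whitespace tokenization (real benchmarks normalize the text first):

```python
from collections import Counter

# Token-level P/R/F1: tokens shared between prediction and gold are TP.
def token_prf(prediction, reference):
    pred, gold = prediction.split(), reference.split()
    tp = sum((Counter(pred) & Counter(gold)).values())
    if tp == 0:
        return 0.0, 0.0, 0.0
    precision = tp / len(pred)   # TP / (TP + FP)
    recall = tp / len(gold)      # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(token_prf("Wooseok's shoulder", "Wooseok's leg"))  # (0.5, 0.5, 0.5)
```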

2.5.2. Question-level

  • Question-level precision represents the average percentage of answer overlaps (not token overlaps) between all the correct answers and all the predicted answers in a task
  • $TP_Q$ : the number of answers shared between all predicted answers and all correct answers $$Precision_{Q} = {TP_Q \over TP_Q + FP_Q}$$
  • e.g., correct answers: "bdml lab", "bdml", "big data mining lab", "big data mining laboratory"; predicted answers: "bdml lab", "bdml", "mcc" -> Precision = 2/3
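
A toy sketch of question-level precision, counting whole-answer overlaps rather than token overlaps (the answer strings are the translated example above):

```python
# Question-level precision: whole predicted answers that also appear in
# the gold answer set count as TP_Q.
gold = {"bdml lab", "bdml", "big data mining lab", "big data mining laboratory"}
pred = {"bdml lab", "bdml", "mcc"}

tp_q = len(pred & gold)          # 2 shared answers
precision_q = tp_q / len(pred)   # TP_Q / (TP_Q + FP_Q) = 2/3
print(round(precision_q, 3))     # 0.667
```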

2.6 ROUGE & BLEU & Meteor

  • Metrics used to evaluate generation models for tasks such as summarization and machine translation; they are also used for MRC evaluation.

2.6.1. Rouge

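The formula figures are omitted here. For reference, the standard ROUGE-N definition, a recall-oriented n-gram overlap between the candidate and the reference texts (this is the textbook formula, not reproduced from the survey's figure):

$$ROUGE\text{-}N = {\sum_{S \in References} \sum_{gram_n \in S} Count_{match}(gram_n) \over \sum_{S \in References} \sum_{gram_n \in S} Count(gram_n)}$$

ROUGE-L is the variant based on the longest common subsequence (LCS) between the candidate and a reference.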

2.6.2. BLEU

  • $P_n$ : modified n-gram precision

  • $w_n$ : the weight of each n-gram -> the weights sum to 1
  • $BP$ : a brevity penalty that corrects for over-short candidates. If the reference is a sentence of more than 10 words but the candidate is very short, precision alone would give a high score, so a penalty is applied: when the candidate is too short, $BP = e^{(\text{negative})} < 1$.
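
Putting these together, the standard BLEU formula (the textbook form, given here because the original formula images are missing) is:

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log P_n\right)$$

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where $c$ is the candidate length and $r$ is the effective reference length.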

2.6.3. Meteor

Unlike BLEU, which uses only precision, METEOR also incorporates recall, compensating for BLEU's weakness.

The parameter α is set so that recall receives more weight than precision.

  • $ch$ : the number of chunks (3 in the worked example)
  • $m$ : the number of matched unigrams (6 in the worked example)
  • The parameters $\alpha$, $\beta$, and $\gamma$ are tuned to maximize correlation with human judgment.
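
For reference, the parameterized METEOR score in its standard form (reconstructed from the METEOR literature since the formula images are missing; in this parameterization, $\alpha$ close to 1 shifts the weighted mean toward recall):

$$F_{mean} = {P \cdot R \over \alpha \cdot P + (1 - \alpha) \cdot R}$$

$$Penalty = \gamma \cdot \left({ch \over m}\right)^{\beta}$$

$$METEOR = (1 - Penalty) \cdot F_{mean}$$

With the example values $ch = 3$ and $m = 6$, the penalty is $\gamma \cdot 0.5^{\beta}$.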

2.9 HEQ (Human Equivalence Score)

  • For questions with multiple valid answers, F1 may be misleading; HEQ instead measures the fraction of questions on which the system performs at least as well as humans. $$HEQ = {M \over N}$$
  • $N$ : the total number of questions
  • $M$ : the number of questions for which the system's F1 matches or exceeds human F1

2.10 Statistics of Evaluation Metrics

3. Open Issues in MRC Research and Future Research Directions

4. Benchmark Dataset

This section describes additional properties of the datasets beyond the taxonomy proposed earlier, for example:

  • 4.1. Dataset size
  • 4.4. Context type (paragraph, document, multi-paragraph, etc.)

4.9. Characteristics of Datasets

4.9.2. MRC with Unanswerable Questions

Stating that an unanswerable question cannot be answered is also a capability required of MRC models.

4.9.3. Multi-hop Reading Comprehension

๋‹จ์ผ ์ง€๋ฌธ๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ, ๋ณด๋‹ค ์—ฌ๋Ÿฌ ์ง€๋ฌธ์„ ํ†ตํ•œ ์ •๋‹ต ์ถ”๋ก ์„ ์š”๊ตฌํ•œ๋‹ค. (๋‹จ์„œ๋“ค์„ ํ†ตํ•ด ๋‹จ๊ณ„์ ์œผ๋กœ ์ถ”๋ก ํ•ด์•ผ ์•Œ ์ˆ˜ ์žˆ๋Š” ์ •๋‹ต)

  • A deep cascade model for multi-document reading comprehension (AAAI 2019)
  • Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification (ACL 2018)
  • Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences (NAACL 2018)

4.9.4. Multi-modal Reading Comprehension

When only text is given, some problems lack the necessary information. For example, some questions require understanding both an image and the text.

Multi-modal machine reading comprehension is a dynamic interdisciplinary field that has great application potential.

4.9.5. Reading Comprehension Requiring Commonsense or World Knowledge

Unlike traditional MRC, where the answer is found only within the given passage, datasets requiring commonsense have begun to appear. The survey does not describe approaches for this area.

4.9.6. Complex Reasoning MRC

Datasets designed to test whether models actually understand the context and perform reasoning.

4.9.7. Conversational Reading Comprehension โญ๏ธ

Conversational machine reading comprehension (CMRC). The authors note that it has recently emerged as a new research hotspot in the NLP community.

The task is to obtain information from a series of conversations. (Like twenty questions leading up to the final answer?)

It is worth checking whether the conversation here consists of QA pairs.

Just as a person keeps asking questions to obtain information about a specific goal, the model should be able to extract the necessary information from a continuous conversation.

Here, what the model can do is:

  • give an appropriate answer to a new conversation turn (ChatGPT?)
  • ask meaningful new questions

Original text:

It is a natural way for human beings to exchange information through a series of conversations. In the typical MRC tasks, different question and answer pairs are usually independent of each other. However, in real human language communication, we often achieve an efficient understanding of complex information through a series of interrelated conversations. Similarly, in human communication scenarios, we often ask questions on our own initiative, to obtain key information that helps us understand the situation. In the process of conversation, we need to have a deep understanding of the previous conversations in order to answer each otherโ€™s questions correctly or ask meaningful new questions. Therefore, in this process, historical conversation information also becomes a part of the context.

In recent years, conversational machine reading comprehension (CMRC) has become a new research hotspot in the NLP community, and there emerged many related datasets, such as CoQA [49], QuAC [68], DREAM [83] and ShARC [39].

4.9.8. Domain-specific Datasets

Datasets whose contexts belong to a specific domain, for example science or medical case reports.

  • CliCR: a Dataset of Clinical Case Reports for Machine Reading Comprehension (NAACL 2018)

4.9.9. MRC with Paraphrased Paragraph

๊ฐ™์€ ๋‚ด์šฉ์„ ๋‹ค๋ฅธ ๋ง๋กœ ๋ฐ”๊ฟ”์„œ ํ‘œํ˜„ํ•œ ์Œ์„ ํฌํ•จํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹์ด๋‹ค.

4.9.10. Large-scale MRC Dataset

Large-scale datasets are well suited to training deep learning models.

4.9.11. MRC dataset for Open-Domain QA

The open-domain question answering was originally defined as finding answers in collections of unstructured documents.