Adversarial Examples For Evaluating Reading Comprehension Systems - USC-LHAMa/CSCI544_Project GitHub Wiki
Introduction/TLDR
This paper tests whether reading comprehension systems can answer questions about paragraphs that contain adversarially inserted sentences, using the SQuAD dataset. These adversarial sentences are generated to confuse the model but do not change the correct answer. Under this adversarial setting, the accuracy of 16 published models drops from an average of 75% F1 score to 36%.
Introduction
The adversarial examples target model overstability, meaning the inability of a model to distinguish a sentence that actually answers the question from a sentence that merely has words in common with the question.
Tasks and models
In development and testing, this paper uses two published model architectures, BiDAF and Match-LSTM, both of which are deep learning architectures. Validation was done on 12 other published models (not going to list those here... too many).
Adversarial Evaluation
Models that rely on superficial cues without real language understanding can still do well on F1 score. However, they cannot go beyond those cues: many models perform well on a straightforward question but get confused when adversarial distracting sentences are added to the paragraph.
Concatenative Adversaries
Concatenative adversaries add a new sentence to the end of a paragraph and leave the question and answer unchanged. Valid adversarial examples are those where the inserted sentence does not contradict the correct answer; such sentences are considered compatible with (p, q, a), where p is the paragraph, q is the question, and a is the answer. Current models are bad at distinguishing these sentences from sentences that actually address the question, which means they suffer from overstability rather than oversensitivity: the model cannot tell a sentence that truly answers the question from one that merely shares words with it.
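As a rough illustration, here is a minimal sketch of how a concatenative adversarial example could be assembled. The `is_compatible` check is a simplified placeholder for the paper's compatibility requirement, not its actual test:

```python
# Minimal sketch of a concatenative adversarial example, assuming string inputs.
# `is_compatible` is a crude stand-in for the compatibility requirement
# (the inserted sentence must not contradict the answer a); here it only checks
# that the distractor does not literally contain the gold answer.

def is_compatible(p: str, q: str, a: str, s: str) -> bool:
    """Very rough proxy: the distractor s must not contain the answer string."""
    return a.lower() not in s.lower()

def make_concatenative_example(p: str, q: str, a: str, s: str):
    """Append distractor s to paragraph p; question q and answer a stay unchanged."""
    if not is_compatible(p, q, a, s):
        return None
    return p + " " + s, q, a
```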
Two types of concatenative adversaries are ADDSENT and ADDANY.
ADDSENT
Adds grammatical sentences that look similar to the question. Four steps to generating such a sentence (a minimal sketch follows the list):
- Apply semantics-altering perturbations to the question, e.g. replacing nouns and adjectives with antonyms
- Create a fake answer that has the same type as the original answer
- Combine the altered question and fake answer into declarative form. For example, "What ABC division handles domestic television distribution?" triggers a rule that converts questions of the form "what/which NP VP?" into "The NP of [Answer] VP"
- Fix errors in these sentences via crowdsourcing and filter out the bad sentences.
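A rough sketch of steps 1 and 3 under simplifying assumptions: NLTK's WordNet for antonyms and a single hand-written what/which rewrite rule. The paper uses a fuller rule set, typed fake answers, and crowdsourced cleanup, all omitted here.

```python
# Rough sketch of ADDSENT steps 1 and 3: (1) alter the question's semantics by
# swapping in WordNet antonyms, and (3) convert the altered question plus a
# fake answer into a declarative sentence with one toy what/which rule.
# Assumes nltk with the WordNet corpus downloaded (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def antonym_substitute(word):
    """Return a WordNet antonym of `word` if one exists, otherwise the word itself."""
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            if lemma.antonyms():
                return lemma.antonyms()[0].name().replace("_", " ")
    return word

def perturb_question(question):
    """Step 1 (naive): replace every word that has a WordNet antonym."""
    return " ".join(antonym_substitute(tok) for tok in question.split())

def to_declarative(altered_question, fake_answer):
    """Step 3 (one toy rule): 'what/which NP VP?' -> 'The NP of [fake answer] VP.'"""
    tokens = altered_question.rstrip("?").split()
    if tokens and tokens[0].lower() in {"what", "which"}:
        np, vp = tokens[1], " ".join(tokens[2:])  # crude one-token NP split
        return f"The {np} of {fake_answer} {vp}."
    return f"{fake_answer}: {altered_question}"
```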
ADDANY
Adds arbitrary sequences of English words, making the paragraph more confusing for models. The goal is to choose any sequence of d words, regardless of grammaticality. ADDANY requires more model access than ADDSENT, and the sentences generated by this search procedure may contradict the original answer.
Algorithm (a code sketch appears below):
- Initialize words w_1,...,w_d randomly from a list of common English words
- Run 6 epochs of local search, each of which iterates over the indices {1,...,d} in a random order.
- For each i, randomly generate a set of candidate words W as the union of 20 randomly sampled common words and all words in q.
- For each candidate x in W, generate the sentence with x in the i-th position and w_j in the j-th position for each j != i.
- Try adding each such sentence to the paragraph and query the model for its predicted probability distribution over answers.
- Update w_i to the x that minimizes the expected value of the F1 score over the model's output distribution.
This experiment also used a variant of ADDANY called ADDCOMMON, which only adds common words.
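A compact sketch of the search loop, assuming access to a hypothetical `expected_f1` callable that queries the model; names and defaults such as d=10 are illustrative, not taken from the text above.

```python
# Sketch of the ADDANY local search. `expected_f1(paragraph)` is a hypothetical
# callable standing in for a query to the model: given the paragraph with the
# distractor appended, it should return the expected F1 of the model's answer
# distribution against the gold answer.
import random

def addany(paragraph, question, expected_f1, common_words, d=10, epochs=6):
    words = [random.choice(common_words) for _ in range(d)]   # random initialization
    q_words = question.rstrip("?").split()
    for _ in range(epochs):                                   # 6 epochs of local search
        for i in random.sample(range(d), d):                  # indices in random order
            # Candidate set W: 20 random common words plus all words in q.
            # For the ADDCOMMON variant, drop the question words here.
            candidates = set(random.sample(common_words, 20)) | set(q_words)
            best_x, best_score = None, None
            for x in candidates:
                trial = words[:i] + [x] + words[i + 1:]       # x in position i, w_j elsewhere
                score = expected_f1(paragraph + " " + " ".join(trial))
                if best_score is None or score < best_score:  # minimize expected F1
                    best_x, best_score = x, score
            words[i] = best_x                                 # update w_i
    return " ".join(words)
```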
Experiments
All experiments measured adversarial F1 score across 1000 randomly sampled examples from the SQuAD development set.
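For reference, a simplified sketch of the token-level F1 metric being reported; the official SQuAD evaluation script additionally normalizes punctuation and articles before comparing, which is omitted here.

```python
# Simplified sketch of SQuAD-style token-level F1 between a predicted answer
# and a gold answer, based on precision and recall over overlapping tokens.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```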
Results
- ADDSENT made average F1 score across the 4 models fall from 75.7% to 31.3%
- ADDANY made average F1 score across the 4 models fall from 75.7% to 6.7%
- ADDCOMMON made average F1 score across the 4 models fall from 65.7% to 46.1%
Analysis
Existing machine learning systems for reading comprehension do poorly under adversarial evaluation. We see that existing models are overly stable to perturbations that alter semantics.