HotFlip: White Box Adversarial Examples for Text Classification - USC-LHAMa/CSCI544_Project GitHub Wiki
Abstract:
White-box adversarial examples that trick a character-level classifier.
Basic strategy: an atomic flip operation that swaps one token for another, guided by the gradients of the one-hot input vectors.
Intro:
"HotFlip is a method for generating adversarial examples with character substitutions. ... [It] supports insertion and deletion operations by representing them as sequences of character substitutions. [It] uses the gradient with respect to a one-hot input representation to efficiently estimate which individual change has the highest estimated loss, and it uses a beam search to find a set of manipulations that work well together to confuse a classifier."
Choose the flip with the biggest first-order increase in the loss, using the directional derivative along the flip as a cheap proxy for the true loss change.
The same gradient-based estimate guides insertions and deletions, which are represented as sequences of character substitutions.
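The first-order flip selection can be sketched as follows. This is an illustrative reimplementation, not the authors' code: `x` is a one-hot character matrix (positions × vocabulary) and `grad` is the loss gradient with respect to `x` from a single backward pass; flipping position `i` from character `a` to `b` changes the loss by approximately `grad[i, b] - grad[i, a]`.

```python
import numpy as np

def best_flip(x, grad):
    """Pick the single character substitution with the largest
    first-order estimated increase in loss.

    x    : (L, V) one-hot matrix of the current sentence
    grad : (L, V) gradient of the loss w.r.t. x
    """
    # grad[i, a_i] for each position i (gradient at the current character)
    current = (x * grad).sum(axis=1, keepdims=True)
    # estimated loss change for flipping position i to character b
    gain = grad - current
    gain[x.astype(bool)] = -np.inf  # disallow "flipping" to the same character
    i, b = np.unravel_index(np.argmax(gain), gain.shape)
    return i, b, gain[i, b]
```

One backward pass thus ranks every possible flip at once, which is what makes the attack efficient compared to querying the model once per candidate change.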
Multiple Changes
A greedy beam search of r steps yields adversarial examples with a maximum of r flips.
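A minimal sketch of that search, assuming a hypothetical `score_fn(x)` that returns `(loss, grad)` for a one-hot input `x` via one forward/backward pass of the model (names and the expansion width `topk` are illustrative, not from the paper):

```python
import numpy as np

def beam_search_flips(x, score_fn, r=3, beam=10, topk=10):
    """Search for a sequence of at most r flips that maximizes the loss,
    keeping the `beam` highest-loss candidates at each step."""
    states = [(score_fn(x)[0], x)]
    for _ in range(r):
        candidates = []
        for _, cur in states:
            _, grad = score_fn(cur)
            current = (cur * grad).sum(axis=1, keepdims=True)
            gain = grad - current
            gain[cur.astype(bool)] = -np.inf
            # expand the topk best single flips of this state
            for idx in np.argsort(gain, axis=None)[::-1][:topk]:
                i, b = np.unravel_index(idx, gain.shape)
                nxt = cur.copy()
                nxt[i] = 0.0
                nxt[i, b] = 1.0
                candidates.append((score_fn(nxt)[0], nxt))
        candidates.sort(key=lambda t: t[0], reverse=True)
        states = candidates[:beam]
    return states[0]  # (loss, adversarial one-hot input)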
Experiments
Applied to a CharCNN-LSTM on the AG news dataset, with 120K training and 7,600 test instances.
| Method | Misc. error | Success rate |
|---|---|---|
| Baseline | 8.27% | 98.16% |
| Adv-tr (Miyato et al., 2017) | 8.03% | 87.43% |
| Adv-tr (black-box) | 8.60% | 95.63% |
| Adv-tr (white-box) | 7.65% | 69.32% |

Table 2: Comparison based on misclassification error and attack success rate.
Observe the clear drop in attack success rate under white-box adversarial training, at no cost in clean accuracy.
Character-based flips rarely change the meaning of the sentence, based on evaluations by humans.
HotFlip at Word Level:
Character-level representations induce a denser embedding space, so the model is more likely to misbehave under small adversarial perturbations; word-level flips must additionally be constrained to preserve the sentence's meaning.
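A word-level flip can be sketched the same way, with the candidate set restricted to nearby words in embedding space. This is an illustrative simplification (the paper applies further constraints, e.g. on semantic similarity); `E` is a hypothetical word-embedding matrix and `grad` the loss gradient with respect to the embedding of the word being replaced:

```python
import numpy as np

def best_word_flip(E, grad, a, n_neighbors=10):
    """Among the n_neighbors words closest to word `a` by cosine
    similarity, pick the one with the largest first-order loss
    increase, estimated as grad . (E[b] - E[a]).

    E    : (V, d) word-embedding matrix
    grad : (d,) gradient of the loss w.r.t. the embedding of word a
    a    : index of the word currently in this position
    """
    sims = (E @ E[a]) / (np.linalg.norm(E, axis=1) * np.linalg.norm(E[a]) + 1e-12)
    sims[a] = -np.inf  # exclude the word itself
    cands = np.argsort(sims)[::-1][:n_neighbors]
    gains = (E[cands] - E[a]) @ grad  # first-order loss change per candidate
    best = cands[np.argmax(gains)]
    return best, gains.max()
```

The neighbor constraint is what keeps word-level attacks plausible: an unconstrained argmax over the whole vocabulary would usually pick a semantically unrelated word.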
Conclusion
Training on these adversarial examples makes the model more robust to attacks.