HotFlip: White Box Adversarial Examples for Text Classification - USC-LHAMa/CSCI544_Project GitHub Wiki

Abstract:

White-box adversarial examples that trick a character-level text classifier.

Basic strategy: an atomic flip operation that swaps one token for another, guided by the gradients of the loss with respect to the 1-hot input vectors.

Intro:

"HotFlip is a method for generating adversarial examples with character substitutions. ... [It] supports insertion and deletion operations by representing them as sequences of character substitutions. [It] uses the gradient with respect to a 1-hot input representation to efficiently estimate which individual change has the highest estimated loss, and it uses a beam search to find a set of manipulations that work well together to confuse a classifier."

Choose the flip with the biggest first-order estimate of the loss increase, using the directional derivative as a proxy for the true loss change.
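The first-order estimate above can be sketched in a few lines: for a 1-hot input, swapping character a for character b at position i changes the loss by approximately grad[i, b] - grad[i, a]. A minimal sketch (the function name and array shapes are my own assumptions, not the paper's code):

```python
import numpy as np

def best_flip(grads, one_hot):
    """Estimate the single character flip that most increases the loss.

    grads:   (seq_len, vocab) gradient of the loss w.r.t. the 1-hot input.
    one_hot: (seq_len, vocab) 1-hot encoding of the current text.
    Returns (position, new_char_id, estimated_loss_increase).
    """
    # grad[i, a] for the character currently at each position i.
    current = (grads * one_hot).sum(axis=1, keepdims=True)
    # First-order estimate of the loss change for every possible swap:
    # gain[i, b] ~= grad[i, b] - grad[i, a].
    gain = grads - current
    # Exclude "flips" to the character already in place.
    gain[one_hot.astype(bool)] = -np.inf
    pos, char = np.unravel_index(np.argmax(gain), gain.shape)
    return int(pos), int(char), float(gain[pos, char])
```

This single argmax over all (position, character) pairs is what makes the method cheap: one backward pass scores every possible flip at once.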

Use the same gradient-based loss estimate to guide insertions and deletions, which are represented as sequences of character substitutions.

Multiple Changes

A greedy beam search over r steps yields adversarial examples with at most r flips.
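The beam search can be sketched as follows, assuming hypothetical helpers `score_fn` (model loss on a text; higher = more adversarial) and `candidate_flips` (single-flip variants of a text), neither of which is specified in the paper:

```python
import heapq

def beam_search_flips(text, score_fn, candidate_flips, r=3, beam_width=5):
    """Greedy beam search: apply up to r flips, keeping the beam_width
    highest-loss texts at each step.

    text:            initial string
    score_fn:        text -> model loss (assumed helper)
    candidate_flips: text -> list of single-flip variants (assumed helper)
    Returns the (score, text) pair with the highest loss found.
    """
    beam = [(score_fn(text), text)]
    for _ in range(r):
        candidates = []
        for _, t in beam:
            for variant in candidate_flips(t):
                candidates.append((score_fn(variant), variant))
        if not candidates:
            break
        # Keep only the beam_width highest-scoring variants.
        beam = heapq.nlargest(beam_width, candidates, key=lambda x: x[0])
    return max(beam, key=lambda x: x[0])
```

Keeping a beam rather than a single greedy path lets flips that only pay off in combination survive intermediate steps.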

Experiments

Applied to a CharCNN-LSTM on the AG-news dataset, with 120K training and 7,600 test instances.

| Method | Misc. error | Success rate |
|---|---|---|
| Baseline | 8.27% | 98.16% |
| Adv-tr (Miyato et al., 2017) | 8.03% | 87.43% |
| Adv-tr (black-box) | 8.60% | 95.63% |
| Adv-tr (white-box) | 7.65% | 69.32% |

Table 2: Comparison based on misclassification error and attack success rate.

Observe the clear drop in attack success rate under adversarial training.

Based on human evaluations, character-level flips rarely change the meaning of the sentence.

HotFlip at Word Level:

Character-level representations induce a denser embedding space, making the model more likely to misbehave under small adversarial perturbations.

Conclusion

Training on these adversarial examples makes the model more robust to attacks.