Adversarial Attacks

Attacks can be transferred across models; this transferability is a prerequisite for black-box attacks.

Random Search

Random search around a sample. We have to get a sense of how likely it is to hit an adversarial example with random search. If that is possible within a reasonable amount of time, it would help us to better validate the stability of a model, because the gradient only points in the right direction locally; behind a hill on the other side there might be adversarial examples as well.
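
A minimal sketch of such a random search, assuming a toy PyTorch classifier and inputs scaled to [0, 1]; the model, the L-infinity radius epsilon, and the trial budget are illustrative placeholders, not part of the challenge code.

```python
# Minimal sketch: random search for an adversarial example inside an L-infinity ball.
# The model and all hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy stand-in classifier
model.eval()

def random_search_attack(x, y_true, epsilon=0.3, n_trials=1000):
    """Sample uniform perturbations in the epsilon ball and return the first
    perturbed input whose predicted class differs from y_true, or None."""
    with torch.no_grad():
        for _ in range(n_trials):
            delta = (torch.rand_like(x) * 2 - 1) * epsilon  # uniform in [-eps, eps]
            x_adv = (x + delta).clamp(0, 1)
            if model(x_adv).argmax(dim=1).item() != y_true:
                return x_adv
    return None

x = torch.rand(1, 1, 28, 28)          # placeholder input
y = model(x).argmax(dim=1).item()     # clean prediction used as the label
print(random_search_attack(x, y) is not None)
```

Counting how many of the trials succeed would give a rough estimate of the hit probability mentioned above.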

L-BFGS

Szegedy et al. (2014) introduced this attack; they find adversarial examples by minimizing the perturbation subject to a misclassification constraint, using box-constrained L-BFGS.

Fast Gradient Sign Method (FGSM)

Finds an adversarial perturbation by computing the gradient of the loss with respect to the input; the input is then altered according to the sign of this gradient. Introduced by Goodfellow, Shlens, and Szegedy (2014).

\boldsymbol\eta = \epsilon \, \text{sign}\left( \nabla_{\boldsymbol x} J(\boldsymbol\theta, \boldsymbol x, y) \right)

[Figure: input x (left) and the gradient of the class "0" (right)]
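
A hedged sketch of the FGSM step, reusing the toy `model` from the random-search example above; epsilon, the loss, and the input shape are illustrative assumptions.

```python
# Sketch of FGSM: x_adv = x + epsilon * sign(grad_x J(theta, x, y)).
# Reuses `model` from the random-search sketch; epsilon is an illustrative choice.
import torch.nn.functional as F

def fgsm(x, y, epsilon=0.1):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # J(theta, x, y)
    loss.backward()
    eta = epsilon * x.grad.sign()         # the perturbation eta from the formula above
    return (x + eta).clamp(0, 1).detach()

x_adv = fgsm(torch.rand(1, 1, 28, 28), torch.tensor([0]))
```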

Iterated FGSM

The iterated fast gradient sign method (I-FGSM) repeats the gradient computation several times. The image update rule is defined as:

\boldsymbol{X}^{\text{adv}}_{N+1} = \text{Clip}_{X, \epsilon}\left\{ \boldsymbol{X}^{\text{adv}}_{N} + \alpha \, \text{sign}\left( \nabla_X J(\boldsymbol{X}^{\text{adv}}_{N}, y_{\text{true}}) \right) \right\}
where the base case is

\boldsymbol{X}^{\text{adv}}_{0} = \boldsymbol{X}
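
A sketch of this iterated update, reusing `model` and `F` from the previous examples; the step size alpha, epsilon, and the number of iterations are illustrative choices.

```python
# Sketch of I-FGSM: repeat the FGSM step and clip back into the epsilon ball around X.
def i_fgsm(x, y, epsilon=0.1, alpha=0.01, steps=10):
    x_clean = x.clone().detach()          # X^adv_0 = X
    x_adv = x_clean.clone()
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        # Clip_{X, eps}: project into the epsilon ball and the valid pixel range.
        x_adv = torch.max(torch.min(x_adv, x_clean + epsilon), x_clean - epsilon)
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv
```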

Least Likely Class

[FGSM and I-FGSM] are sufficient for application to datasets such as MNIST and CIFAR-10, where the number of classes is small and all classes are highly distinct from each other. On ImageNet, with a much larger number of classes and the varying degrees of significance in the difference between classes, these methods can result in uninteresting misclassifications.

This attack is targeted and aims to find an adversarial example that is classified as the least likely class for an image x.

The image update rule is very similar to I-FGSM (same base case), except the goal is to make the model predict the least likely (LL) class:

y_{\text{target}} = \arg\min_{y} \left\{ p(y \mid \boldsymbol{X}) \right\}

\boldsymbol{X}^{\text{adv}}_{N+1} = \text{Clip}_{X, \epsilon}\left\{ \boldsymbol{X}^{\text{adv}}_{N} - \alpha \, \text{sign}\left( \nabla_X J(\boldsymbol{X}^{\text{adv}}_{N}, y_{\text{target}}) \right) \right\}
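
A sketch of this targeted variant under the same toy setup as above: the least likely class is read off the clean prediction and the sign of the gradient is subtracted. Hyperparameters are illustrative.

```python
# Sketch of the least-likely-class attack: descend towards the class the model
# considers least likely on the clean input. Reuses `model` and `F` from above.
def least_likely_class_attack(x, epsilon=0.1, alpha=0.01, steps=10):
    x_clean = x.clone().detach()
    with torch.no_grad():
        y_target = model(x_clean).argmin(dim=1)   # y_target = argmin_y p(y | X)
    x_adv = x_clean.clone()
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_target)
        loss.backward()
        x_adv = x_adv - alpha * x_adv.grad.sign()  # note the minus sign vs. I-FGSM
        x_adv = torch.max(torch.min(x_adv, x_clean + epsilon), x_clean - epsilon)
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv
```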

Kurakin, Goodfellow, and Bengio (2016) report that this targeted method is more effective than FGSM and I-FGSM at producing meaningful misclassifications on ImageNet.

[Figure from Kurakin, Goodfellow, Bengio (2016)]

Projected Gradient Descent (PGD)

PGD has been proposed by Madry et al. (2017) in their paper Towards Deep Learning Models Resistant to Adversarial Attacks, where it serves both as the attack and as the basis of their adversarial-training defense. Explanation taken from here:

The search for an adversarial example is initialized at a random point within the allowed norm ball; then several iterations of an iterative FGSM-style update (Kurakin et al., 2017; described above) are run, projecting back onto the norm ball after each step.
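
A sketch of PGD under the same toy setup: an iterative FGSM-style attack with a uniform random start inside the epsilon ball and a projection after every step; hyperparameters are illustrative.

```python
# Sketch of PGD: random start inside the epsilon ball, then projected iterative steps.
def pgd(x, y, epsilon=0.1, alpha=0.01, steps=10):
    x_clean = x.clone().detach()
    # Random initialization within the allowed norm ball.
    x_adv = (x_clean + (torch.rand_like(x_clean) * 2 - 1) * epsilon).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        x_adv = x_adv + alpha * x_adv.grad.sign()
        # Projection onto the epsilon ball and the valid pixel range.
        x_adv = torch.max(torch.min(x_adv, x_clean + epsilon), x_clean - epsilon)
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv
```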

JSMA

Paper

Train n classifiers with the same architecture on the same data. Add up their losses and perform gradient ascent on the summed loss w.r.t. the input. How long does it take on average to find an adversarial example?
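
A hedged sketch of the ensemble idea in the note above (not of JSMA itself); the three toy models stand in for the n identically structured classifiers, and all hyperparameters are illustrative. Imports and conventions are reused from the earlier sketches.

```python
# Sketch of the note above: sum the losses of n identically structured classifiers
# and take gradient-ascent steps on the input.
models = [nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)) for _ in range(3)]

def ensemble_ascent(x, y, alpha=0.01, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = sum(F.cross_entropy(m(x_adv), y) for m in models)  # summed ensemble loss
        loss.backward()
        x_adv = (x_adv + alpha * x_adv.grad.sign()).clamp(0, 1).detach()
    return x_adv
```

Timing how many such steps are needed until all classifiers misclassify would answer the question in the note.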

Adversarial Transformation Networks

Paper

The Basic Idea

Instead of applying the gradient of some network's loss to the input image, as FGSM does, the approach of Adversarial Transformation Networks (ATNs) is to train a separate (convolutional) neural network that produces adversarial images as outputs. Transformations made by an ATN are not general; they are tied to the network it is trained to attack.

Architecture and Setup

An ATN is defined as a function

g_{f, \boldsymbol\theta}(\boldsymbol x) : \boldsymbol x \mapsto \boldsymbol x'

where f is the target network and \boldsymbol\theta are the ATN's parameters. The goal is that x' is very similar to x, but the classification by f is different:

\arg\max f(\boldsymbol x) \neq \arg\max f(\boldsymbol x')

Loss

The loss function is given by

L = \beta \, L_X\left( g_{f,\boldsymbol\theta}(\boldsymbol x), \boldsymbol x \right) + L_Y\left( f(g_{f,\boldsymbol\theta}(\boldsymbol x)), f(\boldsymbol x) \right)

Here:

  • L_x is the standard L2-loss between the adversarial and clean images.
  • L_y is a bit more complex. Broadly speaking, it measures how well the probability distribution given by the target network f on the adversarial image matches an adversarial target distribution (with the target class as highest probability and the other classes in the same order as on the clean image); see the sketch after this list. The key idea is not to use a one-hot encoding for the target probability distribution, but to use the current uncertainty of the model as a basis.
  • α and β are hyperparameters: β determines the emphasis of L_x vs. L_y, while α enters the construction of the reranked target distribution used in L_y.
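
A hedged sketch of this loss, following only the broad description above; the reranking function, its name, and the concrete use of alpha and beta here are assumptions rather than the exact formulation of the paper. It reuses `F` from the earlier sketches.

```python
# Sketch of the ATN training loss (assumed form): L = beta * L_x + L_y, where L_y
# compares f's output on the adversarial image with a reranked target distribution.
def rerank(y_probs, target_class, alpha=1.5):
    """Boost the target class above the current maximum, keep the order of the
    remaining classes, and renormalize (assumed reranking, not the paper's exact one)."""
    r = y_probs.detach().clone()
    r[:, target_class] = alpha * r.max(dim=1).values
    return r / r.sum(dim=1, keepdim=True)

def atn_loss(x, x_adv, f_clean, f_adv, target_class, beta=0.1):
    l_x = F.mse_loss(x_adv, x)                              # L_x: L2 between images
    l_y = F.mse_loss(f_adv, rerank(f_clean, target_class))  # L_y: match reranked distribution
    return beta * l_x + l_y
```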

Discussion

ATNs generate a very different type of adversarial example than gradient-based techniques such as FGSM. As presented in the paper, one ATN has to be trained per adversarial target class.

The transformations maintain the large, empty regions of the image. Unlike many previous studies in attacking classifiers, the addition of salt-and-pepper type noise did not appear [...]

Momentum Iterative Gradient-Based Methods (NIPS 2017 attack winner)

The Basic Idea

Instead of making only one gradient step (FGSM) or many small steps whose gradients are computed independently in each iteration (iterative FGSM), the idea here is to accumulate a velocity vector in the gradient direction of the loss across iterations.

Algorithm

In the paper, the following pseudo-code is given:

[Figure: MI-FGSM pseudo-code from the paper]

This satisfies the L-infinity norm constraint (with maximum perturbation ε). As shown in the pseudo-code, the gradients of the individual iterations are accumulated in g_t, weighted by a decay factor μ.
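
A hedged sketch of the update loop, reusing the toy `model` from the earlier examples; the L1 normalization of the gradient and the decay factor μ follow the paper's description, while all concrete values are illustrative.

```python
# Sketch of MI-FGSM: accumulate L1-normalized gradients in a velocity vector g,
# weighted by the decay factor mu, and step in the direction of sign(g).
def mi_fgsm(x, y, epsilon=0.1, steps=10, mu=1.0):
    alpha = epsilon / steps               # step size so the total budget is epsilon
    x_clean = x.clone().detach()
    x_adv = x_clean.clone()
    g = torch.zeros_like(x_clean)         # accumulated velocity vector g_t
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        grad = x_adv.grad
        g = mu * g + grad / grad.abs().sum()   # g_{t+1} = mu * g_t + grad / ||grad||_1
        x_adv = x_adv + alpha * g.sign()
        x_adv = torch.max(torch.min(x_adv, x_clean + epsilon), x_clean - epsilon)
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv
```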

Discussion

The results in the paper suggest that the MI-FGSM method considerably boosts the success rates of adversarial examples, in particular their transferability to other models in black-box settings.

The authors explain this by noting that one-step methods like FGSM assume a linear decision boundary in the input space. In practice, this linearity assumption does not always hold, which limits the effectiveness of one-step approaches. Standard iterative methods avoid this problem, but they are, like neural network optimization in general, prone to getting stuck in local maxima (of the loss function).

Here, the momentum-based approach can help, because it stabilizes the update directions and may escape local maxima.

The authors further describe how they adapt MI-FGSM to an ensemble of models. The gist is that they fuse the logit activations of the different models by a weighted sum. Details can be found in section 3.2 of the paper.

