Introduction
Summary of a lecture by Ian Goodfellow at Stanford University
What are Adversarial Examples?
Definition
- an example input that has been carefully constructed in order to fool the network into making a wrong decision
Where do they appear?
- not just in CNNs, but also in logistic regression, SVMs, even in nearest neighbor algorithms
- there are also fun adversarial examples for the human brain (Pinna and Gregory, 2002)
Why do Adversarial Examples Exist?
1st Idea: Overfitting
- if the model has more parameters than it needs to fit the training data, it is prone to misclassifying new inputs
- but: if this were true, adversarial examples would be random and therefore unique to each network; experiments have shown the opposite, however
The Systematic Effect of Adversarial Examples
- adversarial examples are transferable across networks and across network architectures
- if the perturbation x* - x between a clean image x and its adversarial example x* is added to another clean image y, the resulting image is often also adversarial
Linearity in Deep Networks
- in modern deep networks, the mapping from input to output is largely piecewise linear (e.g. due to activation functions like ReLU)
- however, the mapping from parameters to output is very complex - which is why training is not easy
- the near-linear mapping from input to output makes it easy to adjust an input image towards a desired output (the inverse of training)
- in the lecture's plot, the logit values for specific classes behave very linearly as one changes the input image (a car) by eps * (small perturbation); a sketch of such a sweep follows this list
==> here, they found a perturbation direction that was associated with the frog class
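A minimal sketch (not from the lecture) of how such a logit-vs-eps sweep could be reproduced; the model, image, and perturbation direction below are random placeholders:

```python
import torch
import torch.nn as nn

# Stand-in linear classifier and random image/direction (hypothetical placeholders;
# the lecture uses a trained CNN and a specific perturbation direction).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
model.eval()

x = torch.rand(1, 3, 32, 32)            # clean input image (placeholder)
d = torch.sign(torch.randn_like(x))     # fixed perturbation direction

# Sweep the input along d and record the logits; for a (near-)piecewise-linear
# input-to-output mapping, each logit traces out an almost straight line in eps.
for eps in torch.linspace(-2.0, 2.0, steps=9).tolist():
    with torch.no_grad():
        logits = model(x + eps * d)
    print(f"eps={eps:+.2f}  first three logits: {logits.squeeze()[:3].tolist()}")
```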
Constructing Adversarial Examples
The Fast Gradient Sign Method
- idea: maximize the loss that a given input image causes in the network => take the gradient of the loss with respect to the input image, keep only its sign, and add it to the input image: x* = x + eps * sign(∇_x J(θ, x, y))
- the sign function (together with the scaling by eps) enforces the epsilon constraint: the perturbation's max norm is ≤ epsilon (a runnable sketch follows)
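A minimal FGSM sketch in PyTorch, assuming a standard cross-entropy loss; the model and data in the usage example are untrained placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm(model: nn.Module, x: torch.Tensor, y: torch.Tensor, eps: float) -> torch.Tensor:
    """One-step FGSM: x* = x + eps * sign(grad_x J(theta, x, y))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()       # keep pixel values in a valid range

# Usage with a stand-in, untrained model and placeholder data (both hypothetical):
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y, eps=8 / 255)
```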
Maps of Adversarial and Random Cross Sections
- legend on the left: the FGSM vector runs left-to-right and a random orthogonal direction top-to-bottom (both applied from -eps to +eps)
- on the right, the resulting 2D classification maps of different CIFAR-10 images are shown (colors mean an incorrect class, white means the correct class); see the sketch after this list
- observations:
  - in most of the images, half of the map is classified correctly, with a near-linear boundary
  - FGSM has identified a direction such that any perturbation with a large dot product with this direction yields an adversarial example
  - adversarial examples live in linear subspaces (not at tiny, isolated points in input space) => all nearby images are also adversarial examples
- how many dimensions do these adversarial subspaces have?
  - on average: 25 (on MNIST, where you have 28^2 = 784 total input dimensions)
  - this tells you how likely you are to find an adversarial example from random noise
  - also, the larger the subspaces for two models, the more likely it is that they intersect => transferable examples
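A rough sketch of how such a cross-section map could be computed, assuming a PyTorch classifier; the model, image, and label below are placeholders and the lecture's actual plotting code is not known:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in classifier, image, and label (hypothetical; the lecture uses trained CIFAR-10 models).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
y = torch.tensor([3])

# FGSM direction (left-to-right axis of the map).
x_req = x.clone().requires_grad_(True)
F.cross_entropy(model(x_req), y).backward()
d_fgsm = x_req.grad.sign()

# Random direction made orthogonal to the FGSM direction (top-to-bottom axis).
d_rand = torch.randn_like(x)
d_rand = d_rand - (d_rand * d_fgsm).sum() / (d_fgsm * d_fgsm).sum() * d_fgsm
d_rand = d_rand / d_rand.abs().max()            # scale to a comparable max-norm

eps, steps = 8 / 255, 11
grid = torch.linspace(-eps, eps, steps).tolist()
correct_map = torch.zeros(steps, steps, dtype=torch.bool)
with torch.no_grad():
    for i, a in enumerate(grid):                # FGSM axis
        for j, b in enumerate(grid):            # random orthogonal axis
            pred = model(x + a * d_fgsm + b * d_rand).argmax(dim=1)
            correct_map[j, i] = bool((pred == y).item())
# In the lecture's maps, the correctly classified region typically fills about half
# of such a grid, with a near-linear boundary roughly orthogonal to the FGSM axis.
```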
The Idea of Clever Hans
- intuition: the model learns some distribution of training examples that seem "natural"
- with an adversarial example, one leaves this "natural" distribution which the network can't handle
Good Defense: RBFs
- when using the FGSM attack on these quadratic networks (RBF units are quadratic rather than piecewise linear in the input), the perturbation actually has to transform the image into another class (a minimal sketch follows this list)
- ==> not technically an adversarial example
- however: RBFs have very poor performance
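A minimal, hypothetical sketch of what such a shallow RBF classifier could look like (class templates with squared-distance "logits"); this is an illustration, not the lecture's model:

```python
import torch
import torch.nn as nn

class ShallowRBF(nn.Module):
    """Minimal, hypothetical sketch of a shallow RBF classifier.

    Each class c has a template mu_c; the "logit" is -gamma * ||x - mu_c||^2, so
    confidence drops off quickly away from the templates. To raise another class's
    score, a perturbation has to move the image substantially towards that class's
    template, which is why small FGSM-style perturbations are largely ineffective."""

    def __init__(self, in_features: int, num_classes: int, gamma: float = 1.0):
        super().__init__()
        self.templates = nn.Parameter(torch.randn(num_classes, in_features))
        self.gamma = gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.flatten(1)                                  # (batch, in_features)
        dist2 = torch.cdist(x, self.templates).pow(2)     # squared distance to each template
        return -self.gamma * dist2                        # closer template => higher score

model = ShallowRBF(in_features=28 * 28, num_classes=10)
logits = model(torch.rand(4, 1, 28, 28))                  # e.g. MNIST-sized placeholder inputs
```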
Adversarial Attacks
Black Box Attacks
- basis: cross-model and cross-dataset transferability
- idea: the attacker wants to fool a network they have no information about (architecture, type, dataset, ...)
- attack (see the sketch after this list):
  - train your own substitute model to mimic the target model
  - create adversarial examples for the substitute model
  - deploy those adversarial examples against the target
- in practice, about 70% of examples transfer cross-dataset
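A rough sketch of the substitute-model attack under the assumption that the target can be queried for predictions; all models and data below are untrained placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical setup: `target_model` is the black box we can only query,
# `substitute` is our own model, `queries` are unlabeled inputs we control.
target_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in black box
substitute = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
queries = torch.rand(64, 3, 32, 32)

# 1) Train the substitute to mimic the target by labeling queries with its predictions.
with torch.no_grad():
    pseudo_labels = target_model(queries).argmax(dim=1)
opt = torch.optim.SGD(substitute.parameters(), lr=0.1)
for _ in range(20):
    opt.zero_grad()
    F.cross_entropy(substitute(queries), pseudo_labels).backward()
    opt.step()

# 2) Craft adversarial examples against the substitute (white-box FGSM, as above) ...
x = queries[:4].clone().requires_grad_(True)
F.cross_entropy(substitute(x), pseudo_labels[:4]).backward()
x_adv = (x + 8 / 255 * x.grad.sign()).clamp(0, 1).detach()

# 3) ... and deploy them against the target, relying on transferability.
with torch.no_grad():
    fooled = target_model(x_adv).argmax(dim=1) != pseudo_labels[:4]
```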
Enhancing Transfer with Ensembles
- idea: use an ensemble of many different models in order to create adversarial examples (Liu et al. 2016)
- => the probability that the attack succeeds on another (target) model is then close to 100% (see the sketch below)
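A minimal sketch of an ensemble-based attack (here a single FGSM step against the averaged loss, a simplification of Liu et al.'s method); the ensemble members are untrained placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ensemble_fgsm(models, x, y, eps):
    """Craft one perturbation against the averaged loss of several models, so that
    it is more likely to transfer to an unseen target model."""
    x = x.clone().detach().requires_grad_(True)
    loss = sum(F.cross_entropy(m(x), y) for m in models) / len(models)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

# Stand-in ensemble of two differently shaped, untrained models (hypothetical).
models = [
    nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)),
    nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10)),
]
x_adv = ensemble_fgsm(models, torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,)), eps=8 / 255)
```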
Defenses
Defenses are very Hard
- many failed attempts
- regularization alone does not do the trick
- even using a generative model is insufficient
Adversarial Training
- neural nets can represent any function, but maximum-likelihood training does not cause them to learn the right one
- idea: train on adversarial examples
- works quite well (for FGSM attacks), but not for other, iterative attacks
- interesting effect: training on adversarial examples also improves performance on the clean classification task (it can be seen as a kind of regularization)
- these adversarially trained networks have the best empirical success rate among defenses (see the training-loop sketch below)
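A minimal adversarial-training loop in PyTorch, mixing the clean and the FGSM loss; the model, optimizer settings, and data are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in classifier
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loader = [(torch.rand(8, 3, 32, 32), torch.randint(0, 10, (8,))) for _ in range(10)]  # placeholder data

for x, y in loader:
    # Craft FGSM examples on the fly (same one-step construction as in the FGSM sketch above).
    x_req = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_req), y).backward()
    x_adv = (x_req + 8 / 255 * x_req.grad.sign()).clamp(0.0, 1.0).detach()

    # Train on a 50/50 mix of clean and adversarial loss (alpha = 0.5, as in Goodfellow et al. 2015).
    opt.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    opt.step()
```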
Virtual Adversarial Training
- use unlabeled data: take the model's own guess for an image as the label, craft an adversarial perturbation intended to change that guess, and train the model to keep its prediction stable (see the sketch below)
- this gives semi-supervised training: both labeled and unlabeled data can be used
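A rough sketch of a virtual adversarial loss term (simplified from Miyato et al.'s formulation, e.g. using a sign step instead of an L2-normalized one); model and data are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def virtual_adversarial_loss(model, x, eps=8 / 255, xi=1e-6):
    """Rough sketch of a virtual adversarial loss: no labels are needed, because the
    model's own prediction on x serves as the target distribution."""
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)                      # the model's current guess
    # Find a perturbation direction that changes the guess the most (one gradient step
    # on a random probe direction; a sign step is used here for simplicity).
    d = torch.randn_like(x, requires_grad=True)
    kl = F.kl_div(F.log_softmax(model(x + xi * d), dim=1), p, reduction="batchmean")
    grad_d, = torch.autograd.grad(kl, d)
    r_adv = eps * grad_d.sign()
    # Penalize how far the prediction moves under that perturbation.
    return F.kl_div(F.log_softmax(model(x + r_adv), dim=1), p, reduction="batchmean")

# Usage on unlabeled data (stand-in model and placeholder inputs):
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
unlabeled = torch.rand(16, 3, 32, 32)
vat_loss = virtual_adversarial_loss(model, unlabeled)
```

This unlabeled loss term can simply be added to the usual supervised cross-entropy on the labeled part of the data, which is the semi-supervised setting mentioned above.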
Why is Solving the Adversarial Problem so Interesting?
- on the one hand, it prevents attackers from causing the network to make a wrong decision
- but on the other hand, one could use it to design molecules, fast cars, new circuits, etc.
- why? because then the "attack" no longer produces an adversarial example that merely tricks the network, but a perturbation of the input that actually "makes sense"
- example: use the blueprint of a car as input, train a network to predict the car's speed, then perturb the blueprint so that the predicted speed increases => instead of an adversarial example, you get the blueprint of a faster car (see the sketch below)
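A minimal sketch of this kind of model-based optimization: gradient ascent on the input of a (hypothetical) speed-prediction network; all names and shapes below are made up for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical regressor that predicts a score (say, top speed) from a design "blueprint";
# both the model and the blueprint below are untrained/random placeholders.
speed_model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(), nn.Linear(128, 1))

design = torch.rand(1, 1, 64, 64, requires_grad=True)      # placeholder blueprint
opt = torch.optim.Adam([design], lr=0.01)

# Gradient ascent on the *input*: nudge the blueprint so the predicted speed goes up.
# With a robust model this tends to yield designs that genuinely score higher; with a
# non-robust model it mostly yields adversarial examples that only fool the predictor.
for _ in range(100):
    opt.zero_grad()
    (-speed_model(design).mean()).backward()                # maximize the predicted speed
    opt.step()
    with torch.no_grad():
        design.clamp_(0.0, 1.0)                             # keep the blueprint in a valid range
```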