1611.02770 - hassony2/inria-research-wiki GitHub Wiki

ICLR 2017

[arxiv 1611.02770] Delving into transferable adversarial examples and black-box attacks [PDF] [notes]

Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song

read 06/08/2017

Objective

Study the transferability of untargeted and targeted adversarial examples over large datasets

Synthesis

Untargeted adversarial examples are easy to generate when the model (neural network) is known

When the model is unknown (black-box), attacks can rely on transferability: generate an adversarial example for a known model and check whether it also misleads the black-box model.

Non-targeted attacks often transfer, while targeted attacks almost never transfer with their target label

The idea is to craft adversarial images against an ensemble of multiple models and hope the attack generalizes to unseen models; this approach achieves successful transfers even for targeted attacks

When generating adversarial examples, the goal is to find x_hat such that

f_theta(x_hat) != y, where y is the ground-truth label and f_theta is the model: the network J_theta outputs class probabilities and f_theta returns the category with the highest probability,

under the constraint d(x, x_hat) < epsilon, where d quantifies the distance between the inputs
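As a concrete reading of this formulation, here is a minimal PyTorch sketch of the two conditions an adversarial example must satisfy; the function name, the batched-tensor convention and the default distance d are illustrative choices (not from the paper), and `model` is assumed to return the class probabilities J_theta.

```python
import torch

def is_adversarial(model, x, x_hat, y, eps, d=lambda a, b: (a - b).norm()):
    """Check the two conditions above: the prediction differs from the
    ground-truth label y (untargeted case) and the distortion stays below eps."""
    with torch.no_grad():
        pred = model(x_hat).argmax(dim=1)   # f_theta(x_hat): class with highest probability
    return bool((pred != y).all()) and bool(d(x, x_hat) < eps)
```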

Untargeted attacks

Optimization-based approach

find argmin_{x_hat} lambda * d(x, x_hat) - l(1_y, J_theta(x_hat))

where 1_y is the one-hot encoding of y and lambda balances the distance constraint. In this work, they use l(u, v) = log(1 - u·v), which decreases from 0 to -inf as u·v goes from 0 to 1. If the class is correctly predicted, u·v is close to 1, so l(1_y, J_theta(x_hat)) approaches -inf and -l(1_y, J_theta(x_hat)) approaches +inf; minimizing the objective therefore drives u·v toward 0, i.e. x_hat toward being misclassified

This objective is optimized by initializing x_hat at x and running Adam; the distortion can be controlled by tuning Adam's learning rate and lambda

In practice, lambda is set to 0!

100 iterations are used to generate each adversarial example
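A minimal PyTorch sketch of this optimization-based untargeted attack, under the settings stated above (x_hat initialized at x, Adam, lambda = 0, 100 iterations); the function name, the `lr` default and the small constant added for numerical stability are assumptions, and `model` is assumed to output softmax probabilities.

```python
import torch

def untargeted_opt_attack(model, x, y, lr=0.01, lam=0.0, steps=100):
    """Minimize lam * ||x - x_hat|| - log(1 - J_theta(x_hat)[y]) with Adam,
    i.e. the objective above with l(u, v) = log(1 - u.v), starting from x_hat = x."""
    x_hat = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_hat], lr=lr)
    for _ in range(steps):
        probs = model(x_hat)                             # J_theta(x_hat), shape (B, num_classes)
        p_y = probs.gather(1, y.view(-1, 1)).squeeze(1)  # probability of the true class
        # minimizing -log(1 - p_y) drives p_y toward 0, i.e. toward misclassification
        loss = lam * (x_hat - x).norm() - torch.log(1.0 - p_y + 1e-12).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return x_hat.detach()
```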

Fast gradient sign

x_hat = clip(x + eps * sign(grad_x l(1_y, J_theta(x)))), where clip projects the values back to the [0, 255] pixel range

Fast gradient

x_hat = clip(x + eps * grad_x l(1_y, J_theta(x)) / || grad_x l(1_y, J_theta(x)) ||)

Here the distance metric is a norm on x - x_hat: d(x, x_hat) = || x - x_hat ||
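Both one-step variants fit in a few lines; this sketch reuses the loss l(u, v) = log(1 - u·v) from the optimization-based section (the note does not restate which loss is plugged in, so that choice is an assumption), and assumes `model` returns probabilities and pixels live in [0, 255].

```python
import torch

def fast_gradient_untargeted(model, x, y, eps, use_sign=True):
    """One-step untargeted attack: fast gradient sign (use_sign=True)
    or fast gradient with an L2-normalized gradient (use_sign=False)."""
    x = x.clone().detach().requires_grad_(True)
    p_y = model(x).gather(1, y.view(-1, 1)).squeeze(1)
    loss = torch.log(1.0 - p_y + 1e-12).sum()          # l(1_y, J_theta(x))
    grad = torch.autograd.grad(loss, x)[0]             # grad_x l(1_y, J_theta(x))
    step = grad.sign() if use_sign else grad / (grad.norm() + 1e-12)
    return torch.clamp(x + eps * step, 0, 255).detach()  # clip back to pixel range
```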

Targeted attacks

the constraint is now f_theta(x_hat) = y_hat where y_hat is the target label

Optimization-based approach

We now solve

argmin_{x_hat} lambda * d(x, x_hat) + l'(1_{y_hat}, J_theta(x_hat))
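A sketch of this targeted loss in PyTorch, assuming l' is the cross-entropy against the target one-hot, i.e. -log J_theta(x_hat)[y_hat] (the note does not spell l' out, so this is an assumption); the loss would be minimized with the same Adam loop as in the untargeted case.

```python
import torch

def targeted_opt_loss(model, x, x_hat, y_target, lam=0.0):
    """lambda * d(x, x_hat) + l'(1_{y_hat}, J_theta(x_hat)), with l' taken as
    cross-entropy so that minimizing it pushes the probability of y_target up."""
    p_target = model(x_hat).gather(1, y_target.view(-1, 1)).squeeze(1)
    return lam * (x_hat - x).norm() - torch.log(p_target + 1e-12).sum()
```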

Fast Gradient Sign

x_hat = clip(x - eps * sign(grad_x l'(1_{y_hat}, J_theta(x))))

Fast Gradient

x_hat = clip(x - eps * grad_x l'(1_{y_hat}, J_theta(x)) / || grad_x l'(1_{y_hat}, J_theta(x)) ||)

Distortion: d(x, x_hat) = sqrt(sum_i (x_hat^i - x^i)^2 / N), the root mean squared difference over the N pixels
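A matching sketch of the targeted one-step updates and of the distortion metric; the same assumptions as before apply (probability outputs, [0, 255] pixels, cross-entropy l'), and the minus sign implements a descent step toward the target label.

```python
import torch

def fast_gradient_targeted(model, x, y_target, eps, use_sign=True):
    """One-step targeted attack: descend the gradient of
    l'(1_{y_hat}, J_theta(x)) = -log J_theta(x)[y_hat]."""
    x = x.clone().detach().requires_grad_(True)
    p_target = model(x).gather(1, y_target.view(-1, 1)).squeeze(1)
    loss = -torch.log(p_target + 1e-12).sum()
    grad = torch.autograd.grad(loss, x)[0]
    step = grad.sign() if use_sign else grad / (grad.norm() + 1e-12)
    return torch.clamp(x - eps * step, 0, 255).detach()  # minus: move toward y_target

def rmsd(x, x_hat):
    """Distortion reported above: root of the mean squared per-pixel difference."""
    return torch.sqrt(((x_hat - x) ** 2).mean())
```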

Experiments

Test set: 100 images which can be classified correctly by all five models (ResNet-50/101/152, GoogLeNet, VGG-16)

Results

Non-targeted attacks

Non-targeted attacks transfer fairly well: between 60 and 90% success from the worst to the best model pair for the optimization-based approach

For the fast gradient approach, between 75 and 93% success

Note: fast gradient methods, being speed-oriented approximations, fail to create adversarial examples in 1-4% of the cases even when generated and tested on the same model

Fast Gradient performs better than Fast Gradient Sign

When large distortions are allowed, 100% of the samples can be transferred from VGG-16 to ResNet-152 using fast-gradient-based methods

Targeted attacks

Very poor transferability (1%) of the target label for the optimization-based approach, even with increased distortion (obtained by tweaking the Adam parameters). The same is observed for the gradient-based methods

Random noise

The label obtained by adding random noise does not transfer from model to model (random noise can produce untargeted attacks, but not targeted ones)

Ensemble approaches

The idea is to generate adversarial images against a set of models

For instance, for the targeted attack, we have to solve

argmin_{x_hat} - log( (sum_i a_i J_i(x_hat)) · 1_{y_hat} ) + lambda * d(x, x_hat)

where the J_i are the individual models and the a_i are the ensemble weights

For the optimization-based approach, the Adam updates are applied to the image at each step, and this is repeated for 100 iterations
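A sketch of this ensemble objective, assuming each J_i returns probabilities and that the weights a_i sum to 1 (both assumptions); it would be minimized with the same Adam loop as the single-model attack (x_hat initialized at x, 100 iterations).

```python
import torch

def ensemble_targeted_loss(models, weights, x, x_hat, y_target, lam=0.0):
    """-log of the weighted ensemble probability of the target class, plus lam * d(x, x_hat)."""
    ens_probs = sum(a * m(x_hat) for a, m in zip(weights, models))    # sum_i a_i J_i(x_hat)
    p_target = ens_probs.gather(1, y_target.view(-1, 1)).squeeze(1)   # dot product with 1_{y_hat}
    return -torch.log(p_target + 1e-12).sum() + lam * (x_hat - x).norm()
```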

Non-targeted attacks have almost perfect transferability in the case of the optimization-based technique

Notes

Boundaries

Interesting visualization of the classes obtained when perturbing the input: in an image plane spanned by one random direction and the gradient direction, not all classes are present (only about 20 of the ImageNet classes appear)

And indeed, along the gradient direction, the predicted class changes quickly