1611.02770
ICLR 2017
[arxiv 1611.02770] Delving into Transferable Adversarial Examples and Black-box Attacks [PDF] [notes]
Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song
read 06/08/2017
Objective
Study the transferability of untargeted and targeted adversarial examples over large models and a large-scale dataset (ImageNet)
Synthesis
Untargeted adversarial examples are easy to generate when the model (neural network) is known
When the model is unknown (black-box), attacks can rely on transferability: generate an adversarial example on a known model and check whether it also misleads the black-box model
Non-targeted attacks often transfer, while targeted attacks almost never transfer with their target label
The idea is to craft adversarial images against an ensemble of multiple models and hope the attack generalizes to unseen models; this approach achieves successful transfer of targeted attacks
When generating adversarial examples, the goal is to find x_hat so that
f_theta(x_hat) != y, where y is the ground-truth label and f_theta is a model composed of a network J_theta that outputs probabilities, f_theta returning the category with the highest probability
under the constraint d(x, x_hat) < epsilon, where d quantifies the distance between the inputs
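Restated compactly in the same notation (just a restatement of the lines above, not an extra formula from the paper):

```latex
% Adversarial example generation, as described above
\text{find } \hat{x} \ \text{such that} \ f_\theta(\hat{x}) \neq y
\quad \text{subject to} \quad d(x, \hat{x}) < \epsilon,
\qquad \text{where } f_\theta(x) = \arg\max_c \, [J_\theta(x)]_c
```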
Untargeted attacks
Optimization based approach
find argmin_{x_hat} lambda * d(x, x_hat) - l(1_y, J_theta(x_hat))
where 1_y is the one-hot encoding of y and lambda balances the distance constraint. In this work they use l(u, v) = log(1 - u·v), which ranges from 0 down to -inf as u·v goes from 0 to 1; if the class is correctly predicted, u·v is close to 1 and l(1_y, J_theta(x_hat)) is close to -inf, making -l(1_y, J_theta(x_hat)) close to +inf. Minimizing the objective therefore drives u·v towards 0, i.e. x_hat must be misclassified
This objective is optimized by initializing x_hat at x and running Adam; the distortion can be controlled by tuning Adam's learning rate and lambda
In practice, lambda is set to 0!
100 Adam iterations are used to generate each adversarial example
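A minimal sketch of this optimization-based untargeted attack (lambda = 0, 100 Adam steps), assuming a PyTorch `model` that maps a [0, 255] image tensor to softmax probabilities; the learning rate and function names are illustrative, not the paper's code:

```python
import torch

def untargeted_opt_attack(model, x, y, steps=100, lr=4.0):
    """Maximize l(1_y, J_theta(x_hat)) = log(1 - p_y(x_hat)), starting from x_hat = x.

    model: callable returning softmax probabilities of shape (1, n_classes)
    x:     image tensor in [0, 255], shape (1, C, H, W)
    y:     ground-truth label (int)
    """
    x_hat = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_hat], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        prob_y = model(x_hat)[0, y]
        # objective with lambda = 0: minimize -l(1_y, J_theta(x_hat)) = -log(1 - p_y)
        loss = -torch.log(1.0 - prob_y + 1e-12)
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            x_hat.clamp_(0.0, 255.0)   # stay in the valid pixel range
    return x_hat.detach()
```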
Fast gradient sign
x_hat = clip(x + eps * sign(grad_x l(1_y, J_theta(x)))), where clip brings the values back to the [0, 255] pixel range
Fast gradient
x_hat = clip(x + eps * grad_x l(1_y, J_theta(x)) / || grad_x l(1_y, J_theta(x)) ||)
Here the distance metric is a norm on x - x_hat: d(x, x_hat) = || x - x_hat ||
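A sketch of the two single-step untargeted variants under the same assumptions (softmax-output `model`, pixel values in [0, 255]); `eps` controls the distortion:

```python
import torch

def untargeted_fast_gradient(model, x, y, eps, use_sign=False):
    """One-step untargeted attack: move along the gradient of l(1_y, J_theta(x)).

    use_sign=True  -> fast gradient sign (FGS)
    use_sign=False -> fast gradient (gradient normalized by its L2 norm)
    """
    x = x.clone().detach().requires_grad_(True)
    prob_y = model(x)[0, y]
    loss = torch.log(1.0 - prob_y + 1e-12)    # l(1_y, J_theta(x))
    loss.backward()
    grad = x.grad.detach()
    step = grad.sign() if use_sign else grad / (grad.norm() + 1e-12)
    return (x.detach() + eps * step).clamp(0.0, 255.0)
```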
Targeted attacks
The constraint is now f_theta(x_hat) = y_hat, where y_hat is the target label
Optimization based approach
We now solve
argmin_{x_hat} lambda * d(x, x_hat) + l'(1_{y_hat}, J_theta(x_hat)), where l' is a loss that is minimized when J_theta(x_hat) assigns high probability to y_hat
Fast Gradient Sign
x_hat = clip(x - eps * sign(grad_x l'(1_{y_hat}, J_theta(x))))
Fast Gradient
x_hat = clip(x - eps * grad_x l'(1_{y_hat}, J_theta(x)) / || grad_x l'(1_{y_hat}, J_theta(x)) ||)
Distortion: d(x_hat, x) = sqrt(sum_i (x_hat^i - x^i)^2 / N) (root mean square deviation over the N pixels)
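A matching sketch for the targeted single-step attacks, assuming l' is the cross-entropy restricted to the target class (so stepping against its gradient increases the predicted probability of y_hat), plus the distortion measure above:

```python
import torch

def targeted_fast_gradient(model, x, y_target, eps, use_sign=True):
    """One-step targeted attack: move against the gradient of l'(1_{y_hat}, J_theta(x)),
    which decreases l' and thus increases the predicted probability of y_target."""
    x = x.clone().detach().requires_grad_(True)
    prob_t = model(x)[0, y_target]
    loss = -torch.log(prob_t + 1e-12)   # l': cross-entropy against the one-hot target
    loss.backward()
    grad = x.grad.detach()
    step = grad.sign() if use_sign else grad / (grad.norm() + 1e-12)
    return (x.detach() - eps * step).clamp(0.0, 255.0)

def rmsd(x_hat, x):
    """Distortion d(x_hat, x) = sqrt(sum_i (x_hat^i - x^i)^2 / N)."""
    return torch.sqrt(((x_hat - x) ** 2).mean())
```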
Experiments
Test set: 100 ImageNet images that are correctly classified by all five models (ResNet-50, ResNet-101, ResNet-152, GoogLeNet, VGG-16)
Results
Non-targeted attacks
Non-targeted attacks transfer fairly well (between 90% and 60% success from the best to the worst model pair for the optimization-based approach)
For the fast gradient approach, between 93% and 75% success
Note: fast gradient methods, being speed-oriented approximations, fail to create adversarial examples in 1-4% of the cases (generated and tested on the same model)
Fast gradient performs better than fast gradient sign
When large distortions are allowed, 100% of the samples can be transferred from VGG-16 to ResNet-152 using fast-gradient-based methods
Targeted attacks
Very poor transferability (around 1%) in the targeted case for the optimization-based approach, even with increased distortion (obtained by tweaking the Adam parameters). The same is observed for the gradient-based methods
Random noise
The label obtained by adding random noise does not transfer from model to model (random noise can produce untargeted misclassifications, but not targeted ones)
Ensemble approaches
The idea is to generate adversarial images against an ensemble of models
For instance, for the targeted attack, we have to solve
argmin_{x_hat} - log( (sum_i a_i J_i(x_hat)) · 1_{y_hat} ) + lambda * d(x, x_hat)
where the J_i are the individual models and the a_i are the ensemble weights
For the optimization-based approach, the Adam update is applied to the image at each step, and this is repeated for 100 iterations
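A sketch of the ensemble-based targeted objective, assuming each model in the list returns softmax probabilities and using uniform ensemble weights by default; hyper-parameters are illustrative:

```python
import torch

def ensemble_targeted_attack(models, x, y_target, weights=None,
                             lam=0.0, steps=100, lr=4.0):
    """Minimize -log( (sum_i a_i J_i(x_hat)) . 1_{y_hat} ) + lam * d(x, x_hat) with Adam."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)   # uniform ensemble weights
    x0 = x.clone().detach()
    x_hat = x0.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([x_hat], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # weighted average of the models' predicted probabilities for the target class
        prob_t = sum(a * m(x_hat)[0, y_target] for a, m in zip(weights, models))
        loss = -torch.log(prob_t + 1e-12) + lam * (x_hat - x0).norm()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            x_hat.clamp_(0.0, 255.0)                  # keep pixels in [0, 255]
    return x_hat.detach()
```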
Non-targeted attacks have almost perfect transferability in the case of the optimization based technique
Notes
Boundaries
Interesting visualization of the predicted classes when perturbing the input, which shows that only a few classes appear in the image plane spanned by one random direction and the gradient direction (about 20 classes present out of the 1000 ImageNet classes)
And indeed, along the gradient direction, the predicted class changes quickly
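A rough sketch of how such a class map over the (gradient direction, random direction) plane could be reproduced; the grid extent and resolution are arbitrary choices, not the paper's:

```python
import torch

def decision_plane(model, x, y, extent=50.0, n=41):
    """Classify every point of the plane around x spanned by the (normalized) loss-gradient
    direction and a random direction, and count how many distinct classes appear."""
    x = x.clone().detach().requires_grad_(True)
    prob_y = model(x)[0, y]
    torch.log(1.0 - prob_y + 1e-12).backward()   # same loss l as above
    g = x.grad.detach()
    g = g / g.norm()                             # gradient direction
    r = torch.randn_like(g)
    r = r / r.norm()                             # random direction
    coords = torch.linspace(-extent, extent, n)
    labels = torch.empty(n, n, dtype=torch.long)
    with torch.no_grad():
        for i, u in enumerate(coords):
            for j, v in enumerate(coords):
                point = (x.detach() + u * g + v * r).clamp(0.0, 255.0)
                labels[i, j] = model(point)[0].argmax().item()
    return labels, labels.unique().numel()       # class map + number of distinct classes
```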