1704.08847 - hassony2/inria-research-wiki GitHub Wiki
[arxiv 1704.08847] Parseval Networks: Improving Robustness to Adversarial Examples [PDF] [notes]
Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, Nicolas Usunier
read 07/08/2017
Introduce Parseval networks, which constrain the Lipschitz constant of linear, convolutional and aggregation layers to be at most 1
Show that these networks are more robust to adversarial perturbations, tend to train faster, and make better use of the full capacity of the network
Controlling the Lipschitz constant of the layers of a neural network allows for better generalization and robustness to adversarial examples.
Indeed, one can define
L(W) = E_{(x, y) ~ samples} [ l(g(x, W), y) ]
L_adv(W, p, eps) = E_{(x, y) ~ samples} [ max_{||x_hat - x||_p <= eps} l(g(x_hat, W), y) ]
We have
L(W) <= L_adv(W, p, eps)
and L_adv(W, p, eps) <= L(W) + lambda_l * Lambda_p * eps, where lambda_l and Lambda_p are the (||.||_p) Lipschitz constants of the loss l and of the network function g respectively
Thus, controlling the Lipschitz constant of the network should allow for controlling the adversarial loss.
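Sketch of the second inequality (immediate from the definitions, spelled out here for clarity): for any x_hat with ||x_hat - x||_p <= eps, l(g(x_hat, W), y) <= l(g(x, W), y) + lambda_l * Lambda_p * ||x_hat - x||_p <= l(g(x, W), y) + lambda_l * Lambda_p * eps; taking the max over x_hat and then the expectation over (x, y) gives the bound.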
The Lipschitz constant of a node is related to those of its child nodes (the previous layers whose outputs are the node's inputs)
||n(x) - n(x_hat)||_p <= \sum_{n' child of n} lambda_p(n, n') ||n'(x) - n'(x_hat)||_p
where n(x) is the output of node n given the network input x
lambda_p(n, n') is the Lipschitz constant, for ||.||_p, of x -> phi(n)(W(n), (n''(x_0 + 1_{n''=n'}(x - x_0)))_{n'' child of n}), i.e. of the layer that connects n' to n, taken over the space of network inputs
This function fixes the network input fed to every child node to x_0, except for n', which is fed x
We have
lambda_p(n) <= \sum_{n' child of n} lambda_p(n, n') * lambda_p(n')
n(x) = W(n)n'(x)
||W(n)||_p = sup_{z: ||z||_p = 1} ||W(n) z||_p
for p=2, this is the spectral norm of W(n), and therefore the maximum singular value of W(n)
We then have the relationship between consecutive nodes
lambda_2(n) <= ||W(n)||_2 lambda_2(n')
where n' is the unique child of n
for p = inf, ||W(n)||_inf is the maximum 1-norm of the rows of W(n)
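Quick numerical check of these two operator norms (a minimal numpy sketch on a random toy matrix, not tied to the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # toy weight matrix, d_out = 4, d_in = 6

# p = 2: operator norm = largest singular value (spectral norm)
spec_norm = np.linalg.svd(W, compute_uv=False)[0]
assert np.isclose(spec_norm, np.linalg.norm(W, ord=2))

# p = inf: operator norm = maximum 1-norm over the rows of W
inf_norm = np.abs(W).sum(axis=1).max()
assert np.isclose(inf_norm, np.linalg.norm(W, ord=np.inf))

# sanity check of the sup definition ||W||_p = sup_{||z||_p = 1} ||W z||_p
for ord_, op_norm in [(2, spec_norm), (np.inf, inf_norm)]:
    z = rng.normal(size=(6, 1000))
    z /= np.linalg.norm(z, ord=ord_, axis=0)  # random unit-norm columns
    assert (np.linalg.norm(W @ z, ord=ord_, axis=0) <= op_norm + 1e-9).all()
```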
a convolution can be rewritten as a multiplication by a matrix
for this, the input has to be unfolded into a matrix by an operator U that essentially unwinds and repeats the values of the input
n(x) = W(n)U(n'(x))
U has a Lipschitz constant of sqrt(width of the convolution)
therefore lambda_2(n) <= ||W(n)||_2 * sqrt(width of conv) * lambda_2(n')
lambda_inf(n) <= ||W(n)||_inf lambda_inf(n')
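To make the unfolding concrete, a minimal 1-D numpy sketch (my own illustration, not code from the paper; `unfold` plays the role of U, and the sqrt(k) factor comes from each input value being repeated at most k times):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3                       # kernel width
x = rng.normal(size=10)     # 1-D input signal
w = rng.normal(size=k)      # 1-D convolution kernel

def unfold(x, k):
    """U: stack the k-wide sliding windows of x as columns (1-D im2col, 'valid' mode)."""
    return np.stack([x[i:i + k] for i in range(len(x) - k + 1)], axis=1)

U_x = unfold(x, k)          # shape (k, number of output positions)
conv_as_matmul = w @ U_x    # the convolution written as a matrix product
# matches the direct computation (cross-correlation, i.e. deep-learning "convolution")
assert np.allclose(conv_as_matmul, np.correlate(x, w, mode='valid'))

# each value of x appears at most k times in U(x), hence ||U(x)||_2 <= sqrt(k) ||x||_2
assert np.linalg.norm(U_x) <= np.sqrt(k) * np.linalg.norm(x) + 1e-12
```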
For layers that perform the sum of their inputs
lambda_p(n,n') = 1
and thus lambda_p(n) <= \sum_{n' child of n} lambda_p(n')
if n is an element-wise application of a function such as ReLU, we have lambda_p(n) <= lambda_p(n') as soon as the Lipschitz constant of that function is <= 1
Idea: constrain the Lipschitz constant of each layer to be <= 1 so that the Lipschitz constant of the whole network does not explode exponentially with depth
We have to force the spectral norm of the weight matrix to stay below 1
This can be obtained by forcing W^T W to be approximately the identity (with W \in R^{d_out, d_in})
For a convolution, the unfolded kernel matrix W \in R^{d_out, width of conv} is additionally rescaled by 1/sqrt(width of conv)
This forces all singular values of W to be smaller than 1/sqrt(conv width), therefore lambda_2(n) <= lambda_2(n') since ||W(n)||_2 * sqrt(conv width) <= 1
Keeping the rows of the weight matrices orthogonal makes it possible to control both the 2-norm and the inf-norm of the weight matrix through the norms of its individual rows. For the inf-norm, this is achieved by rescaling the rows.
Aggregation layers are not simple sums of their inputs but convex combinations of them
n(x) = \sum_{n' in children of n} a(n, n') n'(x)
with \sum_{n'} a(n, n') = 1 and a(n, n') >= 0
The parameters a(n, n') are learnt, but given the constraints they ensure lambda_p(n) <= 1 if the children each verify this same inequality
This is obtained essentially through something resembling soft thresholding, which computes the projection of the coefficient vector onto the set of non-negative vectors summing to 1 (the probability simplex)
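A minimal sketch of such a projection (the standard sort-and-threshold Euclidean projection onto the probability simplex; the notes do not spell out the exact algorithm used in the paper):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_i >= 0, sum_i a_i = 1},
    implemented as soft thresholding by a shared offset tau."""
    u = np.sort(v)[::-1]                     # sort in decreasing order
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)
    return np.maximum(v - tau, 0.0)          # shift and clip at zero

a = project_to_simplex(np.array([0.8, 1.3, -0.4]))
assert np.isclose(a.sum(), 1.0) and (a >= 0).all()   # a = [0.25, 0.75, 0.]
```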
The weight matrices are optimized on the manifold of orthogonal matrices (Stiefel manifold)
Simplest optimization: steepest descent while staying on the manifold, using a retraction operator. This is computationally prohibitive when done exactly, so it is approximated for each layer by minimizing
R_{\beta}(W_k) = \beta/2 ||W_k^T W_k - I||_2^2
But making R_{\beta} converge is still expensive and might take us far from the gradient update
To speed up: we only do one update of W_k with respect to R_{\beta}
Optionally, only a subset of the rows can be chosen for the update
Experimentally, we observe that this ensures the quasi-orthogonality of the matrix
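A minimal numpy sketch of this approximate step as I read it (in training it would be interleaved with ordinary SGD updates; beta and the row fraction are illustrative values, and step-size constants are absorbed into beta):

```python
import numpy as np

def parseval_update(W, beta, rng, row_fraction=1.0):
    """One descent step on R_beta(W) = beta/2 * ||W^T W - I||^2,
    applied to a randomly sampled subset of the rows of W."""
    S = rng.choice(W.shape[0], size=max(1, int(row_fraction * W.shape[0])), replace=False)
    W_S = W[S]
    W[S] = (1 + beta) * W_S - beta * W_S @ W_S.T @ W_S
    return W

# iterating the step alone (no SGD) drives a random matrix towards quasi-orthonormal rows
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)) * 0.1
for _ in range(2000):
    W = parseval_update(W, beta=0.01, rng=rng)  # full rows here; the notes use 30% subsets
print(np.abs(W @ W.T - np.eye(64)).max())       # close to 0: W W^T is nearly the identity
```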
On MNIST, CIFAR-10, CIFAR-100 and SVHN (Street View House Numbers)
Train both fully connected and wide residual networks (which perform as well as ResNets while ensuring faster training through reduced depth)
Comparison between vanilla networks (with standard weight decay regularization, batch norm and dropout) and Parseval versions of the networks
Also experimented with adversarial training, replacing 50% of each mini-batch with adversarial samples generated with a random eps
For fully connected networks, only a subset of 30% of the rows of the weight matrix is used at each iteration
The regularization indeed yields singular values close to 1 for each layer (vs widely spread for SGD and mostly close to 0 for SGD with weight decay)
Using Parseval networks matches state-of-the-art vanilla networks on CIFAR-10 and SVHN (81% and 98% accuracy respectively)
Thus Parseval regularization is legitimate for clean samples (it does not degrade performance)
Adversarial examples are generated with the fast gradient sign method (FGSM)
In the absence of adversarial training, Parseval networks outperform vanilla ones (at an SNR of 40, 55% vs 45% accuracy)
With adversarial training, Parseval networks mostly outperform vanilla ones. In the low-noise setting, Parseval training is not much boosted by adversarial training, which shows that adversarial training matters mostly in high-noise settings.
Parseval networks achieve robustness to adversarial samples located in the vicinity of the data points; adversarial training helps when adversarial examples are further away from legitimate patterns
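For reference, a self-contained FGSM sketch on a toy linear softmax classifier (numpy only, gradient written out analytically; the model and eps are illustrative, not the paper's setup):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm(x, y, W, b, eps):
    """Fast gradient sign method for a linear softmax classifier g(x) = W x + b:
    move x by eps * sign(d loss / d x), the worst-case step under an l_inf budget."""
    p = softmax(W @ x + b)            # predicted class probabilities
    p[y] -= 1.0                       # d(cross-entropy) / d(logits)
    grad_x = W.T @ p                  # chain rule: d loss / d x
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(10, 32)), np.zeros(10)
x, y = rng.normal(size=32), 3
x_adv = fgsm(x, y, W, b, eps=0.1)
assert np.abs(x_adv - x).max() <= 0.1 + 1e-12   # perturbation stays in the eps l_inf ball
```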
Frame
{f_n}_{n \in J}, where the f_n are vectors of a Hilbert space H and J is a countable index set, with the following property:
$\exists A, B > 0, \forall f \in H, \quad A \|f\|^2 \le \sum_{n \in J} |\langle f_n, f \rangle|^2 \le B \|f\|^2$
Any orthonormal basis is a Parseval tight frame (Parseval's equality)
tight frame
A = B
Parseval tight frame
A=B=1
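Small numerical illustration (my example): the rows of any matrix Q with orthonormal columns (Q^T Q = I) form a Parseval tight frame of the input space, which is the property Parseval networks approximately enforce on their weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Q has orthonormal columns (Q^T Q = I_2), so its 4 rows form a Parseval tight frame of R^2
Q, _ = np.linalg.qr(rng.normal(size=(4, 2)))

f = rng.normal(size=2)
frame_energy = sum(np.dot(q, f) ** 2 for q in Q)   # \sum_n |<f_n, f>|^2
assert np.isclose(frame_energy, np.dot(f, f))      # equals ||f||^2, i.e. A = B = 1
```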
log-loss <=> cross-entropy loss <=> softmax + negative log-likelihood
l(x, class) = -log( exp(x[class]) / \sum_i exp(x[i]) ) = -x[class] + log( \sum_i exp(x[i]) )
where x contains the scores for each class and class is the index of the correct (ground-truth) class
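Quick numerical check of this identity (numpy sketch with made-up scores):

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])   # scores (logits) for 3 classes
cls = 0                          # ground-truth class index

p = np.exp(x) / np.exp(x).sum()                        # softmax
nll_of_softmax = -np.log(p[cls])                       # softmax + negative log-likelihood
log_sum_exp_form = -x[cls] + np.log(np.exp(x).sum())   # -x[class] + log sum_i exp(x[i])
assert np.isclose(nll_of_softmax, log_sum_exp_form)
```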