1704.08847 - hassony2/inria-research-wiki GitHub Wiki

ICML 2017

[arxiv 1704.08847] Parseval Networks: Improving Robustness to Adversarial Examples [PDF] [notes]

Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, Nicolas Usunier

read 07/08/2017

Objective

Introduce Parseval networks, which constrain the Lipschitz constants of linear, convolutional and aggregation layers to be at most 1

Show that these networks are more robust to adversarial perturbations, tend to train faster, and make better use of the full capacity of the network

Synthesis

Controlling the Lipschitz constants of the layers of a neural network allows for better generalization and robustness to adversarial examples.

Indeed, one can define

L(W) = E_{(x, y) ~ data} [ l(g(x, W), y) ]

L_adv(W, p, eps) = E_{(x, y) ~ data} [ max_{||x_hat - x||_p <= eps} l(g(x_hat, W), y) ]

We have

L(W) <= L_adv(W, p, eps)

and L_adv(W, p, eps) <= L(W) + lambda_l * lambda_g * eps, where lambda_l and lambda_g are the Lipschitz constants (with respect to ||.||_p) of the loss l and of the network function g respectively

Thus, controlling the Lipschitz constant of the network should allow for controlling the adversarial loss.
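The second inequality follows from a short chaining argument (a sketch, using the lambda_l, lambda_g notation above): for any x_hat with $\|\hat{x} - x\|_p \le \epsilon$,

$l(g(\hat{x}, W), y) \le l(g(x, W), y) + \lambda_l \, \| g(\hat{x}, W) - g(x, W) \|_p \le l(g(x, W), y) + \lambda_l \lambda_g \, \| \hat{x} - x \|_p \le l(g(x, W), y) + \lambda_l \lambda_g \, \epsilon$

Taking the max over x_hat and the expectation over (x, y) gives the bound.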

Lipschitz constants of neural networks

The Lipschitz constant of a node is related to those of its child nodes (the previous layers whose outputs are the node's inputs)

||n(x) - n(x_hat)||_p <= \sum_{n' child of n} lambda_p(n, n') * ||n'(x) - n'(x_hat)||_p

where n(x) is the output of the node given the input x (x is the input of the network)

lambda_p(n, n') is the Lipschitz constant, for ||.||_p and over the space of inputs, of the map x -> phi(n)( W(n), ( n''(x_0 + 1{n'' = n'} (x - x_0)) )_{n'' child of n} ), i.e. of the layer that connects n' and n.

This function fixes the input of all the child nodes to x_0, except for n', whose input is set to x

We have

lambda_p(n) <= \sum_{n' child of n} lambda_p(n, n') * lambda_p(n')
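A toy sketch of this recursion (assumption: the node names, the DAG, and the per-edge constants below are all made up for illustration), propagating the per-edge constants lambda_p(n, n') through a small graph:

```python
# Propagate per-edge Lipschitz constants through a toy DAG of nodes.
from functools import lru_cache

children = {"out": ["agg"], "agg": ["conv", "skip"], "conv": ["in"], "skip": ["in"]}
edge_lip = {("out", "agg"): 1.3, ("agg", "conv"): 1.0, ("agg", "skip"): 1.0,
            ("conv", "in"): 0.9, ("skip", "in"): 1.0}

@lru_cache(maxsize=None)
def lip_bound(node):
    """Upper bound on lambda_p(node), with lambda_p(input) = 1 by convention."""
    if node == "in":
        return 1.0
    return sum(edge_lip[(node, c)] * lip_bound(c) for c in children[node])

print(lip_bound("out"))  # 1.3 * (0.9 + 1.0) = 2.47 for these toy numbers
```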

Lipschitz constants

Linear layer

n(x) = W(n)n'(x)

||W(n)||_p = sup_{z : ||z||_p = 1} ||W(n) z||_p

for p=2, this is the spectral norm of W(n), and therefore the maximum singular value of W(n)

We then have the relationship between consecutive nodes

lambda_2(n) <= ||W(n)||_2 * lambda_2(n')

where n' is the unique child of n

for p = inf, ||W(n)||_inf is the maximum 1-norm of the rows of W(n), and lambda_inf(n) <= ||W(n)||_inf * lambda_inf(n')
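A minimal numpy check of these two operator norms for a dense layer x -> W x (the shape below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))

lip_2 = np.linalg.norm(W, ord=2)            # largest singular value of W (spectral norm)
lip_inf = np.linalg.norm(W, ord=np.inf)     # max over rows of the row's 1-norm

assert np.isclose(lip_2, np.linalg.svd(W, compute_uv=False).max())
assert np.isclose(lip_inf, np.abs(W).sum(axis=1).max())
print(lip_2, lip_inf)
```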

Convolutional layers

a convolution can be rewritten as a multiplication by a matrix

for this, the input is unfolded by an operator U into a matrix that essentially unrolls and repeats the values of the input (one column per sliding window of the kernel)

n(x) = W(n)U(n'(x))

U has a Lipschitz constant of sqrt(width of the convolution)

therefore lambda_2(n) <= sqrt(width of conv) * ||W(n)||_2 * lambda_2(n')

lambda_inf(n) <= ||W(n)||_inf * lambda_inf(n')
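A small numpy sketch of the unfolding view (assumptions: 1-D convolution, stride 1, no padding, a single filter; all shapes and values are illustrative):

```python
import numpy as np

def unfold_1d(x, k):
    """U(x): stack the k-wide sliding windows of x as columns, shape (k, L - k + 1)."""
    return np.stack([x[i:i + k] for i in range(len(x) - k + 1)], axis=1)

rng = np.random.default_rng(0)
k, x = 3, rng.normal(size=16)
w = rng.normal(size=k)        # a single convolution filter
W_mat = w[None, :]            # the filter written as a (1, k) matrix row

out_conv = np.convolve(x, w[::-1], mode="valid")   # cross-correlation via np.convolve with flipped kernel
out_matmul = (W_mat @ unfold_1d(x, k))[0]          # n(x) = W U(x)
assert np.allclose(out_conv, out_matmul)

# The bound from the notes: lambda_2(n) <= sqrt(k) * ||W_mat||_2 * lambda_2(n')
print(np.sqrt(k) * np.linalg.norm(W_mat, ord=2))
```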

Aggregation layers

For layers that perform the sum of their inputs

lambda_p(n,n') = 1

and thus lambda_p(n) <= \sum_{n' child of n} lambda_p(n')

if n is an element-wise application of a function such as ReLU, we have lambda_p(n) <= lambda_p(n') as soon as the Lipschitz constant of that function is <= 1

Parseval networks

Idea: constrain the Lipschitz constant of each layer to be <= 1 so that the Lipschitz constant of the whole network does not grow exponentially with depth

Linear layers

We have to force the spectral norm of the weight matrix to stay below 1

This can be obtained by forcing W^T W to be approximately the identity (W \in R^{d_out x d_in}), i.e. keeping the rows of W an (approximate) Parseval tight frame

Convolutional layers

The unfolded kernel matrix W \in R^{d_out x (width of conv)} is rescaled by sqrt(width of conv) before applying the constraint

This forces all singular values of W to be at most 1/sqrt(conv width), therefore lambda_2(n) <= lambda_2(n') since sqrt(conv width) * ||W(n)||_2 <= 1

Keeping the rows of the weight matrices orthogonal makes it possible to control both the 2-norm and the inf-norm of the weight matrix through the norms of its individual rows. For the inf-norm, this is achieved by rescaling the rows.
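A small numpy illustration (assumptions: made-up shapes, and orthonormal rows obtained here via QR rather than the paper's update) of why (quasi-)orthonormal rows give this norm control:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 32, 128, 9                            # k: illustrative conv "width"

Q, _ = np.linalg.qr(rng.normal(size=(d_in, d_out)))    # (d_in, d_out), orthonormal columns
W = Q.T                                                # rows of W are orthonormal, W W^T = I

print(np.linalg.norm(W, ord=2))                 # spectral norm ~ 1.0
print(np.linalg.norm(W / np.sqrt(k), ord=2))    # rescaled: ~ 1/sqrt(k)

# inf-norm control: rescale each row to unit 1-norm
W_inf = W / np.abs(W).sum(axis=1, keepdims=True)
print(np.linalg.norm(W_inf, ord=np.inf))        # 1.0
```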

Aggregation layers

Not simple sums but convex combinations of the inputs

n(x) = \sum_{n' in children of n} a(n, n') n'(x)

with \sum_{n'}(a(n, n')) = 1 and a(n, n') > 0

The parameters a(n, n') are learned, but given the constraints they ensure lambda_p(n) <= 1 if the children each satisfy this same inequality

This is obtained essentially through something resembling soft thresholding, which achieves the projection of the coefficient vector onto the set of vectors with positive coefficients summing to 1
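For reference, a sketch of a standard Euclidean projection onto that set (the probability simplex); this is the usual sort-and-threshold algorithm, not necessarily the exact procedure used in the paper:

```python
import numpy as np

def project_simplex(v):
    """Project v onto {a : a >= 0, sum(a) = 1} via sorting and a common threshold."""
    u = np.sort(v)[::-1]                   # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)   # common shift (the "threshold")
    return np.maximum(v + theta, 0.0)

a = project_simplex(np.array([0.7, 1.4, -0.2, 0.5]))
print(a, a.sum())                          # nonnegative coefficients summing to 1
```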

Orthonormality constraint

The weight matrices are optimized on the manifold of orthogonal matrices (the Stiefel manifold)

Simplest optimization: steepest descent combined with a retraction operator to stay on the manifold. This is computationally prohibitive when done exactly, so it is approximated for each layer by minimizing

R_{\beta}(W_k) = \beta/2 * ||W_k^T W_k - I||_2^2

But optimizing R_{\beta} to convergence is still expensive and might take us far from the main gradient update

To speed things up, only one gradient step on R_{\beta} is performed for W_k after each main update

Optionally, only a subset of the rows is chosen for this update

Experimentally, this is observed to maintain the quasi-orthogonality of the matrices
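A hedged numpy sketch of this single approximate-orthonormality step (assumptions: the step-size/constant convention, the shapes, beta and the row fraction below are illustrative and may differ from the paper's exact update):

```python
import numpy as np

def parseval_step(W, beta=0.05, row_frac=1.0, rng=None):
    """One descent step on R_beta(W) = beta/2 * ||W^T W - I||^2, on a random row subset."""
    rng = rng or np.random.default_rng()
    n_rows = max(1, int(row_frac * W.shape[0]))
    rows = rng.choice(W.shape[0], size=n_rows, replace=False)
    S = W[rows]
    W = W.copy()
    W[rows] = (1 + beta) * S - beta * S @ S.T @ S   # pushes singular values of the block toward 1
    return W

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(64, 128))
for _ in range(500):
    W = parseval_step(W, beta=0.05, row_frac=0.3, rng=rng)
print(np.linalg.svd(W, compute_uv=False)[[0, -1]])  # largest/smallest singular values drift toward 1
```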

Experiments

On MNIST, CIFAR-10, CIFAR-100 and SVHN (Street View House Numbers)

Train both fully connected and wide residual networks (which perform as well as ResNets while ensuring faster training through reduced depth)

Comparison between vanilla (with standard weight decay regularization and batch norm + dropout) and Parseval versions of the networks

Also experimented with adversarial training, replacing 50% of each mini-batch with adversarial samples generated with a random eps

For fully connected networks, only used a subset of 30% of the rows of the weight matrix at each iteration

Results

Orthonormality

The regularization indeed yields singular values for each layer close to 1 (vs widely spread for SGD and mostly 0 for SGD with weight decay)

Regularization

Parseval networks match state-of-the-art vanilla networks on CIFAR-10 and SVHN (81% and 98% accuracy respectively)

Thus Parseval regularization is legitimate for clean samples (it does not degrade performance)

Adversarial resistance

Uses the fast gradient sign method (FGSM) to generate adversarial examples
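For reference, a minimal FGSM sketch (assumption: PyTorch is used here for autograd; the model and eps below are placeholders, not the paper's setup):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Return x + eps * sign(grad_x loss), the FGSM adversarial example."""
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).detach()

# toy usage with a random linear "model"
model = torch.nn.Linear(784, 10)
x, y = torch.randn(8, 784), torch.randint(0, 10, (8,))
x_adv = fgsm(model, x, y, eps=0.1)
```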

In the absence of adversarial training, Parseval networks outperform vanilla ones (for an SNR of 40, 55% vs 45% accuracy)

With adversarial training, Parseval networks still mostly outperform vanilla ones. In the low-noise setting, Parseval training gains little from adversarial training, which shows that adversarial training matters mainly in high-noise settings.

Parseval networks achieve robustness to adversarial samples located in the vicinity of data points; adversarial training helps when adversarial examples are further away from legitimate patterns

Notes

Parseval tight frame of a Hilbert space H

Frame

{f_n}_{n \in J}, where the f_n are vectors and J is a countable index set, with the following property:

$\exists A, B > 0, \forall f \in H, \quad A \|f\|^2 \le \sum_{n \in J} |\langle f_n, f \rangle|^2 \le B \|f\|^2$

Any orthonormal basis is a Parseval tight frame (Parseval's identity $\|x\|^2 = \sum_{v \in B} |\langle x, v \rangle|^2$ where B is an orthonormal basis of H)

tight frame

A = B

Parseval tight frame

A=B=1
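An illustrative numeric check of the A = B = 1 case (assumption: the frame vectors are taken as the rows of a tall matrix with orthonormal columns, built here with a QR decomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 40                                  # ambient dimension, number of frame vectors
W, _ = np.linalg.qr(rng.normal(size=(n, d)))   # (n, d) with orthonormal columns: W^T W = I_d

# If the f_n are the rows of W, then sum_n |<f_n, x>|^2 = ||W x||^2 = ||x||^2 for any x.
x = rng.normal(size=d)
assert np.isclose(((W @ x) ** 2).sum(), (x ** 2).sum())
```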

Losses

log-loss <=> cross-entropy loss <=> softmax + negative log-likelihood

l(x, class) = -log( exp(x[class]) / \sum_i exp(x[i]) ) = -x[class] + log( \sum_i exp(x[i]) )

with x being the vector of scores (one per class) and class the index of the correct (ground-truth) class
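A quick numeric check of this identity (toy scores, hypothetical values):

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])   # scores (logits), one per class
cls = 0                          # ground-truth class index

loss_softmax = -np.log(np.exp(x[cls]) / np.exp(x).sum())
loss_logsumexp = -x[cls] + np.log(np.exp(x).sum())
assert np.isclose(loss_softmax, loss_logsumexp)
```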
