# Microsoft ResNet
## Citation
@article{He2015,
author = {Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun},
title = {Deep Residual Learning for Image Recognition},
journal = {arXiv preprint arXiv:1512.03385},
year = {2015}
}
## Deep neural network
A deep neural network (DNN) is an ANN with multiple hidden layers between the input and output layers. DNN architectures generate compositional models where the object is expressed as a layered composition of primitives. The extra layers enable composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network. (https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks)
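As a minimal sketch of this idea (not from the original wiki), the PyTorch snippet below stacks a few hidden layers; the layer widths and the 10-class output are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# A minimal "deep" feed-forward network: multiple hidden layers between
# input and output. The widths below are arbitrary illustrative values.
dnn = nn.Sequential(
    nn.Linear(784, 256),  # input -> hidden layer 1
    nn.ReLU(),
    nn.Linear(256, 128),  # hidden layer 2
    nn.ReLU(),
    nn.Linear(128, 10),   # output layer (e.g., 10 classes)
)

x = torch.randn(32, 784)  # a batch of 32 flattened inputs
logits = dnn(x)
print(logits.shape)       # torch.Size([32, 10])
```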
## Results on ImageNet
| model | top-1 err. | top-5 err. |
|---|---|---|
| VGG-16 | 28.5% | 9.9% |
| ResNet-50 | 24.7% | 7.8% |
| ResNet-101 | 23.6% | 7.1% |
| ResNet-152 | 23.0% | 6.7% |
### 1. Degradation
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
### 2. Residual Learning
Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.
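As a hedged illustration of this residual formulation, the PyTorch sketch below implements a basic two-layer block whose output is F(x) + x; the class name, channel count, and input size are illustrative assumptions rather than the paper's own code:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two stacked 3x3 conv layers learn F(x); the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))  # first layer of F
        residual = self.bn2(self.conv2(residual))      # second layer of F (no ReLU yet)
        out = residual + x                             # identity shortcut: F(x) + x
        return self.relu(out)                          # second nonlinearity after the addition

x = torch.randn(1, 64, 56, 56)      # illustrative input: 64 channels, 56x56 feature map
block = BasicResidualBlock(64)
print(block(x).shape)               # torch.Size([1, 64, 56, 56]) -- dimensions preserved
```

Note that the second ReLU is applied after the element-wise addition, matching the description in the next subsection.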
### 3. Identity Mapping by Shortcuts
The building block is defined as y = F(x, {Wi}) + x (Eqn. 1), where x and y are the input and output vectors of the layers considered. The function F(x, {Wi}) represents the residual mapping to be learned. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition. The shortcut connections in Eqn. (1) introduce neither extra parameters nor computational complexity.
The dimensions of x and F must be equal in Eqn. (1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connections to match the dimensions: y = F(x, {Wi}) + Ws x (Eqn. 2).
The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers, while more layers are possible. But if F has only a single layer, Eqn. (1) is similar to a linear layer, y = W1x + x, for which we have not observed advantages.
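Continuing the sketch above (again illustrative, not the authors' code), a projection shortcut can be written as a strided 1×1 convolution playing the role of Ws when the spatial size and channel count change:

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    """Residual block whose shortcut uses a linear projection Ws
    (a strided 1x1 convolution) to match the changed dimensions."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Ws: 1x1 convolution projecting x to the new channel count and spatial size
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + self.shortcut(x))  # F(x) + Ws x

x = torch.randn(1, 64, 56, 56)
block = ProjectionResidualBlock(64, 128)      # halves spatial size, doubles channels
print(block(x).shape)                         # torch.Size([1, 128, 28, 28])
```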
### 4. Layers of ResNet
Residual nets with a depth of up to 152 layers are 8× deeper than VGG nets [40] while still having lower complexity.
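As a rough sanity check (assuming torchvision is available; parameter count is only a crude proxy for the FLOP comparison made in the paper), one can compare the model sizes directly:

```python
import torchvision.models as models

def count_params(model):
    # Total number of trainable and non-trainable parameters
    return sum(p.numel() for p in model.parameters())

vgg16 = models.vgg16()          # roughly 138M parameters
resnet152 = models.resnet152()  # roughly 60M parameters despite 152 layers

print(f"VGG-16:     {count_params(vgg16) / 1e6:.1f}M params")
print(f"ResNet-152: {count_params(resnet152) / 1e6:.1f}M params")
```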