# Microsoft ResNet
## Citation
@article{He2015,
author = {Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun},
title = {Deep Residual Learning for Image Recognition},
journal = {arXiv preprint arXiv:1512.03385},
year = {2015}
}
## Deep neural network
A deep neural network (DNN) is an ANN with multiple hidden layers between the input and output layers. DNN architectures generate compositional models where the object is expressed as a layered composition of primitives. The extra layers enable composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network. (https://en.wikipedia.org/wiki/Deep_learning#Deep_neural_networks)
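As a minimal sketch of this idea (not from the original wiki), the PyTorch snippet below stacks a few hidden layers; the layer widths and the 10-class output are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# A minimal "deep" feed-forward network: multiple hidden layers between
# input and output. The widths below are arbitrary illustrative values.
dnn = nn.Sequential(
    nn.Linear(784, 256),  # input -> hidden layer 1
    nn.ReLU(),
    nn.Linear(256, 128),  # hidden layer 2
    nn.ReLU(),
    nn.Linear(128, 10),   # output layer (e.g., 10 classes)
)

x = torch.randn(32, 784)  # a batch of 32 flattened inputs
logits = dnn(x)
print(logits.shape)       # torch.Size([32, 10])
```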
## Results on ImageNet
| model | top-1 err. | top-5 err. |
|---|---|---|
| VGG-16 | 28.5% | 9.9% |
| ResNet-50 | 24.7% | 7.8% |
| ResNet-101 | 23.6% | 7.1% |
| ResNet-152 | 23.0% | 6.7% |
### 1. Degradation
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
### 2. Residual Learning
Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.
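As a hedged illustration of this residual formulation, the PyTorch sketch below implements a basic two-layer block whose output is F(x) + x; the class name, channel count, and input size are illustrative assumptions rather than the paper's own code:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two stacked 3x3 conv layers learn F(x); the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))  # first layer of F
        residual = self.bn2(self.conv2(residual))      # second layer of F (no ReLU yet)
        out = residual + x                             # identity shortcut: F(x) + x
        return self.relu(out)                          # second nonlinearity after the addition

x = torch.randn(1, 64, 56, 56)      # illustrative input: 64 channels, 56x56 feature map
block = BasicResidualBlock(64)
print(block(x).shape)               # torch.Size([1, 64, 56, 56]) -- dimensions preserved
```

Note that the second ReLU is applied after the element-wise addition, matching the description in the next subsection.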
### 3. Identity Mapping by Shortcuts
The building block is defined as y = F(x, {Wi}) + x (Eqn. 1), where x and y are the input and output vectors of the layers considered. The function F(x, {Wi}) represents the residual mapping to be learned. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition. The shortcut connections in Eqn. (1) introduce neither extra parameters nor computational complexity.
The dimensions of x and F must be equal in Eqn. (1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connections to match the dimensions: y = F(x, {Wi}) + Ws x (Eqn. 2).
The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers, while more layers are possible. But if F has only a single layer, Eqn. (1) is similar to a linear layer, y = W1x + x, for which we have not observed advantages.
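Continuing the sketch above (again illustrative, not the authors' code), a projection shortcut can be written as a strided 1×1 convolution playing the role of Ws when the spatial size and channel count change:

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    """Residual block whose shortcut uses a linear projection Ws
    (a strided 1x1 convolution) to match the changed dimensions."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Ws: 1x1 convolution projecting x to the new channel count and spatial size
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + self.shortcut(x))  # F(x) + Ws x

x = torch.randn(1, 64, 56, 56)
block = ProjectionResidualBlock(64, 128)      # halves spatial size, doubles channels
print(block(x).shape)                         # torch.Size([1, 128, 28, 28])
```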
### 4. Layers of ResNet
Residual nets with a depth of up to 152 layers are 8× deeper than VGG nets [40] while still having lower complexity.
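As a rough sanity check (assuming torchvision is available; parameter count is only a crude proxy for the FLOP comparison made in the paper), one can compare the model sizes directly:

```python
import torchvision.models as models

def count_params(model):
    # Total number of trainable and non-trainable parameters
    return sum(p.numel() for p in model.parameters())

vgg16 = models.vgg16()          # roughly 138M parameters
resnet152 = models.resnet152()  # roughly 60M parameters despite 152 layers

print(f"VGG-16:     {count_params(vgg16) / 1e6:.1f}M params")
print(f"ResNet-152: {count_params(resnet152) / 1e6:.1f}M params")
```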