【Note】Introduction to DNNs
- RNN & LSTM
- AlexNet
- VGG
- GAN
- GoogLeNet
Wiki-RNN: https://en.wikipedia.org/w/index.php?title=Recurrent_neural_network&oldid=1220695644
Wiki-LSTM: https://en.wikipedia.org/w/index.php?title=Long_short-term_memory&oldid=1220425877
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems, 25. https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
AlexNet was designed by Alex Krizhevsky together with Ilya Sutskever and his advisor Geoffrey Hinton, and it won the 2012 ImageNet competition. It was after that year that more and deeper neural networks were proposed, such as VGG and GoogLeNet. The officially released model reaches a top-1 accuracy of 57.1% and a top-5 accuracy of 80.2%, which is quite outstanding compared with traditional machine-learning classification algorithms.
> [!IMPORTANT]
> ILSVRC-2012 Competition: Classification (1st)
Note: there is a typo in the original AlexNet paper: the size of the input images is $227 \times 227$ instead of $224 \times 224$.
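This can be checked with the standard convolution output-size formula $\lfloor (W - F + 2P) / S \rfloor + 1$. A minimal sketch in plain Python, using only the Conv1 hyperparameters from the table below:

```python
def conv_out(w, f, s, p=0):
    """Output width of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

# Conv1 in AlexNet: 11x11 kernel, stride 4, no padding.
print(conv_out(227, 11, 4))  # 55 -- matches the 96 x 55 x 55 feature map
print(conv_out(224, 11, 4))  # 54 -- (224 - 11) is not divisible by 4, hence the typo
```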
The following table summarizes the network structure of AlexNet, its parameter counts, and the forward-pass computation (multiply-add operations):
Size / Operation | Filter | Depth | Stride | Padding | Number of Parameters | Forward Computation
---|---|---|---|---|---|---
3 × 227 × 227 | | | | | |
Conv1 + ReLU | 11 × 11 | 96 | 4 | 0 | (11 × 11 × 3 + 1) × 96 = 34,944 | (11 × 11 × 3 + 1) × 96 × 55 × 55 = 105,705,600
96 × 55 × 55 | | | | | |
Max Pooling | 3 × 3 | | 2 | | |
96 × 27 × 27 | | | | | |
Norm | | | | | |
Conv2 + ReLU | 5 × 5 | 256 | 1 | 2 | (5 × 5 × 96 + 1) × 256 = 614,656 | (5 × 5 × 96 + 1) × 256 × 27 × 27 = 448,084,224
256 × 27 × 27 | | | | | |
Max Pooling | 3 × 3 | | 2 | | |
256 × 13 × 13 | | | | | |
Norm | | | | | |
Conv3 + ReLU | 3 × 3 | 384 | 1 | 1 | (3 × 3 × 256 + 1) × 384 = 885,120 | (3 × 3 × 256 + 1) × 384 × 13 × 13 = 149,585,280
384 × 13 × 13 | | | | | |
Conv4 + ReLU | 3 × 3 | 384 | 1 | 1 | (3 × 3 × 384 + 1) × 384 = 1,327,488 | (3 × 3 × 384 + 1) × 384 × 13 × 13 = 224,345,472
384 × 13 × 13 | | | | | |
Conv5 + ReLU | 3 × 3 | 256 | 1 | 1 | (3 × 3 × 384 + 1) × 256 = 884,992 | (3 × 3 × 384 + 1) × 256 × 13 × 13 = 149,563,648
256 × 13 × 13 | | | | | |
Max Pooling | 3 × 3 | | 2 | | |
256 × 6 × 6 | | | | | |
Dropout (rate 0.5) | | | | | |
FC6 + ReLU | | | | | 256 × 6 × 6 × 4096 = 37,748,736 | 256 × 6 × 6 × 4096 = 37,748,736
4096 | | | | | |
Dropout (rate 0.5) | | | | | |
FC7 + ReLU | | | | | 4096 × 4096 = 16,777,216 | 4096 × 4096 = 16,777,216
4096 | | | | | |
FC8 + Softmax | | | | | 4096 × 1000 = 4,096,000 | 4096 × 1000 = 4,096,000
1000 classes | | | | | |
Overall | | | | | 62,369,152 ≈ 62.3 million | 1,135,906,176 ≈ 1.1 billion
Conv vs. FC | | | | | Conv: 3.7 million (6%), FC: 58.6 million (94%) | Conv: 1.08 billion (95%), FC: 58.6 million (5%)
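The counts in the table can be reproduced mechanically. A minimal sketch of the arithmetic in plain Python (hyperparameters taken from the table; note that the table counts FC weights without biases):

```python
def conv_params(k, c_in, c_out):
    """(k * k * c_in weights + 1 bias) per filter, times c_out filters."""
    return (k * k * c_in + 1) * c_out

def conv_forward_ops(k, c_in, c_out, out_hw):
    """Forward computation: every parameter is applied at each output position."""
    return conv_params(k, c_in, c_out) * out_hw * out_hw

print(conv_params(11, 3, 96))           # 34944
print(conv_forward_ops(11, 3, 96, 55))  # 105705600

# Fully connected layers dominate the parameter count:
fc = 256 * 6 * 6 * 4096 + 4096 * 4096 + 4096 * 1000
print(fc)  # 58621952 -> ~58.6 million, i.e. 94% of all parameters
```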
Practical guidance for neural network construction (the individual techniques below are combined in a code sketch after this list):
- Rectified Linear Units (ReLUs), defined as $f(x) = \max(0, x)$, are proposed as a non-saturating activation function that accelerates training. ReLUs also have the desirable property that they do not require input normalization to prevent them from saturating. A four-layer convolutional neural network with ReLUs reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons.
- Local normalization before activation aids generalization. A form of brightness normalization is proposed that replaces the activity $a^{i}_{x,y}$ of a neuron computed by kernel $i$ at position $(x, y)$ with $b^{i}_{x,y} = a^{i}_{x,y} \big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta}$, where the sum runs over $n$ "adjacent" kernel maps at the same spatial position and $N$ is the total number of kernels in the layer. Response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
- Overlapping pooling with stride $s = 2$ and window size $z = 3$ reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, compared with non-overlapping pooling ($s = 2$, $z = 2$). This technique also makes the network slightly more difficult to overfit.
- Data augmentation consists of extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training the network on these extracted patches. This increases the size of the training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, the network suffers from substantial overfitting, which would have forced the use of much smaller networks. The authors also perform Principal Component Analysis (PCA) on the set of RGB pixel values for data augmentation. This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination, and it reduces the top-1 error rate by over 1%.
- Dropout is adopted, setting the output of each hidden neuron to zero with probability 0.5. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. Without dropout, the network exhibits substantial overfitting; dropout roughly doubles the number of iterations required to converge.
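Putting the pieces together: below is a minimal PyTorch sketch of an AlexNet-style network, for illustration only (it is not the authors' original two-GPU implementation). Layer ordering follows the table above; the LRN constants ($n = 5$, $\alpha = 10^{-4}$, $\beta = 0.75$, $k = 2$) are the values reported in the paper.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """AlexNet-style net: ReLU + local response norm + overlapping pooling + dropout."""

    def __init__(self, num_classes=1000):
        super().__init__()

        def lrn():
            return nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

        def pool():  # overlapping pooling: window z=3 is larger than stride s=2
            return nn.MaxPool2d(kernel_size=3, stride=2)

        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(), pool(), lrn(),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(), pool(), lrn(),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), pool(),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(p=0.5), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

model = AlexNetSketch()
print(sum(p.numel() for p in model.parameters()))  # ~62.4M (the table's 62.3M omits FC biases)
print(model(torch.randn(1, 3, 227, 227)).shape)    # torch.Size([1, 1000])
```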
"We also entered our model in the ILSVRC-2012 competition and report our results in Table 2. Since the ILSVRC-2012 test set labels are not publicly available, we cannot report test error rates for all the models that we tried. In the remainder of this paragraph, we use validation and test error rates interchangeably because in our experience they do not differ by more than 0.1% (see Table 2). The CNN described in this paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs gives an error rate of 16.4%. Training one CNN, with an extra sixth convolutional layer over the last pooling layer, to classify the entire ImageNet Fall 2011 release (15M images, 22K categories), and then “fine-tuning” it on ILSVRC-2012 gives an error rate of 16.6%. Averaging the predictions of two CNNs that were pre-trained on the entire Fall 2011 release with the aforementioned five CNNs gives an error rate of 15.3%. The second-best contest entry achieved an error rate of 26.2% with an approach that averages the predictions of several classifiers trained on FVs computed from different types of densely-sampled features1."
[1] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. ILSVRC-2012, 2012. URL http://www.image-net.org/challenges/LSVRC/2012/.

Model | Top-1 (val) | Top-5 (val) | Top-5 (test)
---|---|---|---
Second-best entry | / | / | 26.2%
AlexNet | 36.7% | 15.4% | 15.3%
Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations. https://www.semanticscholar.org/paper/Very-Deep-Convolutional-Networks-for-Large-Scale-Simonyan-Zisserman/eb42cf88027de515750f230b23b1a057dc782108
VGG stands for the Visual Geometry Group, a research group in the Department of Engineering Science at the University of Oxford. The group has released a series of convolutional network models prefixed with VGG, ranging from VGG11 to VGG19, which can be applied to face recognition and image classification. The original purpose of VGG's research on the depth of convolutional networks was to understand how depth affects the accuracy of large-scale image classification and recognition. To deepen the network while avoiding too many parameters, small 3 × 3 convolution kernels are used in all layers.
> [!IMPORTANT]
> ILSVRC-2014 Competition: Localization (1st), Classification (2nd)
The input to VGG is an RGB image of size 224 × 224. The mean RGB value, computed over the training set, is subtracted from each pixel before the image is fed into the VGG convolutional network. Convolutions use 3 × 3 or 1 × 1 filters with a fixed stride. Every variant has 3 fully connected layers; the variants range from VGG11 to VGG19 according to the total number of convolutional plus fully connected layers. The smallest, VGG11, has 8 convolutional layers and 3 fully connected layers; the largest, VGG19, has 16 convolutional layers and 3 fully connected layers. In addition, not every convolutional layer is followed by a pooling layer; there are 5 pooling layers in total, distributed after different convolutional layers. The following figure shows the VGG structure:


- Smaller 3 × 3 convolution kernels and a deeper network are used. A stack of two 3 × 3 convolution kernels has the same receptive field as one 5 × 5 kernel, and a stack of three 3 × 3 kernels has the same receptive field as one 7 × 7 kernel. On the one hand, this needs fewer parameters (per channel, three stacked 3 × 3 kernels have only (3 × 3 × 3) / (7 × 7) ≈ 55% of the parameters of one 7 × 7 kernel); on the other hand, the additional non-linear transformations increase the CNN's ability to learn features. Both claims are checked in the sketch after this list.
- In the convolutional structure of VGGNet, 1 × 1 convolution kernels are introduced. Without affecting the input and output dimensions, they introduce an additional non-linear transformation that increases the expressive power of the network while reducing the amount of computation.
- During training, a simple (shallow) VGG A-level network is trained first, and its weights are then used to initialize the deeper models that follow, speeding up the convergence of training.
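The receptive-field and parameter claims above can be checked with a little arithmetic. A minimal sketch in plain Python:

```python
def receptive_field(num_3x3_layers):
    """Receptive field of stacked 3x3, stride-1 convolutions grows by 2 per layer."""
    return 1 + 2 * num_3x3_layers

print(receptive_field(2))  # 5 -- two stacked 3x3 convs see a 5x5 window
print(receptive_field(3))  # 7 -- three stacked 3x3 convs see a 7x7 window

# Per-channel weight comparison: three 3x3 kernels vs. one 7x7 kernel.
print(3 * 3 * 3, 7 * 7, f"{3 * 3 * 3 / (7 * 7):.0%}")  # 27 49 55%
```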


Ren, Y., & Li, X. (2023). Predicting the Daily Sea Ice Concentration on a Subseasonal Scale of the Pan-Arctic During the Melting Season by a Deep Learning Model. IEEE Transactions on Geoscience and Remote Sensing, 61, 1–15. https://doi.org/10.1109/TGRS.2023.3279089
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going Deeper With Convolutions. Computer Vision and Pattern Recognition (CVPR), 1–9. https://openaccess.thecvf.com/content_cvpr_2015/html/Szegedy_Going_Deeper_With_2015_CVPR_paper.html
Recurrent neural network. (2024). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Recurrent_neural_network&oldid=1220695644
Long short-term memory. (2024). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Long_short-term_memory&oldid=1220425877
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems, 25. https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations. https://www.semanticscholar.org/paper/Very-Deep-Convolutional-Networks-for-Large-Scale-Simonyan-Zisserman/eb42cf88027de515750f230b23b1a057dc782108
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going Deeper With Convolutions. Computer Vision and Pattern Recognition (CVPR), 1–9. https://openaccess.thecvf.com/content_cvpr_2015/html/Szegedy_Going_Deeper_With_2015_CVPR_paper.html