【Note】Introduction to DNNs
- RNN & LSTM
- AlexNet
- VGG
- GAN
- GoogLeNet
Wiki-RNN: https://en.wikipedia.org/w/index.php?title=Recurrent_neural_network&oldid=1220695644
Wiki-LSTM: https://en.wikipedia.org/w/index.php?title=Long_short-term_memory&oldid=1220425877
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems, 25. https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
AlexNet was designed by Alex Krizhevsky together with Ilya Sutskever and his advisor Geoffrey Hinton, and it won the 2012 ImageNet competition. It was after that year that more and deeper neural networks were proposed, such as VGG and GoogLeNet. The officially released model reaches a top-1 accuracy of 57.1% and a top-5 accuracy of 80.2%, which is quite outstanding compared with traditional machine-learning classification algorithms.
> [!IMPORTANT]
> ILSVRC-2012 Competition: Classification (1st)
Note: there is a typo in the original AlexNet paper: the size of the input images is $227 \times 227$ instead of $224 \times 224$.
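This can be checked with the standard convolution output-size formula $\lfloor (W - F + 2P) / S \rfloor + 1$. A minimal sketch in plain Python, using only the Conv1 hyperparameters from the table below:

```python
def conv_out(w, f, s, p=0):
    """Output width of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

# Conv1 in AlexNet: 11x11 kernel, stride 4, no padding.
print(conv_out(227, 11, 4))  # 55 -- matches the 96 x 55 x 55 feature map
print(conv_out(224, 11, 4))  # 54 -- (224 - 11) is not divisible by 4, hence the typo
```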
The following table summarizes the network structure of AlexNet, its parameter counts, and the forward-pass computation (multiply-add operations):
Size / Operation | Filter | Depth | Stride | Padding | Number of Parameters | Forward Computation
---|---|---|---|---|---|---
3 × 227 × 227 | | | | | |
Conv1 + ReLU | 11 × 11 | 96 | 4 | 0 | (11 × 11 × 3 + 1) × 96 = 34,944 | (11 × 11 × 3 + 1) × 96 × 55 × 55 = 105,705,600
96 × 55 × 55 | | | | | |
Max Pooling | 3 × 3 | | 2 | | |
96 × 27 × 27 | | | | | |
Norm | | | | | |
Conv2 + ReLU | 5 × 5 | 256 | 1 | 2 | (5 × 5 × 96 + 1) × 256 = 614,656 | (5 × 5 × 96 + 1) × 256 × 27 × 27 = 448,084,224
256 × 27 × 27 | | | | | |
Max Pooling | 3 × 3 | | 2 | | |
256 × 13 × 13 | | | | | |
Norm | | | | | |
Conv3 + ReLU | 3 × 3 | 384 | 1 | 1 | (3 × 3 × 256 + 1) × 384 = 885,120 | (3 × 3 × 256 + 1) × 384 × 13 × 13 = 149,585,280
384 × 13 × 13 | | | | | |
Conv4 + ReLU | 3 × 3 | 384 | 1 | 1 | (3 × 3 × 384 + 1) × 384 = 1,327,488 | (3 × 3 × 384 + 1) × 384 × 13 × 13 = 224,345,472
384 × 13 × 13 | | | | | |
Conv5 + ReLU | 3 × 3 | 256 | 1 | 1 | (3 × 3 × 384 + 1) × 256 = 884,992 | (3 × 3 × 384 + 1) × 256 × 13 × 13 = 149,563,648
256 × 13 × 13 | | | | | |
Max Pooling | 3 × 3 | | 2 | | |
256 × 6 × 6 | | | | | |
Dropout (rate 0.5) | | | | | |
FC6 + ReLU | | | | | 256 × 6 × 6 × 4096 = 37,748,736 | 256 × 6 × 6 × 4096 = 37,748,736
4096 | | | | | |
Dropout (rate 0.5) | | | | | |
FC7 + ReLU | | | | | 4096 × 4096 = 16,777,216 | 4096 × 4096 = 16,777,216
4096 | | | | | |
FC8 + Softmax | | | | | 4096 × 1000 = 4,096,000 | 4096 × 1000 = 4,096,000
1000 classes | | | | | |
Overall | | | | | 62,369,152 ≈ 62.3 million | 1,135,906,176 ≈ 1.1 billion
Conv vs. FC | | | | | Conv: 3.7 million (6%), FC: 58.6 million (94%) | Conv: 1.08 billion (95%), FC: 58.6 million (5%)
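The counts in the table can be reproduced mechanically. A minimal sketch of the arithmetic in plain Python (hyperparameters taken from the table; note that the table counts FC weights without biases):

```python
def conv_params(k, c_in, c_out):
    """(k * k * c_in weights + 1 bias) per filter, times c_out filters."""
    return (k * k * c_in + 1) * c_out

def conv_forward_ops(k, c_in, c_out, out_hw):
    """Forward computation: every parameter is applied at each output position."""
    return conv_params(k, c_in, c_out) * out_hw * out_hw

print(conv_params(11, 3, 96))           # 34944
print(conv_forward_ops(11, 3, 96, 55))  # 105705600

# Fully connected layers dominate the parameter count:
fc = 256 * 6 * 6 * 4096 + 4096 * 4096 + 4096 * 1000
print(fc)  # 58621952 -> ~58.6 million, i.e. 94% of all parameters
```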
Practical guidance for neural network construction (the individual techniques below are combined in a code sketch after this list):
- Rectified Linear Units (ReLUs), defined as $f(x) = \max(0, x)$, are proposed as a non-saturating activation function that accelerates training. ReLUs also have the desirable property that they do not require input normalization to prevent them from saturating. A four-layer convolutional neural network with ReLUs reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons.
- Local normalization before activation aids generalization. A form of brightness normalization is proposed that replaces the activity $a^{i}_{x,y}$ of a neuron computed by kernel $i$ at position $(x, y)$ with $b^{i}_{x,y} = a^{i}_{x,y} \big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta}$, where the sum runs over $n$ "adjacent" kernel maps at the same spatial position and $N$ is the total number of kernels in the layer. Response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2%, respectively.
- Overlapping pooling with stride $s = 2$ and window size $z = 3$ reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, compared with non-overlapping pooling ($s = 2$, $z = 2$). This technique also makes the network slightly more difficult to overfit.
- Data augmentation consists of extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training the network on these extracted patches. This increases the size of the training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, the network suffers from substantial overfitting, which would have forced the use of much smaller networks. The authors also perform Principal Component Analysis (PCA) on the set of RGB pixel values for data augmentation. This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination, and it reduces the top-1 error rate by over 1%.
- Dropout is adopted, setting the output of each hidden neuron to zero with probability 0.5. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. Without dropout, the network exhibits substantial overfitting; dropout roughly doubles the number of iterations required to converge.
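Putting the pieces together: below is a minimal PyTorch sketch of an AlexNet-style network, for illustration only (it is not the authors' original two-GPU implementation). Layer ordering follows the table above; the LRN constants ($n = 5$, $\alpha = 10^{-4}$, $\beta = 0.75$, $k = 2$) are the values reported in the paper.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """AlexNet-style net: ReLU + local response norm + overlapping pooling + dropout."""

    def __init__(self, num_classes=1000):
        super().__init__()

        def lrn():
            return nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)

        def pool():  # overlapping pooling: window z=3 is larger than stride s=2
            return nn.MaxPool2d(kernel_size=3, stride=2)

        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(), pool(), lrn(),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(), pool(), lrn(),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), pool(),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(p=0.5), nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

model = AlexNetSketch()
print(sum(p.numel() for p in model.parameters()))  # ~62.4M (the table's 62.3M omits FC biases)
print(model(torch.randn(1, 3, 227, 227)).shape)    # torch.Size([1, 1000])
```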
"We also entered our model in the ILSVRC-2012 competition and report our results in Table 2. Since the ILSVRC-2012 test set labels are not publicly available, we cannot report test error rates for all the models that we tried. In the remainder of this paragraph, we use validation and test error rates interchangeably because in our experience they do not differ by more than 0.1% (see Table 2). The CNN described in this paper achieves a top-5 error rate of 18.2%. Averaging the predictions of five similar CNNs gives an error rate of 16.4%. Training one CNN, with an extra sixth convolutional layer over the last pooling layer, to classify the entire ImageNet Fall 2011 release (15M images, 22K categories), and then “fine-tuning” it on ILSVRC-2012 gives an error rate of 16.6%. Averaging the predictions of two CNNs that were pre-trained on the entire Fall 2011 release with the aforementioned five CNNs gives an error rate of 15.3%. The second-best contest entry achieved an error rate of 26.2% with an approach that averages the predictions of several classifiers trained on FVs computed from different types of densely-sampled features1."
[1] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. ILSVRC-2012, 2012. URL http://www.image-net.org/challenges/LSVRC/2012/.

Model | Top-1 (val) | Top-5 (val) | Top-5 (test)
---|---|---|---
Second-best entry | / | / | 26.2%
AlexNet | 36.7% | 15.4% | 15.3%
Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations. https://www.semanticscholar.org/paper/Very-Deep-Convolutional-Networks-for-Large-Scale-Simonyan-Zisserman/eb42cf88027de515750f230b23b1a057dc782108
VGG stands for the Visual Geometry Group, a research group in the Department of Engineering Science at the University of Oxford. The group has released a series of convolutional network models prefixed with VGG, ranging from VGG11 to VGG19, which can be applied to face recognition and image classification. The original purpose of VGG's research on the depth of convolutional networks was to understand how depth affects the accuracy of large-scale image classification and recognition. To deepen the network while avoiding too many parameters, small 3 × 3 convolution kernels are used in all layers.
> [!IMPORTANT]
> ILSVRC-2014 Competition: Localization (1st), Classification (2nd)
The input to VGG is an RGB image of size 224 × 224. The mean RGB value, computed over the training set, is subtracted from each pixel before the image is fed into the VGG convolutional network. Convolutions use 3 × 3 or 1 × 1 filters with a fixed stride. Every variant has 3 fully connected layers; the variants range from VGG11 to VGG19 according to the total number of convolutional plus fully connected layers. The smallest, VGG11, has 8 convolutional layers and 3 fully connected layers; the largest, VGG19, has 16 convolutional layers and 3 fully connected layers. In addition, not every convolutional layer is followed by a pooling layer; there are 5 pooling layers in total, distributed after different convolutional layers. The following figure shows the VGG structure:


- Smaller 3 × 3 convolution kernels and a deeper network are used. A stack of two 3 × 3 convolution kernels has the same receptive field as one 5 × 5 kernel, and a stack of three 3 × 3 kernels has the same receptive field as one 7 × 7 kernel. On the one hand, this needs fewer parameters (per channel, three stacked 3 × 3 kernels have only (3 × 3 × 3) / (7 × 7) ≈ 55% of the parameters of one 7 × 7 kernel); on the other hand, the additional non-linear transformations increase the CNN's ability to learn features. Both claims are checked in the sketch after this list.
- In the convolutional structure of VGGNet, 1 × 1 convolution kernels are introduced. Without affecting the input and output dimensions, they introduce an additional non-linear transformation that increases the expressive power of the network while reducing the amount of computation.
- During training, a simple (shallow) VGG A-level network is trained first, and its weights are then used to initialize the deeper models that follow, speeding up the convergence of training.
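The receptive-field and parameter claims above can be checked with a little arithmetic. A minimal sketch in plain Python:

```python
def receptive_field(num_3x3_layers):
    """Receptive field of stacked 3x3, stride-1 convolutions grows by 2 per layer."""
    return 1 + 2 * num_3x3_layers

print(receptive_field(2))  # 5 -- two stacked 3x3 convs see a 5x5 window
print(receptive_field(3))  # 7 -- three stacked 3x3 convs see a 7x7 window

# Per-channel weight comparison: three 3x3 kernels vs. one 7x7 kernel.
print(3 * 3 * 3, 7 * 7, f"{3 * 3 * 3 / (7 * 7):.0%}")  # 27 49 55%
```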


Ren, Y., & Li, X. (2023). Predicting the Daily Sea Ice Concentration on a Subseasonal Scale of the Pan-Arctic During the Melting Season by a Deep Learning Model. IEEE Transactions on Geoscience and Remote Sensing, 61, 1–15. https://doi.org/10.1109/TGRS.2023.3279089
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going Deeper With Convolutions. Computer Vision and Pattern Recognition (CVPR), 1–9. https://openaccess.thecvf.com/content_cvpr_2015/html/Szegedy_Going_Deeper_With_2015_CVPR_paper.html
Recurrent neural network. (2024). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Recurrent_neural_network&oldid=1220695644
Long short-term memory. (2024). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Long_short-term_memory&oldid=1220425877
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Neural Information Processing Systems, 25. https://papers.nips.cc/paper_files/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations. https://www.semanticscholar.org/paper/Very-Deep-Convolutional-Networks-for-Large-Scale-Simonyan-Zisserman/eb42cf88027de515750f230b23b1a057dc782108
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going Deeper With Convolutions. Computer Vision and Pattern Recognition (CVPR), 1–9. https://openaccess.thecvf.com/content_cvpr_2015/html/Szegedy_Going_Deeper_With_2015_CVPR_paper.html