Deep Learning Bootcamp 2
Learning Deep Representations, a lecture by Kaiming He, 2024 (https://youtu.be/D_jt-xO_RmI?feature=shared)
Deep learning is representation learning
- represent raw data to solve complex problems
- compression, abstraction, and conceptualization
- raw data in different forms: pixels, words, waves, states, molecules, DNA, ...
- compose simple modules into complex functions
- build multiple levels of abstraction
- learn by backpropagation
- learn from data
- reduce domain knowledge and feature engineering
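A minimal sketch of these ideas in PyTorch (illustrative only; not code from the talk). Simple modules are composed into a deeper function with multiple levels of abstraction and learned by backpropagation from data; the layer sizes and the 4-class toy task are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Compose simple modules into a more complex function with multiple levels of abstraction.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),   # first level of representation
    nn.Linear(32, 32), nn.ReLU(),   # second level
    nn.Linear(32, 4),               # task head (toy 4-class problem)
)

x = torch.randn(8, 16)               # a batch of raw inputs
target = torch.randint(0, 4, (8,))   # labels: the model learns from data
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()                      # backpropagation computes gradients for every parameter
```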
Going deep with neural nets
- LeNet (1989): Backpropagation Applied to Handwritten Zip Code Recognition
- convolution = local connections + spatial weight sharing
- pooling: produces smaller feature maps and achieves local invariance
- fully connected layer
- trained end-to-end by backpropagation (sketch below)
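A LeNet-style sketch in PyTorch (simplified, not the exact 1989 architecture: ReLU replaces the original saturating nonlinearities, and the class name TinyLeNet is mine). It shows the three ingredients above: convolution as local connections with spatial weight sharing, pooling for smaller maps and local invariance, then fully connected layers.

```python
import torch
import torch.nn as nn

class TinyLeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.AvgPool2d(2),   # conv + pool
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.AvgPool2d(2),  # conv + pool
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 120), nn.ReLU(),   # fully connected layers
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = TinyLeNet()(torch.randn(1, 1, 28, 28))  # MNIST-sized input -> class scores
```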
- AlexNet (2012): ImageNet Classification with Deep Convolutional Neural Networks
- data scaling: 1.28M images, 1K classes
- model scaling: 60M parameters
- reduce overfitting: data augmentation, dropout
- GPU training
- deeper: higher level abstraction
- wider (more channels): richer set of features
- ReLU
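A hedged sketch of two of AlexNet's anti-overfitting ingredients plus ReLU, using torchvision transforms as a stand-in for the paper's crop-and-flip augmentation (not the exact recipe, which also used color jitter); the classifier head mirrors the paper's 4096-4096-1000 fully connected layers.

```python
import torch.nn as nn
from torchvision import transforms

# Data augmentation: random crops and horizontal flips of the training images.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Dropout in the fully connected layers, with ReLU as the nonlinearity.
head = nn.Sequential(
    nn.Dropout(p=0.5),       # randomly drop units during training
    nn.Linear(4096, 4096),
    nn.ReLU(),               # non-saturating activation
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),   # 1K ImageNet classes
)
```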
- Visualizing ConvNet (2013): Visualizing and Understanding Convolutional Networks
- finds what input patterns produce a given feature (maps activations back to pixel space)
- Deep representations are transferable: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition; CNN Features off-the-shelf: an Astounding Baseline for Recognition
- the single most important discovery of the DL revolution
- pre-train on large-scale data, then fine-tune on small-scale data (see the sketch below)
- enable DL on small datasets
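A sketch of the pre-train and fine-tune workflow using torchvision (illustrative only; resnet18 is used simply as a convenient pre-trained backbone even though ResNet appears later in these notes, and num_small_classes is a placeholder for the small target dataset).

```python
import torch.nn as nn
from torchvision import models

num_small_classes = 10                               # placeholder for the small target task
backbone = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained on large-scale data
for p in backbone.parameters():
    p.requires_grad = False                          # freeze the transferred features
backbone.fc = nn.Linear(backbone.fc.in_features, num_small_classes)  # new trainable head
# Fine-tune: train only the new head (or later unfreeze some layers) on the small dataset.
```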
- VGG (2014): Very Deep Convolutional Networks for Large-Scale Image Recognition
- simple, highly modularized design
- only 3x3 conv
- simple rules for setting channel counts
- stack the same modules
- up to 19 layers
- clear evidence: deeper is better
- stage-wise training
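A stage builder in the VGG spirit, assuming PyTorch (the helper name vgg_stage is mine; the channel rule and conv counts below follow the paper's configuration D, i.e. VGG-16): only 3x3 convolutions, stacked in identical modules, with channels doubling after each 2x2 max-pool.

```python
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halve spatial size per stage
    return nn.Sequential(*layers)

features = nn.Sequential(
    vgg_stage(3, 64, 2), vgg_stage(64, 128, 2), vgg_stage(128, 256, 3),
    vgg_stage(256, 512, 3), vgg_stage(512, 512, 3),       # VGG-16-style configuration
)
```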
- network initialization (2010): Understanding the difficulty of training deep feedforward neural networks
- troubles accumulate in propagation
- exploding/vanishing signals
- Xavier initialization: n_d * Var(W_d) = 1, where n_d is the fan-in of layer d
- a form of normalization derived analytically rather than computed from data
- valid only at the beginning of training
- network initialization for ReLU (2015): Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
- Kaiming initialization: 1/2 * n_d * Var(W_d) = 1 (the factor 1/2 accounts for ReLU)
- enables training VGG from scratch, without stage-wise pre-training
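The two conditions turned into code (PyTorch; the helper init_linear is mine). Note that torch's built-in xavier_* initializers average fan-in and fan-out, while the formula above is stated for fan-in only, so only the Kaiming built-in is shown as an equivalent.

```python
import math
import torch.nn as nn

# Xavier:  n_d * Var(W_d) = 1        ->  std = sqrt(1 / fan_in)
# Kaiming: 1/2 * n_d * Var(W_d) = 1  ->  std = sqrt(2 / fan_in), since ReLU zeroes half the signal
def init_linear(layer, relu=True):
    fan_in = layer.weight.shape[1]
    std = math.sqrt((2.0 if relu else 1.0) / fan_in)
    nn.init.normal_(layer.weight, mean=0.0, std=std)
    nn.init.zeros_(layer.bias)

layer = nn.Linear(256, 256)
init_linear(layer)                                           # Kaiming-style scaling
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")   # library equivalent
```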
- GoogLeNet/Inception (2015): Going Deeper with Convolutions
- deep and economical ConvNets
- multiple branches
- 1x1 bottleneck: dimension reduction
- 1x1 shortcut
- rich design space for compute/accuracy trade-offs
- complex multi-branch topologies break the assumptions behind analytical initialization
- motivates normalization methods
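An Inception-style module sketch in PyTorch (the class name MiniInception is mine; branch widths follow the paper's inception(3a) block, with ReLUs omitted for brevity): parallel branches, with 1x1 convolutions used as bottlenecks to reduce the channel dimension before the expensive 3x3/5x5 convolutions.

```python
import torch
import torch.nn as nn

class MiniInception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                    # plain 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1),     # 1x1 bottleneck, then 3x3
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),     # 1x1 bottleneck, then 5x5
                                nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))     # pool branch with 1x1 projection

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

out = MiniInception(192)(torch.randn(1, 192, 28, 28))        # -> (1, 64+128+32+32, 28, 28)
```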
- Normalization modules (2015): Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- keep signals normalized throughout the training
- batch norm, layer norm, instance norm, group norm
- enable models that are otherwise not trainable
- speed up convergence
- improve accuracy
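A from-scratch batch-norm forward pass to show what "keeping signals normalized" means: normalize each channel over the batch, then rescale with learnable gamma/beta. Layer, instance, and group norm differ only in which axes the statistics are averaged over. (Training-mode math only; running statistics are omitted.)

```python
import torch
import torch.nn as nn

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=(0, 2, 3), keepdim=True)               # per-channel statistics over the batch
    var = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta  # normalize, then rescale

x = torch.randn(8, 16, 32, 32)
gamma, beta = torch.ones(1, 16, 1, 1), torch.zeros(1, 16, 1, 1)
y = batch_norm_2d(x, gamma, beta)   # matches nn.BatchNorm2d(16) output in training mode
```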
- ResNet (2015): Deep Residual Learning for Image Recognition
- plain nets: accuracy degrades beyond roughly 20 layers
- simple component: identity shortcut
- deep learning gets way deeper
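The key component as code (PyTorch; same-channel basic block, class name ResidualBlock is mine): the block only has to learn a residual F(x), which is added back onto the identity shortcut.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # F(x) + x: the identity shortcut

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
```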
Sequence modeling
- RNN (2016): Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
- recurrent neural nets have loops
- unfold in time
- RNN = local connections + temporal weight sharing
- stack LSTM units
- going deep with residual connections
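A GNMT-flavored sketch (PyTorch; the class name and layer count are mine): stacked LSTM layers, each sharing its weights across time steps, with residual connections between layers so the stack can go deep.

```python
import torch
import torch.nn as nn

class ResidualLSTMStack(nn.Module):
    def __init__(self, dim, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([nn.LSTM(dim, dim, batch_first=True)
                                     for _ in range(num_layers)])

    def forward(self, x):                 # x: (batch, time, dim)
        for lstm in self.layers:
            out, _ = lstm(x)              # the recurrence unfolds over the time axis
            x = x + out                   # residual connection between stacked layers
        return x

h = ResidualLSTMStack(dim=128)(torch.randn(2, 50, 128))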
- CNN (2016): WaveNet: A Generative Model for Raw Audio
- convolution along the 1D sequence
- deeper for longer context
- causal conv with dilation
- going deep with residual connections
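A WaveNet-style causal dilated convolution sketch (PyTorch; the class name CausalConv1d and the channel/kernel sizes are mine): padding only on the left keeps each output causal, and exponentially growing dilations give a long context with few layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, ch, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation           # pad only on the left: no future leakage
        self.conv = nn.Conv1d(ch, ch, kernel_size, dilation=dilation)

    def forward(self, x):                                 # x: (batch, ch, time)
        return self.conv(F.pad(x, (self.pad, 0)))

stack = nn.Sequential(*[CausalConv1d(32, dilation=2 ** i) for i in range(6)])
y = stack(torch.randn(1, 32, 1000))   # receptive field of 2^6 samples with only 6 layers
```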
- Attention (2017): Attention Is All You Need
- every node can see every other node
- the attention operation itself is parameter-free
- parameter layers are feed-forward: Q, K, V projections and MLP block
- going deep: LayerNorm and residual connections
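Single-head self-attention as a sketch (PyTorch; the class name is mine): softmax(QK^T/sqrt(d))V has no parameters of its own, while the learnable parts are the Q/K/V projections and, in a full Transformer block, the MLP, with LayerNorm and residual connections wrapped around both.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, x):                              # x: (batch, tokens, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)      # the only parameters are these projections
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(x.shape[-1]), dim=-1)
        return attn @ v                                # every token attends to every other token

y = SelfAttention(64)(torch.randn(2, 10, 64))
```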
- ViT (2020): An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- sequences of image patches
- transformers on patches
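A sketch of the ViT front end (PyTorch; 16x16 patches at 224x224 resolution as in the paper, but only 2 encoder layers here, and the class token and position embeddings are omitted for brevity): the image becomes a sequence of patch tokens fed to a standard Transformer encoder.

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # one token ("word") per 16x16 patch

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)          # (1, 196, 768): a 14x14 patch grid
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,                                             # the paper uses many more layers
)
features = encoder(tokens)
```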
- AlphaFold (2021): Highly accurate protein structure prediction with AlphaFold
- representation learning for protein folding
- amino acid sequences -> protein structures
- 48 transformer blocks