Deep_Learning - RicoJia/notes GitHub Wiki

========================================================================

Keys to a successful deep learning project

========================================================================

  1. Data mgmt: data is separated into training, validation (dev), and test sets. You need to train different models and compare them to see which one performs best.
    • Make sure your dev and training sets come from the same distribution.
    • Overfitting = high variance, underfitting = high bias.
      • Variance refers to the gap of "high performance on training data, low performance on test data"; that gap is "overfitting".
      • Bias means the error relative to human-level error, caused by an overly simplistic model. Poor performance on the training set is "underfitting" (high bias).
        • So if human error is 15%, then a 15% ML model error rate is not considered high bias.
        • The same model could show high bias on some data landscapes and high variance on others.

========================================================================

Deep Learning Math

========================================================================

Why Deep Learning?

Any bounded continuous function can be approximated by an arbitrarily wide single layer. Why? The idea is roughly that linear combinations of activation functions can compose pulses (e.g., from sigmoids, or from 4 ReLUs). Then, the pulses can compose an arbitrary function.

Pulses Can Approximate An Arbitrary Function

A Neuron, Batch Gradient Descent

A neuron has multiple inputs and a single output. First it computes the weighted sum of all inputs, then feeds it into an "activation function". Below, the activation function $\sigma(z)$ is the sigmoid function: $$ z = \sum_{i}^{n} w_ix_i = w_0x_0 + w_1 x_1 ... \ y = \sigma(z) = \frac{1}{1 + e^{-z}} $$

Image Source: Stackoverflow

So for a single input x (an n x 1 vector) and its corresponding ground-truth value y, we can get its prediction $\hat{y}$ and cost $J$. In our example, J is cross-entropy, which reaches its minimum when $y = \hat{y}$: $$ \hat{y} = \sigma(\sum_{i}^{n} w_ix_i) \ J = -(ylog(\hat{y}) + (1-y)log(1-\hat{y})) $$ Then, we can apply gradient descent to update the parameters $w$, which is the goal of deep learning. The gradient is written $\nabla{J}$, the vector of partial derivatives w.r.t. all parameters in $w$. Since the gradient gives the steepest ascent, we update $w$ with its negative: $$ \nabla{J} = [\frac{\partial{J}}{\partial{w_0}}, \frac{\partial{J}}{\partial{w_1}} ...] \ w = w - \lambda\nabla{J} $$ For an individual partial derivative, $$ \frac{\partial{J}}{\partial{w_i}} = \frac{\partial{J}}{\partial{\hat{y}}} \frac{\partial{\hat{y}}}{\partial{z}} \frac{\partial{z}}{\partial{w_i}} \ \frac{\partial{J}}{\partial{\hat{y}}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \ \frac{\partial{\hat{y}}}{\partial{z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} (1-\frac{1}{1 + e^{-z}}) = \hat{y}(1-\hat{y}) \ \frac{\partial{z}}{\partial{w_i}} = x_i \ => \ \frac{\partial{J}}{\partial{w_i}} = (\hat{y} - y)x_i, \quad \frac{\partial{J}}{\partial{b}} = (\hat{y} - y) $$ So from the single input $x$, we update $w$ and $b$ with $$ w = w - \lambda\nabla_w{J} = w - \lambda(\hat{y} - y)x \ b = b - \lambda\nabla_b{J} = b - \lambda(\hat{y} - y) $$

Now if we have a batch of inputs $x^{(1)} ... x^{(m)}$: $$ J = -\frac{1}{m} \sum_{i=1}^{m} (y^{(i)} log(\hat{y}^{(i)}) + (1-y^{(i)})log(1-\hat{y}^{(i)})) \ w = w - \frac{\lambda}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})x^{(i)} \ b = b - \frac{\lambda}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) $$
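
To make this concrete, here is a minimal NumPy sketch of one batch gradient-descent step for a single sigmoid neuron. The shapes, toy data, and learning rate lr are illustrative assumptions, not from the original notes.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy batch: m = 4 samples, n = 3 features (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # (m, n)
y = np.array([0., 1., 1., 0.])       # (m,)

w = np.zeros(3)                      # (n,)
b = 0.0
lr = 0.1                             # learning rate (lambda in the text)

# Forward pass
y_hat = sigmoid(X @ w + b)           # (m,)

# Cross-entropy cost, averaged over the batch
J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Gradients: dJ/dw = (1/m) X^T (y_hat - y), dJ/db = mean(y_hat - y)
dw = X.T @ (y_hat - y) / len(y)
db = np.mean(y_hat - y)

# Parameter update
w -= lr * dw
b -= lr * db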

Neural Network and Back Propagation

Now, to be consistent with mainstream notations, we represent the output of each node as $a$, instead of $\hat{y}$.

Imagine we have two layers: an input layer and an output layer. To update all params $w$ to yield a better final output, we first pass inputs $x^{(1)} ... x^{(m)}$ through the network and get outputs $\hat{y}^{(1)} ... \hat{y}^{(m)}$. The output of node $j$ in layer $L$ is $a^{L}_{j}$, where $L$ is the layer number.

When L is An Output Layer

$$ J = -\frac{1}{m} \sum_q^Q(y_q log(a^L_q) + (1-y_q)log(1-a^L_q)) \\ \frac{\partial{J}}{\partial{a^L_q}} = -\frac{1}{m} \frac{y_q-a^L_q}{a^L_q(1-a^L_q)} $$

When L is A Hidden Layer

For node $a^{L}_{j}$, we assume all nodes in layer $L-1$ are connected to $a^{L}_{j}$ (fully connected); the outputs from layer $L-1$ are written as the vector $a^{L-1}$. $w^{L}_{j}$ is the node's parameter vector and $z^{L}_{j}$ is its scalar intermediate output.

$$ w = w - \lambda\sum_{m}^{M}\nabla{J^{(m)}} \ \nabla{J^{(m)}} = \frac{\partial{J}}{\partial{w^L_j}} = \frac{\partial{J}}{\partial{a^L_j}} \cdot \frac{\partial{a^L_j}}{\partial{z^L_j}} \cdot \frac{\partial{z^L_j}}{\partial{w^L_j}} \ $$ For a specific node $j$ in layer $L$: $$ \frac{\partial{z^{L}_{j}}}{\partial{w^{L}_{j}}} = [a^{L-1}_0, a^{L-1}_1 ... ] = a^{L-1} \ \frac{\partial{a^{L}_j}}{\partial{z^{L}_j}} = \dot{\sigma}(z^{L}_j) $$

The node is connected to nodes 0...Q in layer L+1, so the partial derivative needs to consider all of these influenced paths. With scalar output $a^{L}_{j}$ and scalar derivative $\frac{\partial{J}}{\partial{a^{L}_{j}}}$:

$$ \frac{\partial{J}}{\partial{a^{L}_{j}}} = \sum_{q}^{Q} \frac{\partial{J}}{\partial{a^{L+1}_q}} \frac{\partial{a^{L+1}_q}}{\partial{z^{L+1}_q}} \frac{\partial{z^{L+1}_q}}{\partial{a^{L}_j}} $$

To further reduce $\frac{\partial{J}}{\partial{a^{L}_j}}$, with scalar derivative $\dot{\sigma}(z^{L+1}_q)$, node $(L+1,q)$'s parameter vector $w^{L+1}_{q}$, and its $j$-th parameter $w^{L+1}_{q, j}$: $$ z^{L+1}_q = w^{L+1}_q\cdot a^{L} => \frac{\partial{z^{L+1}_q}}{\partial{a^{L}_j}} = w^{L+1}_{q, j} \ a^{L+1}_q = \sigma(z^{L+1}_q) => \frac{\partial{a^{L+1}_q}}{\partial{z^{L+1}_q}} = \dot{\sigma}(z^{L+1}_q) \ => \frac{\partial{J}}{\partial{a^{L}_j}} = \sum_q^Q\frac{\partial{J}}{\partial{a^{L+1}_q}} \dot{\sigma}(z^{L+1}_q) w^{L+1}_{q,j} $$

Summary

That was a lot of chain-rule detail. In summary, during back propagation, assume we have a batch size of m, input size n, and output layer size p.

  1. Start from the output layer and back track

    • For context, compute $\frac{\partial{J}}{\partial{a^{L}_q}}$, given J being the binary cross entropy: $$ \frac{\partial{J}}{\partial{a^L_q}} = \frac{a^L_q-y_q}{a^L_q(1-a^L_q)} \text{, if L is the output layer} $$
    • Then, for all nodes in the layer, and for all batches: $$ \frac{\partial{J}}{\partial{a^L}} = \frac{a^L-y}{a^L(1-a^L)} \ \frac{\partial{a^L}}{\partial{z^L}} = \dot{\sigma}(z^{L}) \ \frac{\partial{J}}{\partial{z^{L}}} = \frac{\partial{J}}{\partial{a^L}} \frac{\partial{a^L}}{\partial{z^L}} $$
      • $y, a, z$ are mxp matrices.
  2. For non-output layers,

    • For context, each individual neuron has $$ \frac{\partial{J}}{\partial{a^{L}_j}} = \sum_q^Q\frac{\partial{J}}{\partial{z^{L+1}_q}} \frac{\partial{z^{L+1}_q}}{\partial{a^{L}_j}} = \sum_q^Q\frac{\partial{J}}{\partial{z^{L+1}_q}} w^{L+1}_{q,j} \text{, if L is not an output layer} \ \frac{\partial{J}}{\partial{z^{L}_j}} = \frac{\partial{J}}{\partial{a^{L}_j}} \frac{\partial{a^{L}_j}}{\partial{z^{L}_j}} = \frac{\partial{J}}{\partial{a^{L}_j}} \dot{\sigma}(z^{L}_j) $$

    • Then, for all nodes in the layer, and for all batches: $$ \frac{\partial{J}}{\partial{a^{L}}} = (W^{L+1})^T \frac{\partial{J}}{\partial{z^{L+1}}} \ \frac{\partial{J}}{\partial{z^{L}}} = \frac{\partial{J}}{\partial{a^{L}}} \frac{\partial{a^{L}}}{\partial{z^{L}}} = \frac{\partial{J}}{\partial{a^{L}}} \circ \dot{\sigma}(z^{L}) $$

  3. Finally, during the update step:

    • For context, each individual neuron has $$ \frac{\partial{J}}{\partial{w^L_j}} = \frac{\partial{J}}{\partial{a^L_j}} \cdot \frac{\partial{a^L_j}}{\partial{z^L_j}} \cdot \frac{\partial{z^L_j}}{\partial{w^L_j}} \ w^L_j = w^L_j - \frac{\lambda}{m}\sum_{m}^{M} \frac{\partial{J}}{\partial{w^L_j}} \ b^L_j = b^L_j - \frac{\lambda}{m}\sum_{m}^{M} \frac{\partial{J}}{\partial{z^L_j}} $$
    • Then, for all nodes in the layer, and for all batches: $$ \frac{\partial{J}}{\partial{w^L}} = \frac{\partial{J}}{\partial{z^{L}}} \frac{\partial{z^{L}}}{\partial{w^L}} = \frac{\partial{J}}{\partial{z^{L}}} (a^{L-1})^T \ w^L = w^L - \frac{\lambda}{m}\sum_{m}^{M} \frac{\partial{J}}{\partial{w^L}} \ b^L = b^L - \frac{\lambda}{m}\sum_{m}^{M} \frac{\partial{J}}{\partial{z^L}} $$
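
As a sanity check on the summary above, here is a minimal NumPy sketch of one forward/backward pass for a 2-layer sigmoid network with binary cross-entropy. The layer sizes, toy data, and learning rate are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, h, p = 4, 3, 5, 1                  # batch size, input size, hidden size, output size (assumed)
X = rng.normal(size=(m, n))
Y = rng.integers(0, 2, size=(m, p)).astype(float)

W1, b1 = rng.normal(scale=0.1, size=(n, h)), np.zeros(h)
W2, b2 = rng.normal(scale=0.1, size=(h, p)), np.zeros(p)
lr = 0.1

# Forward pass
Z1 = X @ W1 + b1;  A1 = sigmoid(Z1)      # hidden layer
Z2 = A1 @ W2 + b2; A2 = sigmoid(Z2)      # output layer

# Backward pass (batch-first convention: activations are (m, layer_size))
dZ2 = (A2 - Y) / m                       # dJ/dZ2 for binary cross-entropy + sigmoid
dW2 = A1.T @ dZ2                         # dJ/dW2
db2 = dZ2.sum(axis=0)
dA1 = dZ2 @ W2.T                         # propagate back through W2
dZ1 = dA1 * A1 * (1 - A1)                # elementwise (Hadamard) product with sigmoid'(Z1)
dW1 = X.T @ dZ1
db1 = dZ1.sum(axis=0)

# Gradient descent update (the 1/m averaging is already folded into dZ2)
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1

Note that this sketch uses a batch-first (row-major) layout, so the transposes sit on the other side compared with the column-vector formulas above; the math is the same.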

Supplementary Math

  1. Hadamard (Schur) Product is the elementwise product: $$ A \circ B = [A_1B_1, A_2B_2 ...] $$

========================================================================

Important Methods

========================================================================

Minibatch vs Batches

A huge batch can take a long time to evaluate for a single gradient-descent step. Instead, if the data is independent enough, we can break the training data into smaller batches, so that each update from a mini-batch helps the parameters converge faster (mini-batch gradient descent).

Another alternative is a batch size of one (stochastic gradient descent, SGD). SGD is noisier because a single sample's gradient can be far from the true gradient, but each update is faster. Mini-batches are commonly used as a middle ground for more stability; typical mini-batch sizes are between 64 and 512.
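
A minimal sketch of how one epoch of mini-batch updates might be organized. The shuffle step and the batch size of 64 are conventional choices, not prescribed by the notes above.

import numpy as np

def minibatch_indices(num_samples, batch_size=64, rng=None):
    """Yield index arrays that partition a shuffled dataset into mini-batches."""
    rng = rng if rng is not None else np.random.default_rng()
    order = rng.permutation(num_samples)          # shuffle once per epoch
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]     # the last batch may be smaller

# Usage: one epoch over toy data
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, size=1000)
for idx in minibatch_indices(len(X), batch_size=64):
    X_batch, y_batch = X[idx], y[idx]
    # ... compute gradients on (X_batch, y_batch) and update parameters ...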

Cost Functions:

  1. Mean Squared Error $\frac{1}{n} \sum(y_i-\hat{y}_i)^2$

    • Disadvantages:
      • Sensitive to outliers with larger errors due to the squaring (especially compared to MAE)
      • If errors are non-gaussian, this is probably not robust either.
  2. Mean Absolute Error $\frac{1}{n} \sum|y_i-\hat{y}_i|$

    • Advantages:
      • Less sensitive to outliers
    • Disadvantages:
      • Not differentiable at 0, which can be problematic near the minimum, where errors (and hence the useful gradient signal) are close to zero
  3. Cross Entropy Loss (Log Loss): $-\frac{1}{n} \sum (y_ilog(\hat{y}_i) + (1-y_i)log(1-\hat{y}_i))$

    • Advantages:
      • Good for classification problems.
      • Suitable for probabilistic interpretation, and penalizes confidently wrong predictions heavily
    • Disadvantages:
      • Requires $\hat{y}_i \in (0, 1)$; the closer a wrong prediction is to the bounds 0 or 1, the larger the loss becomes, which can cause numerical issues
  4. Hinge Loss: $\frac{1}{n} \sum max(0, 1-y_i\hat{y}_i)$

    • Mostly used in SVM training; does not work well with probabilistic estimation
  5. Huber Loss $$ \frac{1}{2} (y_i - \hat{y}_i)^2 \text{ for } |y_i - \hat{y}_i| < \delta, \quad \delta |y_i - \hat{y}_i| - \frac{1}{2} \delta^2 \text{ otherwise} $$

    • Advantages:
      • It's continuous and differentiable
      • Combines MAE (for large errors) and MSE (for small errors)
    • Disadvantages:
      • Requires tuning $\delta$
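
A minimal NumPy sketch of the losses above, vectorized over a batch. The epsilon clipping in the cross-entropy is an assumption added for numerical safety.

import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)          # keep log() finite near the bounds
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def hinge(y, y_hat):
    # y is expected to be in {-1, +1} for hinge loss
    return np.mean(np.maximum(0.0, 1.0 - y * y_hat))

def huber(y, y_hat, delta=1.0):
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err ** 2
    linear = delta * err - 0.5 * delta ** 2
    return np.mean(np.where(err < delta, quadratic, linear))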

Activation Functions

Early papers found that ReLU usually trains faster than sigmoid, because of its larger derivatives and its non-zero derivative in the positive region.

However, modern tricks like batch normalization have narrowed this gap. Common activation functions:

  1. ReLu: y = max(0, x). At x = 0 it's technically non-differentiable, but we can simply pick a subgradient: y' = 0 or 1.

    ReLu
    • Advantages:
      • Helps avoid diminishing derivatives, because positive values have a derivative of 1.
      • Computational efficiency: easy to compute.
      • It outputs zero on negative values, which reduces computation in both the forward and backward passes.
      • Because fewer neurons have non-zero outputs, the system is less "sensitive" to small changes in inputs, making it a form of regularization. (See the regularization section.)
    • Disadvantages:
      • Exploding gradients: because the output is unbounded, activations (and hence gradients) can grow large.
      • Dead neurons: neurons may never be activated if their weights keep the pre-activation negative.
      • Unbounded outputs
  2. tanh: $\frac{2}{1+e^{-2x}}-1$

    tanh
    • Advantages:
      • tanh outputs values in (-1, 1), so if we want negative values in the output, go for tanh
      • compared to sigmoid, its derivative in [-1, 1] has a larger range, which could help with the learning process.
    • Disadvantages:
      • For large and small values, gradients are zero (vanishing gradient problem)
      • Could be moderately expensive to train (with exponentials)
  3. sigmoid

    sigmoid
    • Advantages:
      • Outputs lie in (0, 1), so they can be read as probabilities (useful for output layers)
    • Disadvantages:
      • For large and small values, gradients are near zero (vanishing gradient problem)
      • The maximum of its derivative is 0.25. This is a big disadvantage in today's deeper networks, where the gradient diminishes quickly across layers
      • Could be moderately expensive to compute (exponentials)
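
A minimal NumPy sketch of the three activations above and their derivatives, as used during backpropagation. The function names are my own.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient: 0 for z < 0, 1 for z > 0 (we arbitrarily use 0 at z == 0)
    return (z > 0).astype(float)

def tanh(z):
    return np.tanh(z)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2          # derivative range (0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # maximum value is 0.25, at z = 0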

Exploding Derivatives && Vanishing Gradients

  • When you have VERY DEEP NETWORKS, gradients can grow exponentially large or small, especially when weights are initialized poorly or learning rates are too high. Mitigations:

    • Initialization Alternatives: Xavier and He.
      • Xavier initialization: for sigmoid or tanh
      • He initialization: this preserves a variance of $2/n$ in its weights, which is empirically good for ReLu
    • Gradient Clipping
    • Normalization Layers: use batch normalization or layer normalization
      • layer normalization:
      • Batch Normalization: TODO
  • A really helpful way to debug a network is gradient checking. Use it ONLY TO CHECK, not during training!

    1. Do a forward and backward pass. Get the analytic gradient $d = \frac{\partial J}{\partial w}$ and the total cost $J$.
    2. Perturb one parameter $W^L_i$ by $\epsilon$, do a forward pass, and get the new cost $J'$.
    3. Calculate the numerical gradient $d' = (J'-J)/ \epsilon$ and the relative error $$ \frac{|d-d'|}{|d| + |d'|} $$
      • If the relative error is $ < 10^{-7}$, great! If it is larger than $10^{-3}$, something is wrong.
    4. Note:
      • Run again after some training, because initialization might randomly yield good results there
      • Need to add regularization terms here
      • Doesn't work with dropout
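
A minimal sketch of the gradient check described above. The two-sided difference and the 1e-7 epsilon are common conventions; adapt the cost function to your own network.

import numpy as np

def gradient_check(cost_fn, w, analytic_grad, eps=1e-7):
    """Compare an analytic gradient against a numerical one, parameter by parameter."""
    w = w.astype(float)
    numeric_grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps                     # perturb one parameter at a time
        w_minus.flat[i] -= eps
        numeric_grad.flat[i] = (cost_fn(w_plus) - cost_fn(w_minus)) / (2 * eps)
    # Relative error, as in the formula above
    rel_err = (np.linalg.norm(analytic_grad - numeric_grad)
               / (np.linalg.norm(analytic_grad) + np.linalg.norm(numeric_grad) + 1e-12))
    return rel_err   # < 1e-7 is great, > 1e-3 indicates a bug

# Usage on a toy quadratic cost J(w) = sum(w^2), whose gradient is 2w
w = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda v: np.sum(v ** 2), w, analytic_grad=2 * w))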

Overfitting - Regularization

  • Capacity: the ability to fit a wide variety of functions. Models with larger capacity can fit more complex patterns, but are also more prone to overfitting.

  • Technique 1: Regularization reduces overfitting by penalizing the "complexity" of the model. Common methods include:

    • L1 and L2 regularization:
      • L1 encourages sparsity: NOT SUPER COMMON
      • L2 penalizes large weights: TODO
    • Dropout: force a fraction of neurons to zero during each training iteration, which encourages redundant representations (see the sketch after this list)
      • At each training iteration, randomly select which neurons to keep. Say you want 80% to be kept: the model then cannot rely on any particular feature; that "focus" gets shuffled around, which spreads out the weights
      • VERY IMPORTANT: computer vision uses this a lot, because with so many pixels, relying on any single pixel would be overfitting
    • As a result, there are effectively fewer neurons in the model, which makes the decision boundary simpler and more linear
  • Note: when the activation is tanh and w is small, the intermediate output z of the neuron is small. tanh is roughly linear near zero, so the model behaves more like a perceptron network, which learns linear decision boundaries. (Hence, the model is less likely to overfit.)

  • Technique 2 is data augmentation, e.g., adding a reflection of a cat image, random distortions, and other additions.

  • Technique 3: early stopping - stop training based on validation on the dev set. If you notice the dev error coming back up, keep the best model found so far.
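
A minimal sketch of inverted dropout during training, assuming a keep probability of 0.8 as in the example above. The scaling by keep_prob is what lets you skip rescaling at test time.

import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True, rng=None):
    """Apply inverted dropout to an activation matrix a of shape (batch, units)."""
    if not training:
        return a                                   # no dropout at test time
    rng = rng if rng is not None else np.random.default_rng()
    mask = (rng.random(a.shape) < keep_prob)       # keep ~80% of the neurons
    return a * mask / keep_prob                    # rescale so the expected value is unchanged

# Usage
a = np.ones((2, 5))
print(dropout_forward(a, keep_prob=0.8))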

Training Set Up

  • Normalize data: shift the mean to zero, and usually also scale to unit variance. If your input features come from very different sources or ranges, min-max scaling can also be helpful.
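
A minimal sketch of zero-mean/unit-variance standardization and min-max scaling. As an assumed convention, the statistics are computed on the training set only and reused for dev/test data.

import numpy as np

def standardize(X, mean=None, std=None):
    """Zero-mean, unit-variance scaling; pass train-set statistics for dev/test data."""
    mean = X.mean(axis=0) if mean is None else mean
    std = X.std(axis=0) if std is None else std
    return (X - mean) / (std + 1e-8), mean, std

def min_max_scale(X, lo=None, hi=None):
    """Scale each feature into [0, 1]."""
    lo = X.min(axis=0) if lo is None else lo
    hi = X.max(axis=0) if hi is None else hi
    return (X - lo) / (hi - lo + 1e-8), lo, hi

X_train = np.random.randn(100, 3) * [1, 10, 100]   # features with very different scales
X_std, mu, sigma = standardize(X_train)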

========================================================================

Convolutional Neural Network

========================================================================

  1. Filters (aka kernels): "pattern detectors". Each filter is a small matrix that you slide across an image, multiplying and summing pixel values (convolution). Filters can detect edges, corners, and, in deeper layers, parts of dogs.

    Filter
  2. Difference with convolution in signal processing: in signal processing, we need to reverse the filter horizontally and vertically. Why? $$ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} => \begin{bmatrix} 4 & 3 \\ 2 & 1 \end{bmatrix} $$

  • In 1D signal processing, a causal system might have impulse response [1, 2, 3]. If we feed in an impulse [1, 0, 0] at times [t1, t2, t3], we expect the output [1, 2, 3]. To compute that with a sliding window, the window must be the time-reversed impulse response, so that at t1 we get 1, at t2 we get 2, and at t3 we get 3. The flip preserves causality, i.e., "the correctness of the sequence of outputs".

  • For feature detection, we don't need to flip the kernel. Suppose we want to detect a downward 1D slope with a kernel like [1, -1], and the signal is the slope [3, 2, 1, 0]. Sliding the kernel as-is gives positive responses; if we flipped it first (true convolution), we would get negative responses instead. So CNNs don't flip the kernel; the operation is really "cross-correlation".

  • Padding:

    • Why padding?
      • Keeps information at the borders, which would otherwise be washed away too quickly
      • Allows us to design deeper neural networks; otherwise the feature maps keep shrinking
    • The image above shows zero padding. There are three common padding modes:
      • 'valid' : no padding at all.
        [1, 2, 3]
        [1, 1, 1]
        => [6]
        
      • 'same': output is the same as the input
        [0, 1, 2, 3, 0]
        [1, 1, 1]
            [1, 1, 1]
                [1, 1, 1]
        => [3, 6, 5]
        
      • 'full': pads even more zeros, so that each input pixel touches the kernel the same number of times
        [0, 0, 1, 2, 3, 0, 0]
        [1, 1, 1]
           [1, 1, 1]
              [1, 1, 1]
                 [1, 1, 1]
                    [1, 1, 1]
        => [1, 3, 6, 5, 3]
        
    • Output size: $\lfloor \frac{n + 2p - k}{s} \rfloor + 1$, where n is the input size, k is the kernel size, p is the padding, and s is the stride.
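
The three padding modes above map directly onto NumPy's 1D convolution modes. Note that np.convolve flips the kernel (true convolution); with the symmetric kernel [1, 1, 1] used here the result matches cross-correlation.

import numpy as np

x = np.array([1, 2, 3])
k = np.array([1, 1, 1])

print(np.convolve(x, k, mode="valid"))   # [6]             -> no padding
print(np.convolve(x, k, mode="same"))    # [3, 6, 5]       -> output size equals input size
print(np.convolve(x, k, mode="full"))    # [1, 3, 6, 5, 3] -> maximal zero padding
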
  1. Max pooling: take the max within a window to retain the most salient feature, then move the window by the stride after each operation. Given
    1  3  2  4
    5  6  7  8
    9  2  3  1
    4  0  1  5
    

For a 2x2 window, with stride = 2, we get 6, 8, 9, 5
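
A minimal NumPy sketch of 2x2 max pooling with stride 2, reproducing the example above.

import numpy as np

def max_pool2d(x, window=2, stride=2):
    """Naive 2D max pooling over a single-channel feature map."""
    out_h = (x.shape[0] - window) // stride + 1
    out_w = (x.shape[1] - window) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + window, j * stride:j * stride + window]
            out[i, j] = patch.max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [9, 2, 3, 1],
              [4, 0, 1, 5]])
print(max_pool2d(x))   # [[6, 8], [9, 5]]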

  1. Convolutional neural nets were invented by Yann LeCun around 1990. Each convolution layer has multiple filters. For an RGB input there are 3 input channels; if the layer produces 4 output channels, each output channel is a feature map. Since each output channel needs one kernel slice per input channel, there are 3 x 4 = 12 kernel slices in total.

    • Example: say we have a 32x32x1 grayscale image:
      1. Convolutional layer with 2 3x3 filters, stride = 1, no padding -> 30x30x2 output, where each output channel is a feature map -> activation function -> 2x2 pooling layer with stride = 2 -> output of 15x15x2.
      2. Convolutional layer with 4 3x3x2 filters, stride = 1, padding = 1 (so input and output sizes are the same). Each filter convolves with both input channels and sums the results, so you get 15x15x4 -> activation function.
      3. Flattening layer: 15x15x4 -> 900x1 vector. Note that all channels are flattened into one.
      4. Fully connected layer.
  2. How does back propagation work? I got an example from the YouTube channel far1din; code: https://github.com/TheIndependentCode/Neural-Network. (A SciPy sketch of the resulting gradient formulas appears after this list.)

    • If we expand: $$ y_{11} = x_{11}k_{11} + x_{12}k_{12} + x_{13}k_{13} + ... \ y_{12} = x_{12}k_{11} + x_{13}k_{12} + x_{14}k_{13} + ... \ y_{21} = x_{21}k_{11} + x_{22}k_{12} + x_{23}k_{13} + ... $$

    • We need to get $\frac{\partial J}{\partial K}$, $\frac{\partial J}{\partial b}$, and $\frac{\partial J}{\partial X}$. For example: $$ \frac{\partial J}{\partial x_{11}} = \frac{\partial J}{\partial y_{11}}k_{11} \ \frac{\partial J}{\partial x_{12}} = \frac{\partial J}{\partial y_{11}}k_{12} + \frac{\partial J}{\partial y_{12}}k_{11} $$

    • This is equivalent to summing copies of the kernel $K$, shifted to each output position and weighted by the corresponding element of $\frac{\partial J}{\partial y}$.

    • For the kernel gradient $\frac{\partial J}{\partial K}$, it's actually cross-correlation: $x \ast \frac{\partial J}{\partial y}$ $$ \frac{\partial J}{\partial k_{11}} = \frac{\partial J}{\partial y_{11}}x_{11} + \frac{\partial J}{\partial y_{12}}x_{12} + \frac{\partial J}{\partial y_{21}}x_{21} + \frac{\partial J}{\partial y_{22}}x_{22} \ \frac{\partial J}{\partial k_{12}} = \frac{\partial J}{\partial y_{11}}x_{12} + \frac{\partial J}{\partial y_{12}}x_{13} + \frac{\partial J}{\partial y_{21}}x_{22} + \frac{\partial J}{\partial y_{22}}x_{23} $$

    • For the bias gradient: there is only 1 bias value per output channel, applied to every element of that channel. Its gradient is therefore the sum of that channel's output gradient over all its elements: $\frac{\partial J}{\partial b} = \sum_{i,j} \frac{\partial J}{\partial y_{ij}}$

    • For the input gradient $\frac{\partial J}{\partial X}$, it's actually a full convolution: $k \circledast \frac{\partial J}{\partial y}$ $$ \frac{\partial J}{\partial x_{11}} = \frac{\partial J}{\partial y_{11}}k_{11} \ \frac{\partial J}{\partial x_{12}} = \frac{\partial J}{\partial y_{11}}k_{12} + \frac{\partial J}{\partial y_{12}}k_{11} \ \frac{\partial J}{\partial x_{13}} = \frac{\partial J}{\partial y_{11}}k_{13} + \frac{\partial J}{\partial y_{12}}k_{12} + \frac{\partial J}{\partial y_{13}}k_{11} \ ... $$

      • See? This is convolution!
  3. How do we handle batches? The output is a 4D array indexed as (batch, output channel, x, y).
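
A minimal sketch of these gradients for a single-channel 2D layer, using SciPy's correlate2d/convolve2d. The shapes and random data are illustrative; this mirrors the formulas above rather than the referenced repository's code.

import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))                    # input feature map
K = rng.normal(size=(2, 2))                    # kernel
b = 0.5

# Forward pass: 'valid' cross-correlation, as in a CNN layer
Y = correlate2d(X, K, mode="valid") + b        # shape (3, 3)

dY = rng.normal(size=Y.shape)                  # pretend this is dJ/dY from the next layer

# Kernel gradient: cross-correlation of the input with the output gradient
dK = correlate2d(X, dY, mode="valid")          # shape (2, 2), same as K

# Bias gradient: sum of the output gradient (one bias per output channel)
db = dY.sum()

# Input gradient: FULL convolution of the output gradient with the kernel
dX = convolve2d(dY, K, mode="full")            # shape (4, 4), same as X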

Implementation

A small input to a CNN has 4 dimensions: [Batch dimension, input channel, Image Height, Image Width]

import numpy as np
import torch

a = np.array(
    [
        # Batch dimension
        [
            # Input channel
            [
                [-1, 2, 3],
                [-4, 5, 6],
                [-7, 8, 9]
            ]
        ]
    ], dtype=np.float32
)
# We need to convert this into a torch tensor.
# Then, we enable grad so that the tensor will start accumulating its gradients.
# Quirk 1: from_numpy itself does NOT create a copy of a (it shares memory).
x = torch.from_numpy(a).requires_grad_(True)
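
Continuing this example, a minimal sketch of running such a tensor through a convolution layer and inspecting the gradients that autograd accumulates. The 2x2 kernel size and the sum-as-loss are arbitrary choices for illustration.

import numpy as np
import torch

a = np.arange(9, dtype=np.float32).reshape(1, 1, 3, 3)   # (batch, channel, H, W)
x = torch.from_numpy(a).requires_grad_(True)

conv = torch.nn.Conv2d(in_channels=1, out_channels=2, kernel_size=2)  # output: (1, 2, 2, 2)
y = conv(x)

# Backward pass from a scalar loss; gradients land in .grad of every leaf tensor
y.sum().backward()
print(x.grad.shape)            # torch.Size([1, 1, 3, 3]), i.e. dJ/dX
print(conv.weight.grad.shape)  # torch.Size([2, 1, 2, 2]), i.e. dJ/dK
print(conv.bias.grad)          # dJ/db: one value per output channel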

========================================================================

Alex Net

========================================================================

Impl

Architecture

  1. Convolution Layer
  2. Relu:
    • Introduces non-linearity into the learned features
    • Feature scaling: negative values do not pass through, which zeroes out many activations and reduces multiplications in both the forward and backward passes