04 Forward and Back Propagation - chanchishing/Introduction-to-Deep-Learning GitHub Wiki

Derivative Chain Rule Quick Review

The chain rule tells us how to find the derivative of a composite function. If $h(x)=g(f(x))$,

$$ h'(x)=g'\left(f(x)\right) \cdot f'(x) $$

or

$$ \dfrac {d} {dx}~h(x) = \dfrac{d}{dx}~g\left(f(x)\right) = \dfrac{dg}{df} \cdot \dfrac{df}{dx} $$

Intuitive Explanation

  • If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 × 4 = 8 times as fast as the man.
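The chain rule is easy to check numerically. Below is a minimal sketch; the functions $f$, $g$ and the test point are illustrative choices, not from the text. It compares the analytic product $g'(f(x)) \cdot f'(x)$ against a central finite difference of $h$:

```python
# Illustrative composite: h(x) = g(f(x)) with f(x) = x^2 and g(u) = 3u + 1.
def f(x): return x ** 2
def g(u): return 3 * u + 1
def h(x): return g(f(x))

x = 2.0
analytic = 3 * (2 * x)          # g'(f(x)) * f'(x) = 3 * 2x

eps = 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)  # central finite difference
```

Both values agree to floating-point precision, which is exactly what the chain rule predicts.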

Applying Chain Rule to a single Neuron

The loss depends on the weight through several intermediate steps; we need to find the $w$ that minimizes $L$:

$$ \begin{alignat*}{3} L &= L(a) &&\\ a &= \sigma(z) &&\\ z &= wx + b && \end{alignat*} $$

Chain Rule helps in tracing how changes in $w$ propagate to $L$:

$$ \dfrac{\partial L} {\partial w}= \dfrac{\partial L} {\partial a} \cdot \dfrac{\partial a} {\partial z} \cdot \dfrac{\partial z} {\partial w} $$

For example, for a neuron with input $x$, weight $w$, sigmoid activation, and Binary Cross Entropy loss:

$$ \begin{alignat*}{3} z &= wx + b && \Rightarrow \dfrac{\partial z} {\partial w}=x\\ a &= \sigma(z) = \dfrac{1}{1+e^{-z}} && \Rightarrow \dfrac{\partial a} {\partial z}=\sigma(z)(1-\sigma(z))=a(1-a)\\ L &= -[y~\log(a)+(1-y)~\log(1-a)] &&\Rightarrow \dfrac{\partial L} {\partial a}=-\dfrac{y}{a} + \dfrac{(1-y)}{(1-a)} \end{alignat*} $$

  • Derivation of $\dfrac{\partial a} {\partial z}$ see here

  • Derivation of $\dfrac{\partial L} {\partial a}$ see here

$$ \begin{alignat*}{3} \therefore \dfrac{\partial L} {\partial w} &= \dfrac{\partial L} {\partial a} \cdot \dfrac{\partial a} {\partial z} \cdot \dfrac{\partial z} {\partial w} &&\\ &=\left(-\dfrac{y}{a} + \dfrac{(1-y)}{(1-a)} \right) \cdot a(1-a) \cdot x &&\\ &=(a-y) \cdot x && \end{alignat*} $$

  • Forward Propagation: Calculate $z$, then $a$, then Loss $L$.
  • Back Propagation: Calculate the gradient from $a$ and truth label $y$, $\frac{\partial L}{\partial w} = (a-y)x$.
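The two steps above can be sketched in a few lines. The concrete values of $x$, $w$, $b$, $y$ below are assumptions for illustration; a finite-difference check confirms $\frac{\partial L}{\partial w} = (a-y)x$:

```python
import numpy as np

# Illustrative values (not from the text)
x, w, b, y = 1.5, 0.8, -0.2, 1.0

# Forward propagation: z, then a, then the loss L
z = w * x + b
a = 1.0 / (1.0 + np.exp(-z))                      # sigmoid activation
L = -(y * np.log(a) + (1 - y) * np.log(1 - a))    # binary cross entropy

# Back propagation: the three chain-rule factors collapse to (a - y) * x
dL_dw = (a - y) * x

# Finite-difference check of dL/dw
def loss(w_):
    a_ = 1.0 / (1.0 + np.exp(-(w_ * x + b)))
    return -(y * np.log(a_) + (1 - y) * np.log(1 - a_))

eps = 1e-6
dL_dw_num = (loss(w + eps) - loss(w - eps)) / (2 * eps)
```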

2-Layer Neural Network: Forward & Backward Propagation

This section details the math for a simple 2-layer Neural Network used for Binary Classification.

  • Layer 1 (Hidden): Uses ReLU activation.
  • Layer 2 (Output): Uses Sigmoid activation.
  • Loss: Binary Cross Entropy.

1. The Forward Pass (Prediction)

We compute the activations layer by layer.

Layer 1 (Hidden):

$$ \begin{aligned} \mathbf{z}^{(1)} &= W^{(1)}\mathbf{x} + \mathbf{b}^{(1)} \\ \mathbf{a}^{(1)} &= \text{ReLU}(\mathbf{z}^{(1)}) \end{aligned} $$

Layer 2 (Output):

$$ \begin{aligned} \mathbf{z}^{(2)} &= W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)} \\ \widehat{y} = \mathbf{a}^{(2)} &= \sigma(\mathbf{z}^{(2)}) \end{aligned} $$

Loss Calculation:

$$ L = -\left[ y \log(\widehat{y}) + (1-y) \log(1-\widehat{y}) \right] $$
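The forward pass can be sketched as follows. The layer sizes (3 inputs, 4 hidden units, 1 output) and the random weights are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 1          # assumed sizes

x = rng.normal(size=(n_in, 1))
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros((n_hidden, 1))
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros((n_out, 1))
y = 1.0

# Layer 1 (hidden): linear step, then ReLU
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)

# Layer 2 (output): linear step, then sigmoid
z2 = W2 @ a1 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))

# Binary cross-entropy loss
L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```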


2. The Backward Pass (Gradient Calculation)

We calculate gradients starting from the Loss and moving backward to the input. We use the Chain Rule recursively.

Step 1: Gradient at Output (Layer 2 Error)

We need the derivative of Loss w.r.t the raw output score $\mathbf{z}^{(2)}$. We apply the Chain Rule:

$$ \delta^{(2)} = \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} = \dfrac{\partial L}{\partial \widehat{y}} \cdot \dfrac{\partial \widehat{y}}{\partial \mathbf{z}^{(2)}} $$

To solve this, we derive the two parts separately and multiply them:

1. Derivative of BCE Loss w.r.t Prediction ($\widehat{y}$): From the definition of Binary Cross Entropy:

$$ \dfrac{\partial L}{\partial \widehat{y}} = -\dfrac{y}{\widehat{y}} + \dfrac{1-y}{1-\widehat{y}} = \dfrac{\widehat{y} - y}{\widehat{y}(1-\widehat{y})} $$

2. Derivative of Sigmoid Activation w.r.t Logit ($z$): From the definition of Sigmoid:

$$ \dfrac{\partial \widehat{y}}{\partial \mathbf{z}^{(2)}} = \sigma'(\mathbf{z}^{(2)}) = \widehat{y}(1-\widehat{y}) $$

3. Combine them (The Cancellation): Multiplying the two results, the denominator of the Loss gradient cancels perfectly with the Sigmoid gradient:

$$ \begin{aligned} \delta^{(2)} &= \left[ \dfrac{\widehat{y} - y}{\widehat{y}(1-\widehat{y})} \right] \cdot \left[ \widehat{y}(1-\widehat{y}) \right] \\ &= \widehat{y} - y \end{aligned} $$

Result:

$$ \delta^{(2)} = \widehat{y} - y $$
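This cancellation is easy to verify numerically. The scalar logit $z^{(2)} = 0.3$ and label $y = 1$ below are illustrative assumptions:

```python
import numpy as np

z2, y = 0.3, 1.0   # assumed scalar logit and label

def loss_of_z(z):
    a = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

y_hat = 1.0 / (1.0 + np.exp(-z2))
delta2 = y_hat - y                 # analytic result after the cancellation

# Central finite difference of L with respect to z2
eps = 1e-6
delta2_num = (loss_of_z(z2 + eps) - loss_of_z(z2 - eps)) / (2 * eps)
```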

Step 2: Gradients for Layer 2 Parameters ($W^{(2)}, b^{(2)}$)

Context: Now that we know the error at the output ($\delta^{(2)}$), we need to calculate how much each weight in Matrix $W^{(2)}$ contributed to that error, so we can update them.

We need to find $\dfrac{\partial L}{\partial W^{(2)}}$.

The Chain Rule Path: The Loss depends on $z^{(2)}$, and $z^{(2)}$ depends on the weights $W^{(2)}$:

$$ \dfrac{\partial L}{\partial W^{(2)}} = \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} \cdot \dfrac{\partial \mathbf{z}^{(2)}}{\partial W^{(2)}} $$

1. The First Term: We calculated this in Step 1.

$$ \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} = \delta^{(2)} $$

2. The Second Term: Look at the linear equation: $\mathbf{z}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}$. The derivative of $z$ with respect to the weight matrix is simply the input vector $\mathbf{a}^{(1)}$.

3. Combine and Transpose (Matrix Calculus): Mathematically, we are multiplying the error $\delta^{(2)}$ by the input $\mathbf{a}^{(1)}$. However, we must align dimensions to match the shape of the weight matrix $W^{(2)}$.

  • Goal: The gradient matrix $\frac{\partial L}{\partial W^{(2)}}$ must have the same shape as $W^{(2)}$: $(n_{out}, n_{hidden})$.
  • We have:
    • Error $\delta^{(2)}$ is a Column Vector $(n_{out}, 1)$.
    • Input $\mathbf{a}^{(1)}$ is a Column Vector $(n_{hidden}, 1)$.

To get a Matrix of size $(n_{out}, n_{hidden})$ from two column vectors, we must perform an Outer Product. This requires turning the second vector (the input) into a Row Vector by transposing it.

$$ \begin{aligned} \dfrac{\partial L}{\partial W^{(2)}} &= \delta^{(2)} \cdot (\mathbf{a}^{(1)})^T \\ (n_{out}, n_{hidden}) &= (n_{out}, 1) \cdot (1, n_{hidden}) \end{aligned} $$

Intuition: Think of it as: Gradient = Error * Input.

  • If the input $\mathbf{a}^{(1)}$ was large (strong signal), the weight contributed more to the error.
  • If the error $\delta^{(2)}$ is large, the weight needs a larger update.
  • The Outer Product calculates this "contribution" for every possible pair of (Output Node, Input Node).

4. Bias Gradient: For the bias $b^{(2)}$, the input is effectively always 1. So the gradient is simply the error itself:

$$ \dfrac{\partial L}{\partial \mathbf{b}^{(2)}} = \delta^{(2)} $$
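A minimal shape-level sketch of these two gradients, with assumed sizes $n_{out}=1$, $n_{hidden}=4$ and illustrative values:

```python
import numpy as np

delta2 = np.array([[0.25]])                   # error, shape (n_out, 1)
a1 = np.array([[0.5], [0.0], [1.2], [0.3]])   # hidden activations, (n_hidden, 1)

dW2 = delta2 @ a1.T   # outer product -> shape (n_out, n_hidden), matches W2
db2 = delta2          # bias gradient is the error itself
```

Note how a zero activation (the second entry of `a1`) produces a zero gradient for the corresponding weight: that input contributed nothing to the error.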

Step 3: Propagate Gradient to Hidden Layer (Layer 1 Error)

Context: In Step 1, we calculated the error at the output ($\delta^{(2)}$). Now, we need to send this error backwards to the hidden layer so we can update Layer 1's weights later.

We need to find $\dfrac{\partial L}{\partial \mathbf{a}^{(1)}}$.

The Chain Rule Path: The Loss $L$ depends on $z^{(2)}$, and $z^{(2)}$ depends on $\mathbf{a}^{(1)}$:

$$ \dfrac{\partial L}{\partial \mathbf{a}^{(1)}} = \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} \cdot \dfrac{\partial \mathbf{z}^{(2)}}{\partial \mathbf{a}^{(1)}} $$

1. The First Term: We already calculated this in Step 1.

$$ \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} = \delta^{(2)} $$

2. The Second Term: Look at the linear equation for Layer 2:

$$ \mathbf{z}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)} $$

To find the derivative w.r.t $\mathbf{a}^{(1)}$, we treat $W^{(2)}$ as a constant coefficient:

$$ \dfrac{\partial \mathbf{z}^{(2)}}{\partial \mathbf{a}^{(1)}} = W^{(2)} $$

3. Combine and Transpose (Matrix Calculus): Mathematically, we multiply $\delta^{(2)}$ by $W^{(2)}$. However, we must align the dimensions for the matrix multiplication to work.

  • $\delta^{(2)}$ is size $(n_{out}, 1)$.
  • $W^{(2)}$ is size $(n_{out}, n_{hidden})$.
  • We need the result $\frac{\partial L}{\partial \mathbf{a}^{(1)}}$ to match the size of $\mathbf{a}^{(1)}$, which is $(n_{hidden}, 1)$.

To achieve this shape, we must use the Transpose of the Weight matrix:

$$ \begin{aligned} \dfrac{\partial L}{\partial \mathbf{a}^{(1)}} &= (W^{(2)})^T \cdot \delta^{(2)} \\ (n_{hidden}, 1) &= (n_{hidden}, n_{out}) \cdot (n_{out}, 1) \end{aligned} $$

Intuition: The error $\delta^{(2)}$ flows backward through the weights. The transpose $(W^{(2)})^T$ effectively "distributes" the error from the output neurons back to the hidden neurons that contributed to them.
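A minimal sketch of this backward step, with assumed sizes $n_{out}=1$, $n_{hidden}=3$ and illustrative values:

```python
import numpy as np

W2 = np.array([[0.4, -0.7, 0.2]])   # shape (n_out, n_hidden)
delta2 = np.array([[0.5]])          # shape (n_out, 1)

# Transpose distributes the output error back to each hidden neuron
dL_da1 = W2.T @ delta2              # shape (n_hidden, 1)
```

Each hidden neuron receives a share of the error proportional to the weight that connected it to the output.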

Step 4: Gradient through Activation (ReLU)

Context: In Step 3, we calculated the error at the output of the hidden layer ($\frac{\partial L}{\partial \mathbf{a}^{(1)}}$). Now, we need to pass this error through the ReLU activation function to get the error at the pre-activation state ($\mathbf{z}^{(1)}$). This is necessary to update the weights $W^{(1)}$ later.

The Chain Rule Path: The Loss depends on $\mathbf{a}^{(1)}$, and $\mathbf{a}^{(1)}$ depends on $\mathbf{z}^{(1)}$ via the activation function:

$$ \mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)}) $$

Therefore, by the Chain Rule:

$$ \delta^{(1)} = \dfrac{\partial L}{\partial \mathbf{z}^{(1)}} = \dfrac{\partial L}{\partial \mathbf{a}^{(1)}} \cdot \dfrac{\partial \mathbf{a}^{(1)}}{\partial \mathbf{z}^{(1)}} $$

1. The First Term (Upstream Gradient): We calculated this in Step 3:

$$ \dfrac{\partial L}{\partial \mathbf{a}^{(1)}} = (W^{(2)})^T \cdot \delta^{(2)} $$

2. The Second Term (Local Gradient): This is the derivative of the activation function itself. Since $\mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)})$, the derivative is simply:

$$ \begin{aligned} \dfrac{\partial \mathbf{a}^{(1)}}{\partial \mathbf{z}^{(1)}} &= \sigma'_{\text{ReLU}}(\mathbf{z}^{(1)}) \\ &= \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases} \end{aligned} $$

(This is often written as the indicator function $\mathbb{1}[\mathbf{z}^{(1)} > 0]$ ).

3. Combine (Element-wise Multiplication): Unlike the weights (which mix all inputs together via matrix multiplication), the activation function operates on every neuron independently.

  • Neuron 1's activation only depends on Neuron 1's $z$.
  • Neuron 2's activation only depends on Neuron 2's $z$.

Because there is no "mixing" between neurons here, we use Element-wise Multiplication (Hadamard Product, denoted by $\odot$), not Matrix Multiplication.

$$ \begin{aligned} \delta^{(1)} &= \dfrac{\partial L}{\partial \mathbf{a}^{(1)}} \odot \sigma'_{\text{ReLU}}(\mathbf{z}^{(1)}) \\ &= \left( (W^{(2)})^T \delta^{(2)} \right) \odot \mathbb{1}[\mathbf{z}^{(1)} > 0] \end{aligned} $$

Intuition: Think of the ReLU derivative as a Gate:

  • If the neuron fired ($z > 0$), the gate is Open (1). The error flows through unchanged.
  • If the neuron was dead ($z \le 0$), the gate is Closed (0). The error is killed, and no gradient flows back to update the weights for this specific neuron.
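The gate behaviour can be sketched directly; the values of $\mathbf{z}^{(1)}$ and the upstream gradient below are illustrative assumptions:

```python
import numpy as np

z1 = np.array([[1.5], [-0.3], [0.0]])          # pre-activations
upstream = np.array([[0.2], [0.4], [-0.1]])    # dL/da1 from the previous step

gate = (z1 > 0).astype(float)   # indicator 1[z1 > 0]: open = 1, closed = 0
delta1 = upstream * gate        # element-wise (Hadamard) product
```

Only the first neuron fired, so only its gradient survives; the error at the dead neurons is zeroed out.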

Step 5: Gradients for Layer 1 Parameters ($W^{(1)}, b^{(1)}$)

Finally, we calculate gradients for the first layer weights. This follows the exact same pattern as Step 2.

$$ \begin{aligned} \dfrac{\partial L}{\partial W^{(1)}} &= \delta^{(1)} \cdot (\mathbf{x})^T \\ \dfrac{\partial L}{\partial \mathbf{b}^{(1)}} &= \delta^{(1)} \end{aligned} $$

Summary Algorithm (General Form)

For any layer $l$:

  1. Compute Error term ($\delta^{(l)}$):
    • Output Layer: $\widehat{y} - y$
    • Hidden Layer: $(W^{(l+1)})^T \delta^{(l+1)} \odot \sigma'(\mathbf{z}^{(l)})$
  2. Compute Weight Gradient:
    • $\delta^{(l)} \cdot (\mathbf{a}^{(l-1)})^T$
  3. Compute Bias Gradient:
    • $\delta^{(l)}$
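The whole algorithm can be sketched end-to-end for the 2-layer network above. The sizes (2 inputs, 3 hidden units, 1 output) and the random weights are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 1))
y = 1.0
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))

# Forward pass
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)                  # ReLU
z2 = W2 @ a1 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))       # sigmoid

# Backward pass, following the summary algorithm
delta2 = y_hat - y                      # output-layer error
dW2, db2 = delta2 @ a1.T, delta2
delta1 = (W2.T @ delta2) * (z1 > 0)     # hidden-layer error with ReLU gate
dW1, db1 = delta1 @ x.T, delta1
```

Every gradient has the same shape as the parameter it updates, which is the invariant worth checking first when implementing backpropagation by hand.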