# 04 Forward and Back Propagation
## Derivative Chain Rule Quick Review
The chain rule tells us how to find the derivative of a composite function. If $h(x)=g(f(x))$,
$$ h'(x)=g'\left(f(x)\right) \cdot f'(x) $$
or
$$ \dfrac {d} {dx}~h(x) = \dfrac{d}{dx}~g\left(f(x)\right) = \dfrac{dg}{df} \cdot \dfrac{df}{dx} $$
### Intuitive Explanation
- If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 × 4 = 8 times as fast as the man.
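As a quick sanity check, the rule can also be verified numerically. Below is a minimal Python sketch; the choices $f(x)=x^2$ and $g(u)=\sin u$ are arbitrary illustrations, not from the original notes:

```python
import numpy as np

# Chain rule check on h(x) = g(f(x)) with f(x) = x**2 and g(u) = sin(u).
f  = lambda x: x**2
df = lambda x: 2 * x          # f'(x)
g  = lambda u: np.sin(u)
dg = lambda u: np.cos(u)      # g'(u)

x = 1.3
analytic = dg(f(x)) * df(x)   # chain rule: g'(f(x)) * f'(x)

eps = 1e-6                    # central finite-difference estimate of h'(x)
numeric = (g(f(x + eps)) - g(f(x - eps))) / (2 * eps)

print(analytic, numeric)      # the two values agree to ~1e-9
```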
## Applying the Chain Rule to a Single Neuron
The loss depends on the weight through several intermediate steps; we need to find the $w$ that minimizes $L$:
$$
\begin{aligned}
L &= L(a) \\
a &= \sigma(z) \\
z &= wx + b
\end{aligned}
$$
The Chain Rule lets us trace how changes in $w$ propagate to $L$:
$$ \dfrac{\partial L} {\partial w}= \dfrac{\partial L} {\partial a} \cdot \dfrac{\partial a} {\partial z} \cdot \dfrac{\partial z} {\partial w} $$
For example, for a neuron with input $x$, weight $w$, a sigmoid activation, and Binary Cross Entropy loss:
$$
\begin{aligned}
z &= wx + b &&\Rightarrow \dfrac{\partial z}{\partial w}=x \\
a &= \sigma(z) = \dfrac{1}{1+e^{-z}} &&\Rightarrow \dfrac{\partial a}{\partial z}=\sigma(z)(1-\sigma(z))=a(1-a) \\
L &= -[y~\log(a)+(1-y)~\log(1-a)] &&\Rightarrow \dfrac{\partial L}{\partial a}=-\dfrac{y}{a} + \dfrac{1-y}{1-a}
\end{aligned}
$$
- Derivation of $\dfrac{\partial a}{\partial z}$: see here
- Derivation of $\dfrac{\partial L}{\partial a}$: see here
$$
\begin{aligned}
\therefore \dfrac{\partial L}{\partial w} &= \dfrac{\partial L}{\partial a} \cdot \dfrac{\partial a}{\partial z} \cdot \dfrac{\partial z}{\partial w} \\
&= \left(-\dfrac{y}{a} + \dfrac{1-y}{1-a}\right) \cdot a(1-a) \cdot x \\
&= (a-y) \cdot x
\end{aligned}
$$
- Forward Propagation: Calculate $z$, then $a$, then Loss $L$.
- Back Propagation: Calculate the gradient from $a$ and the truth label $y$: $\frac{\partial L}{\partial w} = (a-y)x$ (see the sketch below).
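Putting both passes together, here is a minimal NumPy sketch of this single neuron; the scalar values and variable names are illustrative, and the chain-rule product is compared against the simplified form $(a-y)x$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative scalar values (not from the original notes)
x, y = 2.0, 1.0          # input and truth label
w, b = 0.5, 0.1          # weight and bias

# Forward propagation: z -> a -> L
z = w * x + b
a = sigmoid(z)
L = -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Back propagation via the chain rule
dL_da = -y / a + (1 - y) / (1 - a)
da_dz = a * (1 - a)
dz_dw = x
dL_dw = dL_da * da_dz * dz_dw

print(dL_dw, (a - y) * x)   # identical: the product simplifies to (a - y) * x
```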
## 2-Layer Neural Network: Forward & Backward Propagation
This section details the math for a simple 2-layer Neural Network used for Binary Classification.
- Layer 1 (Hidden): Uses ReLU activation.
- Layer 2 (Output): Uses Sigmoid activation.
- Loss: Binary Cross Entropy.
### 1. The Forward Pass (Prediction)
We compute the activations layer by layer.
Layer 1 (Hidden):
$$
\begin{aligned}
\mathbf{z}^{(1)} &= W^{(1)}\mathbf{x} + \mathbf{b}^{(1)} \\
\mathbf{a}^{(1)} &= \text{ReLU}(\mathbf{z}^{(1)})
\end{aligned}
$$
Layer 2 (Output):
$$
\begin{aligned}
\mathbf{z}^{(2)} &= W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)} \\
\widehat{y} = \mathbf{a}^{(2)} &= \sigma(\mathbf{z}^{(2)})
\end{aligned}
$$
Loss Calculation:
$$ L = -\left[ y \log(\widehat{y}) + (1-y) \log(1-\widehat{y}) \right] $$
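The forward pass above maps directly onto a few lines of NumPy. This is a sketch with made-up layer sizes and random parameters; names like `W1` and `b1` are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 1          # illustrative sizes

# Parameters (random for the sketch)
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros((n_hidden, 1))
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros((n_out, 1))

x = rng.normal(size=(n_in, 1))           # one input column vector
y = 1.0                                   # truth label

# Layer 1 (hidden): linear step, then ReLU
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)

# Layer 2 (output): linear step, then sigmoid
z2 = W2 @ a1 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))

# Binary Cross Entropy loss
L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(L.item())
```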
### 2. The Backward Pass (Gradient Calculation)
We calculate gradients starting from the Loss and moving backward to the input. We use the Chain Rule recursively.
#### Step 1: Gradient at Output (Layer 2 Error)
We need the derivative of Loss w.r.t the raw output score $\mathbf{z}^{(2)}$. We apply the Chain Rule:
$$ \delta^{(2)} = \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} = \dfrac{\partial L}{\partial \widehat{y}} \cdot \dfrac{\partial \widehat{y}}{\partial \mathbf{z}^{(2)}} $$
To solve this, we derive the two parts separately and multiply them:
1. Derivative of BCE Loss w.r.t Prediction ($\widehat{y}$): From the definition of Binary Cross Entropy:
$$ \dfrac{\partial L}{\partial \widehat{y}} = -\dfrac{y}{\widehat{y}} + \dfrac{1-y}{1-\widehat{y}} = \dfrac{\widehat{y} - y}{\widehat{y}(1-\widehat{y})} $$
2. Derivative of Sigmoid Activation w.r.t Logit ($z$): From the definition of Sigmoid:
$$ \dfrac{\partial \widehat{y}}{\partial \mathbf{z}^{(2)}} = \sigma'(\mathbf{z}^{(2)}) = \widehat{y}(1-\widehat{y}) $$
3. Combine them (The Cancellation): Multiplying the two results, the denominator of the Loss gradient cancels perfectly with the Sigmoid gradient:
$$
\begin{aligned}
\delta^{(2)} &= \left[ \dfrac{\widehat{y} - y}{\widehat{y}(1-\widehat{y})} \right] \cdot \left[ \widehat{y}(1-\widehat{y}) \right] \\
&= \widehat{y} - y
\end{aligned}
$$
Result:
$$ \delta^{(2)} = \widehat{y} - y $$
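The cancellation is easy to confirm numerically. A tiny sketch with illustrative values (`y_hat` can be any value strictly between 0 and 1):

```python
# Illustrative values (not from the original notes)
y, y_hat = 1.0, 0.7

dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)   # BCE gradient w.r.t. prediction
dyhat_dz = y_hat * (1 - y_hat)                  # sigmoid local gradient

print(dL_dyhat * dyhat_dz, y_hat - y)           # both evaluate to -0.3
```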
#### Step 2: Gradients for Layer 2 Parameters ($W^{(2)}, b^{(2)}$)
Context: Now that we know the error at the output ($\delta^{(2)}$), we need to calculate how much each weight in Matrix $W^{(2)}$ contributed to that error, so we can update them.
We need to find $\dfrac{\partial L}{\partial W^{(2)}}$.
The Chain Rule Path: The Loss depends on $z^{(2)}$, and $z^{(2)}$ depends on the weights $W^{(2)}$:
$$ \dfrac{\partial L}{\partial W^{(2)}} = \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} \cdot \dfrac{\partial \mathbf{z}^{(2)}}{\partial W^{(2)}} $$
1. The First Term: We calculated this in Step 1.
$$ \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} = \delta^{(2)} $$
2. The Second Term: Look at the linear equation: $\mathbf{z}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}$. The derivative of $z$ with respect to the weight matrix is simply the input vector $\mathbf{a}^{(1)}$.
3. Combine and Transpose (Matrix Calculus): Mathematically, we are multiplying the error $\delta^{(2)}$ by the input $\mathbf{a}^{(1)}$. However, we must align dimensions to match the shape of the weight matrix $W^{(2)}$.
- Goal: The gradient matrix $\frac{\partial L}{\partial W^{(2)}}$ must have the same shape as $W^{(2)}$: $(n_{out}, n_{hidden})$.
- We have:
- Error $\delta^{(2)}$ is a Column Vector $(n_{out}, 1)$.
- Input $\mathbf{a}^{(1)}$ is a Column Vector $(n_{hidden}, 1)$.
To get a Matrix of size $(n_{out}, n_{hidden})$ from two column vectors, we must perform an Outer Product. This requires turning the second vector (the input) into a Row Vector by transposing it.
$$
\begin{aligned}
\dfrac{\partial L}{\partial W^{(2)}} &= \delta^{(2)} \cdot (\mathbf{a}^{(1)})^T \\
(n_{out}, n_{hidden}) &= (n_{out}, 1) \cdot (1, n_{hidden})
\end{aligned}
$$
Intuition:
Think of it as: Gradient = Error * Input.
- If the input $\mathbf{a}^{(1)}$ was large (strong signal), the weight contributed more to the error.
- If the error $\delta^{(2)}$ is large, the weight needs a larger update.
- The Outer Product calculates this "contribution" for every possible pair of (Output Node, Input Node).
4. Bias Gradient: For the bias $b^{(2)}$, the input is effectively always 1. So the gradient is simply the error itself:
$$ \dfrac{\partial L}{\partial \mathbf{b}^{(2)}} = \delta^{(2)} $$
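In NumPy, the outer product is simply `delta2 @ a1.T`. A sketch with hypothetical values matching the shapes above:

```python
import numpy as np

# Illustrative shapes: n_out = 1, n_hidden = 4
delta2 = np.array([[0.3]])                    # error, shape (n_out, 1)
a1 = np.array([[0.5], [0.0], [1.2], [0.7]])   # hidden activations, shape (n_hidden, 1)

dW2 = delta2 @ a1.T                           # outer product -> shape (n_out, n_hidden)
db2 = delta2                                  # bias gradient is the error itself

print(dW2.shape)                              # (1, 4), matching W2
```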
#### Step 3: Propagate Gradient to Hidden Layer (Layer 1 Error)
Context: In Step 1, we calculated the error at the output ($\delta^{(2)}$). Now, we need to send this error backwards to the hidden layer so we can update Layer 1's weights later.
We need to find $\dfrac{\partial L}{\partial \mathbf{a}^{(1)}}$.
The Chain Rule Path: The Loss $L$ depends on $z^{(2)}$, and $z^{(2)}$ depends on $\mathbf{a}^{(1)}$:
$$ \dfrac{\partial L}{\partial \mathbf{a}^{(1)}} = \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} \cdot \dfrac{\partial \mathbf{z}^{(2)}}{\partial \mathbf{a}^{(1)}} $$
1. The First Term: We already calculated this in Step 1.
$$ \dfrac{\partial L}{\partial \mathbf{z}^{(2)}} = \delta^{(2)} $$
2. The Second Term: Look at the linear equation for Layer 2:
$$ \mathbf{z}^{(2)} = W^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)} $$
To find the derivative w.r.t $\mathbf{a}^{(1)}$, we treat $W^{(2)}$ as a constant coefficient:
$$ \dfrac{\partial \mathbf{z}^{(2)}}{\partial \mathbf{a}^{(1)}} = W^{(2)} $$
3. Combine and Transpose (Matrix Calculus): Mathematically, we multiply $\delta^{(2)}$ by $W^{(2)}$. However, we must align the dimensions for the matrix multiplication to work.
- $\delta^{(2)}$ is size $(n_{out}, 1)$.
- $W^{(2)}$ is size $(n_{out}, n_{hidden})$.
- We need the result $\frac{\partial L}{\partial \mathbf{a}^{(1)}}$ to match the size of $\mathbf{a}^{(1)}$, which is $(n_{hidden}, 1)$.
To achieve this shape, we must use the Transpose of the Weight matrix:
$$
\begin{aligned}
\dfrac{\partial L}{\partial \mathbf{a}^{(1)}} &= (W^{(2)})^T \cdot \delta^{(2)} \\
(n_{hidden}, 1) &= (n_{hidden}, n_{out}) \cdot (n_{out}, 1)
\end{aligned}
$$
Intuition: The error $\delta^{(2)}$ flows backward through the weights. The transpose $(W^{(2)})^T$ effectively "distributes" the error from the output neurons back to the hidden neurons that contributed to them.
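Shape-wise, this step is a single matrix-vector product with the transposed weights. A sketch reusing the hypothetical sizes from the previous step:

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_hidden = 1, 4
W2 = rng.normal(size=(n_out, n_hidden))   # Layer 2 weights (random for the sketch)
delta2 = np.array([[0.3]])                # output error, shape (n_out, 1)

dL_da1 = W2.T @ delta2                    # (n_hidden, n_out) @ (n_out, 1) -> (n_hidden, 1)
print(dL_da1.shape)                       # (4, 1), matching a1
```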
#### Step 4: Gradient through Activation (ReLU)
Context: In Step 3, we calculated the error at the output of the hidden layer ($\frac{\partial L}{\partial \mathbf{a}^{(1)}}$). Now, we need to pass this error through the ReLU activation function to get the error at the pre-activation state ($\mathbf{z}^{(1)}$). This is necessary to update the weights $W^{(1)}$ later.
The Chain Rule Path: The Loss depends on $\mathbf{a}^{(1)}$, and $\mathbf{a}^{(1)}$ depends on $\mathbf{z}^{(1)}$ via the activation function:
$$ \mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)}) $$
Therefore, by the Chain Rule:
$$ \delta^{(1)} = \dfrac{\partial L}{\partial \mathbf{z}^{(1)}} = \dfrac{\partial L}{\partial \mathbf{a}^{(1)}} \cdot \dfrac{\partial \mathbf{a}^{(1)}}{\partial \mathbf{z}^{(1)}} $$
1. The First Term (Upstream Gradient): We calculated this in Step 3:
$$ \dfrac{\partial L}{\partial \mathbf{a}^{(1)}} = (W^{(2)})^T \cdot \delta^{(2)} $$
2. The Second Term (Local Gradient): This is the derivative of the activation function itself. Since $\mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)})$, the derivative is simply:
$$
\begin{aligned}
\dfrac{\partial \mathbf{a}^{(1)}}{\partial \mathbf{z}^{(1)}} &= \sigma'_{\text{ReLU}}(\mathbf{z}^{(1)}) \\
&= \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}
\end{aligned}
$$
(This is often written as the indicator function $\mathbb{1}[\mathbf{z}^{(1)} > 0]$.)
3. Combine (Element-wise Multiplication): Unlike the weights (which mix all inputs together via matrix multiplication), the activation function operates on every neuron independently.
- Neuron 1's activation only depends on Neuron 1's $z$.
- Neuron 2's activation only depends on Neuron 2's $z$.
Because there is no "mixing" between neurons here, we use Element-wise Multiplication (Hadamard Product, denoted by $\odot$), not Matrix Multiplication.
$$
\begin{aligned}
\delta^{(1)} &= \dfrac{\partial L}{\partial \mathbf{a}^{(1)}} \odot \sigma'_{\text{ReLU}}(\mathbf{z}^{(1)}) \\
&= \left( (W^{(2)})^T \delta^{(2)} \right) \odot \mathbb{1}[\mathbf{z}^{(1)} > 0]
\end{aligned}
$$
Intuition: Think of the ReLU derivative as a Gate:
- If the neuron fired ($z > 0$), the gate is Open (1). The error flows through unchanged.
- If the neuron was dead ($z \le 0$), the gate is Closed (0). The error is killed, and no gradient flows back to update the weights for this specific neuron.
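The gating behaviour is one element-wise multiply in code. A sketch with `z1` values chosen to show both open and closed gates (all values illustrative):

```python
import numpy as np

z1 = np.array([[0.8], [-0.2], [1.5], [-1.0]])      # pre-activations: two fired, two dead
upstream = np.array([[0.4], [0.4], [0.4], [0.4]])  # incoming gradient dL/da1

delta1 = upstream * (z1 > 0)   # Hadamard product with the 0/1 ReLU gate
print(delta1.ravel())          # [0.4 0.  0.4 0. ] -- dead neurons pass nothing back
```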
#### Step 5: Gradients for Layer 1 Parameters ($W^{(1)}, b^{(1)}$)
Finally, we calculate gradients for the first layer weights. This follows the exact same pattern as Step 2.
$$
\begin{aligned}
\dfrac{\partial L}{\partial W^{(1)}} &= \delta^{(1)} \cdot (\mathbf{x})^T \\
\dfrac{\partial L}{\partial \mathbf{b}^{(1)}} &= \delta^{(1)}
\end{aligned}
$$
## Summary Algorithm (General Form)
For any layer $l$:
- Compute Error term ($\delta^{(l)}$):
- Output Layer: $\widehat{y} - y$
- Hidden Layer: $\left((W^{(l+1)})^T \delta^{(l+1)}\right) \odot \sigma'(\mathbf{z}^{(l)})$
- Compute Weight Gradient:
- $\delta^{(l)} \cdot (\mathbf{a}^{(l-1)})^T$
- Compute Bias Gradient:
- $\delta^{(l)}$
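Putting the whole algorithm together, here is a compact end-to-end sketch of one forward/backward pass for the 2-layer network. All names and sizes are my own choices; the resulting gradients could then feed a gradient-descent update:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 1          # illustrative sizes

W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros((n_hidden, 1))
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros((n_out, 1))
x, y = rng.normal(size=(n_in, 1)), 1.0

# Forward pass
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)                    # ReLU
z2 = W2 @ a1 + b2
y_hat = 1.0 / (1.0 + np.exp(-z2))         # sigmoid

# Backward pass (the summary algorithm above)
delta2 = y_hat - y                        # output-layer error
dW2, db2 = delta2 @ a1.T, delta2          # Layer 2 parameter gradients
delta1 = (W2.T @ delta2) * (z1 > 0)       # hidden-layer error through the ReLU gate
dW1, db1 = delta1 @ x.T, delta1           # Layer 1 parameter gradients

for name, g in [("dW1", dW1), ("db1", db1), ("dW2", dW2), ("db2", db2)]:
    print(name, g.shape)                  # each gradient matches its parameter's shape
```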