08 Convolutional Neural Network (CNN) Operations
The basic convolution operation works by placing a small filter (kernel) over a local region of the input image.
At each position:
- Multiply each input value by the corresponding filter value
- Sum all the results
- Store the sum as one value in the output
Convolution is:
element-wise multiply, then sum
A single filter applied to one local patch produces one output value.
Imagine we are looking at a 3×3 patch of an image, and we apply a 3×3 filter.
Image Patch:
| 1 | 2 | 0 |
| 0 | 1 | 3 |
| 2 | 3 | 1 |
Filter:
| 1 | 0 | -1 |
| 1 | 0 | -1 |
| 1 | 0 | -1 |
Calculation (Element-wise multiply, then sum):
- Top row: $(1×1) + (2×0) + (0×-1) = 1 + 0 + 0 = 1$
- Middle row: $(0×1) + (1×0) + (3×-1) = 0 + 0 - 3 = -3$
- Bottom row: $(2×1) + (3×0) + (1×-1) = 2 + 0 - 1 = 1$
So the convolution output for this patch is the total sum: $1 + (-3) + 1 = -1$
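A minimal NumPy sketch of this single-patch computation (NumPy is assumed here purely for illustration):

```python
import numpy as np

# The 3x3 image patch and 3x3 filter from the example above
patch = np.array([[1, 2, 0],
                  [0, 1, 3],
                  [2, 3, 1]])

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Element-wise multiply, then sum -> one output value
output_value = np.sum(patch * kernel)
print(output_value)  # -1
```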
The filter does not stay in one place.
It slides across the image from left to right and top to bottom.
At every location, the same computation is repeated:
- take the local patch
- apply the filter
- compute one number
All of these output values together form a feature map:
- one patch → one output value
- sliding across the image → full output feature map
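The sliding operation can be sketched with two plain loops. This is an illustrative NumPy version (stride 1, no padding), not an efficient implementation:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)
    and compute one output value per position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # element-wise multiply, then sum -> one value per position
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A hypothetical 5x5 input and the vertical-edge filter from above
image = np.arange(25).reshape(5, 5)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
print(conv2d_valid(image, kernel).shape)  # (3, 3) feature map
```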
After convolution produces a raw feature map, a nonlinear activation function such as ReLU is typically applied element-wise. The resulting activated feature map is then passed to the next layer.
Different filters detect different visual patterns, such as:
- Vertical edge detectors
- Horizontal edge detectors
- Averaging filters (blurring)
This implies that CNN filters can learn to identify informative visual features, including:
- edges
- corners
- textures
- shapes
Each filter tends to specialize in responding to a particular type of pattern present in the input.
In traditional image processing, filters such as horizontal edge detectors or vertical edge detectors are usually manually designed and remain fixed during computation.
In contrast, in a CNN, the filter weights are typically initialized randomly and are then learned automatically from data during training through backpropagation and gradient-based optimization.
As training progresses, the network adjusts these filter weights so that they become effective feature detectors for the target task.
In general:
- earlier convolutional layers tend to learn low-level features such as edges, lines, and color contrasts
- deeper convolutional layers tend to learn higher-level and more abstract features such as textures, object parts, and complex shapes
Stride means how many pixels the filter moves at each step.
Stride = 1 (move one pixel at a time):
- more overlap between neighboring patches
- more computation
- larger output
Stride = 2 (move two pixels at a time):
- less overlap
- less computation
- smaller output
A higher stride produces a smaller output, so increasing the stride is a form of downsampling.
Without padding, the output becomes smaller than the input because the filter cannot fully cover the border pixels.
Valid padding (no padding):
- no padding added
- output is smaller than the input
Example:
- input: 5 × 5
- filter: 3 × 3
- output: 3 × 3
Same padding (zero padding):
- add zeros around the border before convolution
- output size is preserved (for stride 1)
Example:
- input: 5 × 5
- filter: 3 × 3
- output: 5 × 5
So padding helps preserve border information and control output size.
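As a quick check of this behaviour, here is a small sketch using TensorFlow/Keras (the framework is an assumption; the notes do not prescribe one) comparing the two padding modes:

```python
import tensorflow as tf

x = tf.random.normal((1, 5, 5, 1))   # a batch of one 5x5 single-channel image

valid = tf.keras.layers.Conv2D(1, kernel_size=3, padding="valid")(x)
same  = tf.keras.layers.Conv2D(1, kernel_size=3, padding="same")(x)

print(valid.shape)  # (1, 3, 3, 1): without padding the output shrinks
print(same.shape)   # (1, 5, 5, 1): zero padding preserves the 5x5 size (stride 1)
```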
A CNN processes images in a 2D grid. The filter slides both horizontally (Left → Right) and vertically (Top → Bottom).
The output length along one dimension is given by:

$\text{Output} = \left\lfloor \dfrac{\text{Input} + 2 \times \text{Padding} - \text{Filter}}{\text{Stride}} \right\rfloor + 1$

This formula is applied independently to both dimensions:
- To find the Output Height: plug in the Input Height
- To find the Output Width: plug in the Input Width
Example:
If your input is 5×5 pixels, and the math tells you the output length is 3, then your full output feature map is 3×3 (3 rows × 3 columns).
Think of the sliding filter as walking along a path. Here is how the parts correspond to the movement:
Before the filter can slide, we must determine how much room it has.
- Start with the input size
- Add Padding (extends the space on both sides: left+right or top+bottom)
- Subtract Filter Size because the filter occupies space; you cannot place it such that it hangs off the edge
This gives you the total distance available for movement.
Now, divide that available distance by the Stride (how many pixels you jump at a time).
- If Stride = 1, you take many small steps
- If Stride = 2, you skip pixels
- We use floor brackets ($\lfloor \dots \rfloor$) because partial steps at the end do not count as valid output positions
Finally, add 1 to the count.
- Because taking 0 steps still means the filter can be placed at 1 valid starting position
- In math terms: Total Positions = Total Steps + 1
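Putting the pieces together, a small helper (illustrative only) that evaluates the output-size formula:

```python
import math

def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output length along one dimension:
    floor((input + 2*padding - filter) / stride) + 1."""
    return math.floor((input_size + 2 * padding - filter_size) / stride) + 1

print(conv_output_size(5, 3, padding=0, stride=1))  # 3  (5x5 input, 3x3 filter, no padding)
print(conv_output_size(5, 3, padding=1, stride=1))  # 5  ("same"-style padding preserves the size)
print(conv_output_size(5, 3, padding=0, stride=2))  # 2  (a larger stride downsamples)
```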
A convolutional layer usually uses multiple filters, not just one.
Each filter is applied across the input, and each filter produces one feature map.
Therefore, if a layer uses K filters, the output will contain K feature maps stacked along the channel dimension.
Number of filters = output depth (number of output channels)
In a filter of shape height × width × depth, the third dimension is the depth (or channel dimension) of the filter, and it must match the number of channels in the input.
For example, if the input is an RGB image of shape $H \times W \times 3$ (3 channels), then a single 3 × 3 filter actually has shape $3 \times 3 \times 3$.
This is considered one single filter, not three separate filters.
The three channel slices correspond to the three input channels (e.g., Red, Green, Blue), and their contributions are summed together to produce one output value at each spatial location.
Suppose the input has shape $H \times W \times 3$, and the convolutional layer uses 4 filters, each of size $3 \times 3 \times 3$.
Then:
- each filter spans the full input depth of 3
- each filter produces one feature map
- using 4 filters produces 4 output feature maps
So the output depth becomes 4, and the output shape is approximately $H' \times W' \times 4$, where $H'$ and $W'$ are the output height and width given by the output size formula (they depend on the filter size, stride, and padding).
In a CNN, the output depth of one layer becomes the input depth of the next layer.
Therefore, the depth of each filter in the next layer must match the number of output channels produced by the previous layer.
For example:
- if one layer outputs 32 feature maps
- then the next layer receives input depth 32
- so each filter in the next layer must have depth 32
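A short Keras sketch (framework assumed for illustration) showing that the number of filters sets the output depth, while each filter's depth is forced to match the input channels:

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))    # a hypothetical RGB input: 32 x 32 x 3

# 4 filters; each filter automatically gets depth 3 to match the input channels
layer1 = tf.keras.layers.Conv2D(filters=4, kernel_size=3, padding="same")
y1 = layer1(x)
print(y1.shape)                          # (1, 32, 32, 4): output depth = number of filters
print(layer1.get_weights()[0].shape)     # (3, 3, 3, 4): (height, width, input depth, filters)

# The next layer's filters must span the 4 channels produced above
layer2 = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding="same")
y2 = layer2(y1)
print(layer2.get_weights()[0].shape)     # (3, 3, 4, 8)
```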
The term depth can be confusing in CNNs because it may refer to two different things:
- Filter depth: the number of input channels that a filter spans
- Output depth: the number of output feature maps produced by a layer
The number of filters in a convolutional layer is chosen by the CNN designer.
It is an architectural hyperparameter and is often selected based on prior knowledge, common design patterns, or empirical experimentation.
In contrast, the depth of each filter is not chosen independently.
It is constrained by the input to that layer and must match the number of input channels, which is usually the output depth of the previous layer.
Pooling is a downsampling operation commonly used in CNNs to reduce the spatial dimensions (height and width) of feature maps.
Convolution layers may preserve the spatial size of the input (for example, when using same padding and stride 1), so pooling is a common way to reduce the spatial dimensions. Although convolution can also reduce spatial size when using larger strides or no padding, pooling is often used as an explicit downsampling operation.
In standard pooling operations, pooling is applied independently to each feature map. Therefore, pooling reduces the spatial dimensions (height and width) but preserves the depth (number of channels).
Pooling provides several advantages:
- reduces the spatial size of feature maps
- reduces computation in later layers
- reduces the number of parameters in later layers
- increases the effective receptive field (See 8.5) of later neurons
- provides some robustness to small translations in the input
Max pooling divides the feature map into small regions and outputs the maximum value from each region.
For example, using a 2 × 2 pooling window with stride = 2:
Input
| 1 | 3 | 2 | 4 |
| 5 | 6 | 1 | 2 |
| 3 | 2 | 7 | 8 |
| 1 | 0 | 3 | 4 |
The 2 × 2 pooling regions are:
- top-left block:
| 1 | 3 |
| 5 | 6 |
- top-right block:
| 2 | 4 |
| 1 | 2 |
- bottom-left block:
| 3 | 2 |
| 1 | 0 |
- bottom-right block:
| 7 | 8 |
| 3 | 4 |
Take the maximum from each block:
- top-left block: max(1, 3, 5, 6) = 6
- top-right block: max(2, 4, 1, 2) = 4
- bottom-left block: max(3, 2, 1, 0) = 3
- bottom-right block: max(7, 8, 3, 4) = 8
Output after Max Pooling
| 6 | 4 |
| 3 | 8 |
Max pooling keeps the strongest activation in each region.
Average pooling divides the feature map into small regions and outputs the average value from each region.
Using the same 2 × 2 pooling window with stride = 2 on the same input example of Max Pooling:
We take the average from each 2 × 2 block:
- top-left block: (1 + 3 + 5 + 6) / 4 = 3.75
- top-right block: (2 + 4 + 1 + 2) / 4 = 2.25
- bottom-left block: (3 + 2 + 1 + 0) / 4 = 1.5
- bottom-right block: (7 + 8 + 3 + 4) / 4 = 5.5
Output after Average Pooling
| 3.75 | 2.25 |
| 1.5 | 5.5 |
Average pooling keeps the average response in each region instead of the strongest one.
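Both pooling variants can be sketched in a few lines of NumPy (illustrative only), reproducing the two examples above:

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """2x2 pooling with stride 2, applied to a single feature map."""
    h, w = feature_map.shape
    # group the map into non-overlapping 2x2 blocks
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [3, 2, 7, 8],
              [1, 0, 3, 4]])

print(pool2x2(x, "max"))   # [[6 4] [3 8]]
print(pool2x2(x, "mean"))  # [[3.75 2.25] [1.5  5.5 ]]
```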
One important benefit of pooling, especially max pooling, is that it provides some robustness to small shifts in the input.
If a strong feature moves slightly within the same pooling window, the pooled output may remain unchanged.
For example:
- if a high activation value appears at one position inside a pooling region
- and then shifts by 1 pixel but still stays inside that same region
- max pooling may still output the same maximum value
This gives the network a degree of translation invariance (more precisely, robustness to small translations).
Common Pooling Settings
Typical pooling settings in CNNs are:
- Pool size: 2 × 2
- Stride: 2
- Type: Max pooling
Global Average Pooling (GAP) is a special form of average pooling where the pooling window covers the entire spatial dimension of each feature map.
For example, suppose a layer outputs 512 feature maps, each of spatial size $H \times W$. Applying global average pooling over the entire $H \times W$ area of each map means:
- each of the 512 feature maps is reduced to a single average value
- the final output contains 512 numbers
Global Average Pooling is often used near the end of a CNN, before the final classification layer.
- greatly reduces the number of parameters
- avoids flattening a large feature map into a huge vector
- commonly used before the final dense / softmax classification layer
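A short Keras sketch (the framework and the 7 × 7 × 512 input shape are assumptions for illustration) contrasting GAP with flattening:

```python
import tensorflow as tf

# Hypothetical final feature maps: 7x7 spatial size, 512 channels
features = tf.random.normal((1, 7, 7, 512))

gap = tf.keras.layers.GlobalAveragePooling2D()(features)
print(gap.shape)    # (1, 512): one average value per feature map

flat = tf.keras.layers.Flatten()(features)
print(flat.shape)   # (1, 25088): flattening the same volume gives a far longer vector
```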
The receptive field refers to the spatial extent of the portion of the original input image that a specific feature (neuron) in a deeper layer is "looking at."
While a single filter only covers a small patch (e.g., $3 \times 3$ pixels), the receptive field grows as layers are stacked, mainly in two ways:
- Stacking Layers: Each consecutive convolution adds to the total area covered. For example, two stacked $3 \times 3$ layers provide a $5 \times 5$ receptive field.
- Pooling & Stride: These are the most aggressive ways to increase the receptive field. By downsampling the data (e.g., $2 \times 2$ Max Pooling), you effectively "zoom out," allowing a small filter in the next layer to see a much larger percentage of the original image.
Tutor Insight:
> A large receptive field is critical for tasks like Object Detection. If a neuron's receptive field is only $10 \times 10$ pixels, it might see a "yellow patch" but won't have enough context to know if that patch belongs to a "school bus" or a "banana." Deep layers need a large receptive field to "see" the whole object.
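The growth of the receptive field can be tracked layer by layer with the standard recurrence $r_l = r_{l-1} + (k_l - 1)\cdot j_{l-1}$, where $j$ is the product of the strides of all earlier layers. A small illustrative sketch:

```python
def receptive_field(layers):
    """Receptive field of the last layer, given (kernel_size, stride) per layer."""
    rf, jump = 1, 1                      # start: one input pixel, step of 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump        # each layer widens the field
        jump *= stride                   # strides/pooling multiply future growth
    return rf

print(receptive_field([(3, 1), (3, 1)]))           # 5: two stacked 3x3 convs see 5x5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 8: 2x2 stride-2 pooling grows it faster
```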
A CNN is typically divided into two distinct stages:
- Feature Extraction
- Classification
The data flows sequentially:
Input Image → Feature Extraction → Flatten → Classification
This stage consists of repeated blocks of operations. A standard block usually follows the pattern: Conv + ReLU + Pool.
Operations within a block:
| Layer | Operation | Effect |
|---|---|---|
| Conv | Apply filters | Detect features |
| ReLU | Non-linearity | Enable complex patterns |
| Pool | Downsample | Reduce size, add invariance |
Changes to the data shape: As the signal passes through multiple Conv + Pool blocks, a specific structural pattern emerges:
- Spatial dimensions decrease: Height and width are halved (roughly) at each block.
- Number of filters increases: The depth is doubled (roughly) at each block.
- Feature Abstraction: The network builds more abstract features as it goes deeper.
After the feature extraction stage, a Flatten layer is used. This converts the multi-dimensional feature maps (3D) into a single long vector (1D) to serve as input for the next stage.
How Flattening works: The layer "unrolls" the 3D volume. The feature maps (channels) are stacked end-to-end to form a single continuous list of numbers.
Note: The exact order—whether it goes row-by-row then channel-by-channel, or pixel-by-pixel across channels—depends on the specific code framework, but conceptually, all values are retained.
Formula (Vector Length): Although the values are only rearranged (stacked), the total length of the resulting 1D vector is the product of the dimensions:

$\text{Vector length} = \text{Height} \times \text{Width} \times \text{Depth}$

Example: If the final feature maps have shape 4 × 4 and there are 512 filters (depth):
- The 3D input is $4 \times 4 \times 512$
- The 1D output vector length is $8{,}192$ (derived from $4 \times 4 \times 512$)
This long vector of 8,192 values is then passed to the classification stage.
The final stage is responsible for assigning the input to a class. It uses Dense (Fully Connected) layers.
- Dense + ReLU: Further processing of the flattened features.
- Dense (Softmax): The final layer that outputs the classification probabilities.
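A minimal end-to-end sketch of this two-stage structure in Keras (the 32 × 32 × 3 input size, layer widths, and 10 classes are assumptions for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),              # assumed 32x32 RGB input

    # Feature extraction: repeated Conv + ReLU + Pool blocks
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),                          # 32x32x32 -> 16x16x32
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),                          # 16x16x64 -> 8x8x64

    # Flatten: 8x8x64 -> vector of 4,096 values
    layers.Flatten(),

    # Classification: Dense + ReLU, then Dense + Softmax
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),          # assumed 10 classes
])
model.summary()
```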
Training very deep neural networks is notoriously difficult. While theoretical logic suggests that adding more layers should allow a network to learn more complex features, in practice, standard "plain" networks suffer from the Degradation Problem.
In traditional ("plain") networks without skip connections:
- Vanishing/Exploding Gradients: As gradients are backpropagated through many layers, they can shrink to zero or explode to infinity, making convergence difficult.
- Performance Degradation: As you increase the number of layers (e.g., from 20 to 56), the training error often increases. This is not overfitting (where training error is low and test error is high); rather, the optimization algorithm simply struggles to train the deeper network effectively.
ResNet (Residual Network) solves this by introducing Skip Connections (also called Shortcuts).
A ResNet is built by stacking many Residual Blocks. A residual block consists of a "Main Path" and a "Shortcut".
The Main Path: the information flows through a standard stack of layers (typically Convolution → ReLU → Convolution). Let the input to a block be $a^{[l]}$.
The Shortcut: the input $a^{[l]}$ is passed directly around the main path and added to its output just before the final activation.
In a standard (plain) network, the flow is:

$a^{[l+2]} = g(z^{[l+2]})$

In a Residual Block, the flow is:

$a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$
Where:
- $a^{[l]}$ is the activation from the previous block
- $z^{[l+2]}$ is the output of the current block's linear operations (weights and bias)
- $g$ is the activation function (usually ReLU)
- the addition $(z^{[l+2]} + a^{[l]})$ happens element-wise
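A minimal Keras sketch of such a block (the layer sizes are assumptions, and batch normalization is omitted for brevity):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """An identity residual block: a main path of two convolutions,
    plus a shortcut that adds the input back before the final ReLU."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)     # z^[l+2]
    y = layers.Add()([y, shortcut])                       # z^[l+2] + a^[l]
    return layers.Activation("relu")(y)                   # g(z^[l+2] + a^[l])

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)   # filters match the input depth, so shapes line up
model = tf.keras.Model(inputs, outputs)
```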
The core reason ResNets work is that they make it very easy for a block to learn the Identity Function.
Imagine we have a deep network, and we add extra layers to it. In a worst-case scenario, these new layers should simply do nothing (copy the input to the output) so that they don't hurt the performance.
In a Plain Network:
For the layers to learn the identity function ($a^{[l+2]} = a^{[l]}$), the weights must be tuned so that the stacked layers reproduce their input exactly, which turns out to be hard for the optimizer.
In a ResNet: Look at the formula again: $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$
If the regularizer (like L2 regularization) shrinks the weights toward zero, then $z^{[l+2]} \approx 0$.
Assuming ReLU activation (where $g(x) = x$ for $x \ge 0$), the output becomes $a^{[l+2]} \approx g(a^{[l]}) = a^{[l]}$, so the block simply passes its input through.
Conclusion:
- It is easy for the network to set the weights to zero and turn the block into an identity mapping ($F(x) = 0$).
- This guarantees that a deeper network will perform at least as well as a shallower network.
- The network can then learn to extract new features only if they actually improve performance (learning the "residual").
To perform the element-wise addition $z^{[l+2]} + a^{[l]}$, the two tensors must have the same dimensions (height, width, and number of channels).
Case 1: Same Convolutions (Dimensions Preserved) If the block uses "Same" padding and stride 1, the dimensions match automatically. The addition is straightforward.
Case 2: Pooling or Stride > 1 (Dimensions Change)
If the height/width decreases or the number of channels changes inside the block, we cannot add $a^{[l]}$ to $z^{[l+2]}$ directly.
There are two common strategies applied to the shortcut:
- Learned Matrix ($W_s$): Use a $1 \times 1$ convolution (represented as matrix $W_s$) to linearly project $a^{[l]}$ to the correct shape.
- Zero Padding: Simply pad the input $a^{[l]}$ with zeros to match the dimensions (no extra parameters to learn).
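A sketch of the first strategy, using a $1 \times 1$ convolution ($W_s$) on the shortcut when the main path downsamples and changes depth (illustrative Keras code; the sizes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_projection(x, filters, stride=2):
    """Residual block where the main path downsamples and changes depth,
    so the shortcut uses a 1x1 convolution (W_s) to match dimensions."""
    y = layers.Conv2D(filters, 3, strides=stride, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)

    # Shortcut: a 1x1 conv with the same stride projects a^[l] to the new shape
    shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)

    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)

x = tf.keras.Input(shape=(32, 32, 64))
out = residual_block_projection(x, 128)   # 32x32x64 -> 16x16x128
print(out.shape)                          # (None, 16, 16, 128)
```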
- Plain Networks: Deeper $\neq$ Better. Error increases as depth increases.
- ResNets: Deeper $=$ Better. Because the "degradation" problem is solved via identity mapping, we can train networks with 100+ or even 1000+ layers, and the error continues to decrease.
Deep neural networks suffer from a Data Hunger Problem.
- Small datasets lead to overfitting.
- More data leads to better generalization.
- However, collecting and labeling new data is often expensive and time-consuming.
Data Augmentation solves this by creating more training examples artificially from the existing data. It relies on the concept that semantic meaning is invariant to transformation:
- A flipped cat is still a cat.
Augmentation effectively increases the dataset size dramatically without collecting new images.
| Scenario | What happens? | Consequence |
|---|---|---|
| Without Augmentation | The model sees the same exact images every epoch. | It learns to memorize specific pixel patterns. High risk of overfitting. |
| With Augmentation | The model sees different versions of the image each epoch. | It is forced to learn robust features rather than specific pixels. Acts as a form of regularization. |
Different transformations simulate different real-world variations.
| Augmentation | Typical Range | Notes |
|---|---|---|
| Rotation | | Depends on object orientation (e.g., a tree vs. a ball). |
| Width/Height Shift | | Simulates the object moving position within the frame. |
| Horizontal Flip | On/Off | Great for animals/objects. Not for text or asymmetric objects. |
| Zoom | | Simulates camera distance. |
| Brightness | | Simulates different lighting conditions. |
A critical rule in deep learning pipelines:
- Apply augmentation ONLY during training.
  - Training: we want diversity and difficulty to learn robust features.
- NEVER augment validation or test data.
  - Validation/Testing: we want to evaluate performance on the "real" unmodified data to get accurate metrics.
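One common way to enforce this rule is with Keras preprocessing layers, which only transform images in training mode (a sketch with assumed ranges; tune them for your task):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Augmentation layers (the ranges here are assumptions; tune them for your task)
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),   # roughly +/- 18 degrees
    layers.RandomZoom(0.1),
])

inputs = tf.keras.Input(shape=(224, 224, 3))
# Active only in training mode: during evaluation and prediction these layers
# pass images through unchanged, so validation/test data stays unmodified.
x = data_augmentation(inputs)
# ... the convolutional base and classification head would follow here
```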
Do:
- Start with mild augmentation; increase the intensity if the model is still overfitting.
- Combine multiple augmentations (e.g., rotate AND change brightness).
- Choose transforms that are appropriate for the specific task.
Don't:
- Do not use transforms that change the label.
  - Example: Do not vertically flip the digit "6" (it becomes "9").
  - Example: Do not horizontally flip text recognition data.
- Do not over-augment (distort images so much that features are destroyed).
Transfer Learning is a technique that leverages knowledge learned by a model trained on a massive dataset and applies it to a new, often smaller, target task. It avoids the need to train a complex network from scratch.
The process follows two primary steps:
- Pre-training: A CNN is first trained on a very large dataset (e.g., ImageNet with 1M+ images). In this phase, the network learns general features such as edges, textures, shapes, and object parts.
- Adaptation: The pre-trained weights are transferred to your new model. You modify the final layers to match your specific classes and train only these new layers (or fine-tune the whole model) using your smaller dataset (often just a few hundred images).
Benefits:
- Works effectively with small datasets.
- Trains much faster (in minutes rather than days).
- Often achieves higher accuracy than training from scratch due to better feature initialization.
CNN layers learn features hierarchically. The deeper you go, the more abstract and specific the features become.
| Layer Stage | Features Learned | Transferability |
|---|---|---|
| Early | Edges, colors | Very general (Always useful) |
| Middle | Textures, patterns | Fairly general |
| Late | Object parts (e.g., eyes, wheels) | Somewhat task-specific |
| Final | Full objects (e.g., cats, cars) | Task-specific (Usually not reused) |
Because early and middle layers detect fundamental visual structures, they work well for almost any image task.
To apply transfer learning, follow this standard workflow:
- Load a pre-trained model: Load weights from a model trained on a large dataset like ImageNet.
- Remove original classification layers: Discard the dense layers at the top that were specific to the original task (e.g., the original 1000 ImageNet classes).
- Add new classification head: Build new layers for your task. This typically includes:
  - Global Average Pooling (to reduce spatial dimensions).
  - One or two Dense layers.
  - A final Softmax layer matching your number of classes.
- Train: Depending on the strategy (see below), freeze some layers and train the rest.
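A sketch of this workflow in Keras, using MobileNetV2 as an assumed base model and 5 target classes as an assumed example (feature-extraction strategy shown):

```python
import tensorflow as tf
from tensorflow.keras import layers

# 1. Load a pre-trained model (ImageNet weights) without its original classifier
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# 2. Freeze the pre-trained convolutional base (feature extraction strategy)
base.trainable = False

# 3. Add a new classification head (5 target classes assumed here)
inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(5, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 4. Train only the new head:
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```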
There are two main strategies for adapting a pre-trained model:
In this approach, the pre-trained layers act solely as fixed feature extractors.
- Action: Freeze the pre-trained convolutional base (set trainable = False).
- Training: Only train the new classifier head.
- Use Case: Best when your dataset is tiny and similar to ImageNet.
- Pros: Very fast to train; low risk of overfitting the base weights.
In this approach, we adjust the pre-trained weights to adapt them specifically to our data distribution.
- Action: Unfreeze some or all of the convolutional layers.
- Training: Train the entire network with a small learning rate (to make small adjustments to already good weights).
- Use Case: Better if you have more data, especially if your data is different from ImageNet.
- Why Small LR? Prevents destroying the generic features learned during pre-training.
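Continuing the feature-extraction sketch above, fine-tuning might look like this (the number of unfrozen layers and the learning rate are assumptions to tune):

```python
import tensorflow as tf

# Unfreeze the base (or only its top layers) and retrain gently
base.trainable = True
for layer in base.layers[:-30]:   # keep earlier, more generic layers frozen (assumed cutoff)
    layer.trainable = False

# Recompile with a much smaller learning rate so the pre-trained weights
# are only nudged, not destroyed
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```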
The choice between Feature Extraction and Fine-Tuning depends on your data volume and similarity to the source domain.
| Your Data | Amount | Recommended Approach |
|---|---|---|
| Similar to ImageNet | Small (100s) | Feature Extraction |
| Similar to ImageNet | Large (10K+) | Fine-tune Top Layers |
| Different from ImageNet | Small | Feature Extraction + Heavy Augmentation |
| Different from ImageNet | Large | Fine-tune Whole Network |
Guideline: Start with Feature Extraction. If performance plateaus, try Fine-Tuning.