08 Convolutional Neural Network (CNN) Operations
The basic convolution operation works by placing a small filter (kernel) over a local region of the input image.
At each position:
- Multiply each input value by the corresponding filter value
- Sum all the results
- Store the sum as one value in the output
Convolution is:
element-wise multiply, then sum
A single filter applied to one local patch produces one output value.
Imagine we are looking at a 3×3 patch of an image, and we apply a 3×3 filter.
Image Patch:
| 1 | 2 | 0 |
| 0 | 1 | 3 |
| 2 | 3 | 1 |
Filter:
| 1 | 0 | -1 |
| 1 | 0 | -1 |
| 1 | 0 | -1 |
Calculation (Element-wise multiply, then sum):
- Top row: $(1×1) + (2×0) + (0×-1) = 1 + 0 + 0 = 1$
- Middle row: $(0×1) + (1×0) + (3×-1) = 0 + 0 - 3 = -3$
- Bottom row: $(2×1) + (3×0) + (1×-1) = 2 + 0 - 1 = 1$
So the convolution output for this patch is the total sum: $1 + (-3) + 1 = -1$
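A minimal NumPy sketch of this single-patch computation (NumPy is assumed here purely for illustration):

```python
import numpy as np

# The 3x3 image patch and 3x3 filter from the example above
patch = np.array([[1, 2, 0],
                  [0, 1, 3],
                  [2, 3, 1]])

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

# Element-wise multiply, then sum -> one output value
output_value = np.sum(patch * kernel)
print(output_value)  # -1
```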
The filter does not stay in one place.
It slides across the image from left to right and top to bottom.
At every location, the same computation is repeated:
- take the local patch
- apply the filter
- compute one number
All of these output values together form a feature map:
- one patch → one output value
- sliding across the image → full output feature map
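The sliding operation can be sketched with two plain loops. This is an illustrative NumPy version (stride 1, no padding), not an efficient implementation:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)
    and compute one output value per position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # element-wise multiply, then sum -> one value per position
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# A hypothetical 5x5 input and the vertical-edge filter from above
image = np.arange(25).reshape(5, 5)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
print(conv2d_valid(image, kernel).shape)  # (3, 3) feature map
```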
After convolution produces a raw feature map, a nonlinear activation function such as ReLU is typically applied element-wise. The resulting activated feature map is then passed to the next layer.
Different filters detect different visual patterns, such as:
- Vertical edge detectors
- Horizontal edge detectors
- Averaging filters (blurring)
This implies that CNN filters can learn to identify informative visual features, including:
- edges
- corners
- textures
- shapes
Each filter tends to specialize in responding to a particular type of pattern present in the input.
In traditional image processing, filters such as horizontal edge detectors or vertical edge detectors are usually manually designed and remain fixed during computation.
In contrast, in a CNN, the filter weights are typically initialized randomly and are then learned automatically from data during training through backpropagation and gradient-based optimization.
As training progresses, the network adjusts these filter weights so that they become effective feature detectors for the target task.
In general:
- earlier convolutional layers tend to learn low-level features such as edges, lines, and color contrasts
- deeper convolutional layers tend to learn higher-level and more abstract features such as textures, object parts, and complex shapes
Stride means how many pixels the filter moves at each step.
Stride = 1 (move one pixel at a time):
- more overlap between neighboring patches
- more computation
- larger output
Stride = 2 (move two pixels at a time):
- less overlap
- less computation
- smaller output
A higher stride produces a smaller output, so increasing the stride is a form of downsampling.
Without padding, the output becomes smaller than the input because the filter cannot fully cover the border pixels.
Valid padding (no padding):
- no padding added
- output is smaller than the input
Example:
- input: 5 × 5
- filter: 3 × 3
- output: 3 × 3
Same padding (zero padding):
- add zeros around the border before convolution
- output size is preserved (for stride 1)
Example:
- input: 5 × 5
- filter: 3 × 3
- output: 5 × 5
So padding helps preserve border information and control output size.
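As a quick check of this behaviour, here is a small sketch using TensorFlow/Keras (the framework is an assumption; the notes do not prescribe one) comparing the two padding modes:

```python
import tensorflow as tf

x = tf.random.normal((1, 5, 5, 1))   # a batch of one 5x5 single-channel image

valid = tf.keras.layers.Conv2D(1, kernel_size=3, padding="valid")(x)
same  = tf.keras.layers.Conv2D(1, kernel_size=3, padding="same")(x)

print(valid.shape)  # (1, 3, 3, 1): without padding the output shrinks
print(same.shape)   # (1, 5, 5, 1): zero padding preserves the 5x5 size (stride 1)
```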
A CNN processes images in a 2D grid. The filter slides both horizontally (Left → Right) and vertically (Top → Bottom).
The output length along one dimension is given by:

$\text{Output} = \left\lfloor \dfrac{\text{Input} + 2 \times \text{Padding} - \text{Filter}}{\text{Stride}} \right\rfloor + 1$

This formula is applied independently to both dimensions:
- To find the Output Height: plug in the Input Height
- To find the Output Width: plug in the Input Width
Example:
If your input is 5×5 pixels, and the math tells you the output length is 3, then your full output feature map is 3×3 (3 rows × 3 columns).
Think of the sliding filter as walking along a path. Here is how the parts correspond to the movement:
Before the filter can slide, we must determine how much room it has.
- Start with the input size
- Add Padding (extends the space on both sides: left+right or top+bottom)
- Subtract Filter Size because the filter occupies space; you cannot place it such that it hangs off the edge
This gives you the total distance available for movement.
Now, divide that available distance by the Stride (how many pixels you jump at a time).
- If Stride = 1, you take many small steps
- If Stride = 2, you skip pixels
- We use floor brackets ($\lfloor \dots \rfloor$) because partial steps at the end do not count as valid output positions
Finally, add 1 to the count.
- Because taking 0 steps still means the filter can be placed at 1 valid starting position
- In math terms: Total Positions = Total Steps + 1
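Putting the pieces together, a small helper (illustrative only) that evaluates the output-size formula:

```python
import math

def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output length along one dimension:
    floor((input + 2*padding - filter) / stride) + 1."""
    return math.floor((input_size + 2 * padding - filter_size) / stride) + 1

print(conv_output_size(5, 3, padding=0, stride=1))  # 3  (5x5 input, 3x3 filter, no padding)
print(conv_output_size(5, 3, padding=1, stride=1))  # 5  ("same"-style padding preserves the size)
print(conv_output_size(5, 3, padding=0, stride=2))  # 2  (a larger stride downsamples)
```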
A convolutional layer usually uses multiple filters, not just one.
Each filter is applied across the input, and each filter produces one feature map.
Therefore, if a layer uses K filters, the output will contain K feature maps stacked along the channel dimension.
Number of filters = output depth (number of output channels)
In a filter of shape height × width × depth, the third dimension is the depth (or channel dimension) of the filter, and it must match the number of channels in the input.
For example, if the input is an RGB image of shape $H \times W \times 3$ (3 channels), then a single 3 × 3 filter actually has shape $3 \times 3 \times 3$.
This is considered one single filter, not three separate filters.
The three channel slices correspond to the three input channels (e.g., Red, Green, Blue), and their contributions are summed together to produce one output value at each spatial location.
Suppose the input has shape $H \times W \times 3$, and the convolutional layer uses 4 filters, each of size $3 \times 3 \times 3$.
Then:
- each filter spans the full input depth of 3
- each filter produces one feature map
- using 4 filters produces 4 output feature maps
So the output depth becomes 4, and the output shape is approximately $H' \times W' \times 4$, where $H'$ and $W'$ are the output height and width given by the output size formula (they depend on the filter size, stride, and padding).
In a CNN, the output depth of one layer becomes the input depth of the next layer.
Therefore, the depth of each filter in the next layer must match the number of output channels produced by the previous layer.
For example:
- if one layer outputs 32 feature maps
- then the next layer receives input depth 32
- so each filter in the next layer must have depth 32
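A short Keras sketch (framework assumed for illustration) showing that the number of filters sets the output depth, while each filter's depth is forced to match the input channels:

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))    # a hypothetical RGB input: 32 x 32 x 3

# 4 filters; each filter automatically gets depth 3 to match the input channels
layer1 = tf.keras.layers.Conv2D(filters=4, kernel_size=3, padding="same")
y1 = layer1(x)
print(y1.shape)                          # (1, 32, 32, 4): output depth = number of filters
print(layer1.get_weights()[0].shape)     # (3, 3, 3, 4): (height, width, input depth, filters)

# The next layer's filters must span the 4 channels produced above
layer2 = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding="same")
y2 = layer2(y1)
print(layer2.get_weights()[0].shape)     # (3, 3, 4, 8)
```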
The term depth can be confusing in CNNs because it may refer to two different things:
- Filter depth: the number of input channels that a filter spans
- Output depth: the number of output feature maps produced by a layer
The number of filters in a convolutional layer is chosen by the CNN designer.
It is an architectural hyperparameter and is often selected based on prior knowledge, common design patterns, or empirical experimentation.
In contrast, the depth of each filter is not chosen independently.
It is constrained by the input to that layer and must match the number of input channels, which is usually the output depth of the previous layer.
Pooling is a downsampling operation commonly used in CNNs to reduce the spatial dimensions (height and width) of feature maps.
Convolution layers may preserve the spatial size of the input (for example, when using same padding and stride 1), so pooling is a common way to reduce the spatial dimensions. Although convolution can also reduce spatial size when using larger strides or no padding, pooling is often used as an explicit downsampling operation.
In standard pooling operations, pooling is applied independently to each feature map. Therefore, pooling reduces the spatial dimensions (height and width) but preserves the depth (number of channels).
Pooling provides several advantages:
- reduces the spatial size of feature maps
- reduces computation in later layers
- reduces the number of parameters in later layers
- increases the effective receptive field (See 8.5) of later neurons
- provides some robustness to small translations in the input
Max pooling divides the feature map into small regions and outputs the maximum value from each region.
For example, using a 2 × 2 pooling window with stride = 2:
Input
| 1 | 3 | 2 | 4 |
| 5 | 6 | 1 | 2 |
| 3 | 2 | 7 | 8 |
| 1 | 0 | 3 | 4 |
The 2 × 2 pooling regions are:
- top-left block:
| 1 | 3 |
| 5 | 6 |
- top-right block:
| 2 | 4 |
| 1 | 2 |
- bottom-left block:
| 3 | 2 |
| 1 | 0 |
- bottom-right block:
| 7 | 8 |
| 3 | 4 |
Take the maximum from each block:
- top-left block: max(1, 3, 5, 6) = 6
- top-right block: max(2, 4, 1, 2) = 4
- bottom-left block: max(3, 2, 1, 0) = 3
- bottom-right block: max(7, 8, 3, 4) = 8
Output after Max Pooling
| 6 | 4 |
| 3 | 8 |
Max pooling keeps the strongest activation in each region.
Average pooling divides the feature map into small regions and outputs the average value from each region.
Using the same 2 × 2 pooling window with stride = 2 on the same input example of Max Pooling:
We take the average from each 2 × 2 block:
- top-left block: (1 + 3 + 5 + 6) / 4 = 3.75
- top-right block: (2 + 4 + 1 + 2) / 4 = 2.25
- bottom-left block: (3 + 2 + 1 + 0) / 4 = 1.5
- bottom-right block: (7 + 8 + 3 + 4) / 4 = 5.5
Output after Average Pooling
| 3.75 | 2.25 |
| 1.5 | 5.5 |
Average pooling keeps the average response in each region instead of the strongest one.
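Both pooling variants can be sketched in a few lines of NumPy (illustrative only), reproducing the two examples above:

```python
import numpy as np

def pool2x2(feature_map, mode="max"):
    """2x2 pooling with stride 2, applied to a single feature map."""
    h, w = feature_map.shape
    # group the map into non-overlapping 2x2 blocks
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [3, 2, 7, 8],
              [1, 0, 3, 4]])

print(pool2x2(x, "max"))   # [[6 4] [3 8]]
print(pool2x2(x, "mean"))  # [[3.75 2.25] [1.5  5.5 ]]
```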
One important benefit of pooling, especially max pooling, is that it provides some robustness to small shifts in the input.
If a strong feature moves slightly within the same pooling window, the pooled output may remain unchanged.
For example:
- if a high activation value appears at one position inside a pooling region
- and then shifts by 1 pixel but still stays inside that same region
- max pooling may still output the same maximum value
This gives the network a degree of translation invariance (more precisely, robustness to small translations).
Common Pooling Settings
Typical pooling settings in CNNs are:
- Pool size: 2 × 2
- Stride: 2
- Type: Max pooling
Global Average Pooling (GAP) is a special form of average pooling where the pooling window covers the entire spatial dimension of each feature map.
For example, suppose a layer outputs 512 feature maps, each of spatial size $H \times W$. Applying global average pooling over the entire $H \times W$ area of each map means:
- each of the 512 feature maps is reduced to a single average value
- the final output contains 512 numbers
Global Average Pooling is often used near the end of a CNN, before the final classification layer.
- greatly reduces the number of parameters
- avoids flattening a large feature map into a huge vector
- commonly used before the final dense / softmax classification layer
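A short Keras sketch (the framework and the 7 × 7 × 512 input shape are assumptions for illustration) contrasting GAP with flattening:

```python
import tensorflow as tf

# Hypothetical final feature maps: 7x7 spatial size, 512 channels
features = tf.random.normal((1, 7, 7, 512))

gap = tf.keras.layers.GlobalAveragePooling2D()(features)
print(gap.shape)    # (1, 512): one average value per feature map

flat = tf.keras.layers.Flatten()(features)
print(flat.shape)   # (1, 25088): flattening the same volume gives a far longer vector
```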
The receptive field refers to the spatial extent of the portion of the original input image that a specific feature (neuron) in a deeper layer is "looking at."
While a single filter only covers a small patch (e.g., $3 \times 3$ pixels), the receptive field grows as layers are stacked, mainly in two ways:
- Stacking Layers: Each consecutive convolution adds to the total area covered. For example, two stacked $3 \times 3$ layers provide a $5 \times 5$ receptive field.
- Pooling & Stride: These are the most aggressive ways to increase the receptive field. By downsampling the data (e.g., $2 \times 2$ Max Pooling), you effectively "zoom out," allowing a small filter in the next layer to see a much larger percentage of the original image.
Tutor Insight:
> A large receptive field is critical for tasks like Object Detection. If a neuron's receptive field is only $10 \times 10$ pixels, it might see a "yellow patch" but won't have enough context to know if that patch belongs to a "school bus" or a "banana." Deep layers need a large receptive field to "see" the whole object.
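The growth of the receptive field can be tracked layer by layer with the standard recurrence $r_l = r_{l-1} + (k_l - 1)\cdot j_{l-1}$, where $j$ is the product of the strides of all earlier layers. A small illustrative sketch:

```python
def receptive_field(layers):
    """Receptive field of the last layer, given (kernel_size, stride) per layer."""
    rf, jump = 1, 1                      # start: one input pixel, step of 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump        # each layer widens the field
        jump *= stride                   # strides/pooling multiply future growth
    return rf

print(receptive_field([(3, 1), (3, 1)]))           # 5: two stacked 3x3 convs see 5x5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 8: 2x2 stride-2 pooling grows it faster
```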
A CNN is typically divided into two distinct stages:
- Feature Extraction
- Classification
The data flows sequentially:
Input Image → Feature Extraction → Flatten → Classification
This stage consists of repeated blocks of operations. A standard block usually follows the pattern: Conv + ReLU + Pool.
Operations within a block:
| Layer | Operation | Effect |
|---|---|---|
| Conv | Apply filters | Detect features |
| ReLU | Non-linearity | Enable complex patterns |
| Pool | Downsample | Reduce size, add invariance |
Changes to the data shape: As the signal passes through multiple Conv + Pool blocks, a specific structural pattern emerges:
- Spatial dimensions decrease: Height and width are halved (roughly) at each block.
- Number of filters increases: The depth is doubled (roughly) at each block.
- Feature Abstraction: The network builds more abstract features as it goes deeper.
After the feature extraction stage, a Flatten layer is used. This converts the multi-dimensional feature maps (3D) into a single long vector (1D) to serve as input for the next stage.
How Flattening works: The layer "unrolls" the 3D volume. The feature maps (channels) are stacked end-to-end to form a single continuous list of numbers.
Note: The exact order—whether it goes row-by-row then channel-by-channel, or pixel-by-pixel across channels—depends on the specific code framework, but conceptually, all values are retained.
Formula (Vector Length): Although the values are only rearranged (stacked), the total length of the resulting 1D vector is the product of the dimensions:

$\text{Vector length} = \text{Height} \times \text{Width} \times \text{Depth}$

Example: If the final feature maps have shape 4 × 4 and there are 512 filters (depth):
- The 3D input is $4 \times 4 \times 512$
- The 1D output vector length is $8{,}192$ (derived from $4 \times 4 \times 512$)
This long vector of 8,192 values is then passed to the classification stage.
The final stage is responsible for assigning the input to a class. It uses Dense (Fully Connected) layers.
- Dense + ReLU: Further processing of the flattened features.
- Dense (Softmax): The final layer that outputs the classification probabilities.
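A minimal end-to-end sketch of this two-stage structure in Keras (the 32 × 32 × 3 input size, layer widths, and 10 classes are assumptions for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),              # assumed 32x32 RGB input

    # Feature extraction: repeated Conv + ReLU + Pool blocks
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),                          # 32x32x32 -> 16x16x32
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),                          # 16x16x64 -> 8x8x64

    # Flatten: 8x8x64 -> vector of 4,096 values
    layers.Flatten(),

    # Classification: Dense + ReLU, then Dense + Softmax
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),          # assumed 10 classes
])
model.summary()
```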
Training very deep neural networks is notoriously difficult. While theoretical logic suggests that adding more layers should allow a network to learn more complex features, in practice, standard "plain" networks suffer from the Degradation Problem.
In traditional ("plain") networks without skip connections:
- Vanishing/Exploding Gradients: As gradients are backpropagated through many layers, they can shrink to zero or explode to infinity, making convergence difficult.
- Performance Degradation: As you increase the number of layers (e.g., from 20 to 56), the training error often increases. This is not overfitting (where training error is low and test error is high); rather, the optimization algorithm simply struggles to train the deeper network effectively.
ResNet (Residual Network) solves this by introducing Skip Connections (also called Shortcuts).
A ResNet is built by stacking many Residual Blocks. A residual block consists of a "Main Path" and a "Shortcut".
The Main Path: the information flows through a standard stack of layers (typically Convolution → ReLU → Convolution). Let the input to a block be $a^{[l]}$.
The Shortcut: the input $a^{[l]}$ is passed directly around the main path and added to its output just before the final activation.
In a standard (plain) network, the flow is:

$a^{[l+2]} = g(z^{[l+2]})$

In a Residual Block, the flow is:

$a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$
Where:
- $a^{[l]}$ is the activation from the previous block
- $z^{[l+2]}$ is the output of the current block's linear operations (weights and bias)
- $g$ is the activation function (usually ReLU)
- the addition $(z^{[l+2]} + a^{[l]})$ happens element-wise
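A minimal Keras sketch of such a block (the layer sizes are assumptions, and batch normalization is omitted for brevity):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """An identity residual block: a main path of two convolutions,
    plus a shortcut that adds the input back before the final ReLU."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)     # z^[l+2]
    y = layers.Add()([y, shortcut])                       # z^[l+2] + a^[l]
    return layers.Activation("relu")(y)                   # g(z^[l+2] + a^[l])

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)   # filters match the input depth, so shapes line up
model = tf.keras.Model(inputs, outputs)
```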
The core reason ResNets work is that they make it very easy for a block to learn the Identity Function.
Imagine we have a deep network, and we add extra layers to it. In a worst-case scenario, these new layers should simply do nothing (copy the input to the output) so that they don't hurt the performance.
In a Plain Network:
For the layers to learn the identity function ($a^{[l+2]} = a^{[l]}$), the weights must be tuned so that the stacked layers reproduce their input exactly, which turns out to be hard for the optimizer.
In a ResNet: Look at the formula again: $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$
If the regularizer (like L2 regularization) shrinks the weights toward zero, then $z^{[l+2]} \approx 0$.
Assuming ReLU activation (where $g(x) = x$ for $x \ge 0$), the output becomes $a^{[l+2]} \approx g(a^{[l]}) = a^{[l]}$, so the block simply passes its input through.
Conclusion:
- It is easy for the network to set the weights to zero and turn the block into an identity mapping ($F(x) = 0$).
- This guarantees that a deeper network will perform at least as well as a shallower network.
- The network can then learn to extract new features only if they actually improve performance (learning the "residual").
To perform the element-wise addition $z^{[l+2]} + a^{[l]}$, the two tensors must have the same dimensions (height, width, and number of channels).
Case 1: Same Convolutions (Dimensions Preserved) If the block uses "Same" padding and stride 1, the dimensions match automatically. The addition is straightforward.
Case 2: Pooling or Stride > 1 (Dimensions Change)
If the height/width decreases or the number of channels changes inside the block, we cannot add $a^{[l]}$ to $z^{[l+2]}$ directly.
There are two common strategies applied to the shortcut:
- Learned Matrix ($W_s$): Use a $1 \times 1$ convolution (represented as matrix $W_s$) to linearly project $a^{[l]}$ to the correct shape.
- Zero Padding: Simply pad the input $a^{[l]}$ with zeros to match the dimensions (no extra parameters to learn).
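A sketch of the first strategy, using a $1 \times 1$ convolution ($W_s$) on the shortcut when the main path downsamples and changes depth (illustrative Keras code; the sizes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block_projection(x, filters, stride=2):
    """Residual block where the main path downsamples and changes depth,
    so the shortcut uses a 1x1 convolution (W_s) to match dimensions."""
    y = layers.Conv2D(filters, 3, strides=stride, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)

    # Shortcut: a 1x1 conv with the same stride projects a^[l] to the new shape
    shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)

    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)

x = tf.keras.Input(shape=(32, 32, 64))
out = residual_block_projection(x, 128)   # 32x32x64 -> 16x16x128
print(out.shape)                          # (None, 16, 16, 128)
```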
- Plain Networks: Deeper $\neq$ Better. Error increases as depth increases.
- ResNets: Deeper $=$ Better. Because the "degradation" problem is solved via identity mapping, we can train networks with 100+ or even 1000+ layers, and the error continues to decrease.
Deep neural networks suffer from a Data Hunger Problem.
- Small datasets lead to overfitting.
- More data leads to better generalization.
- However, collecting and labeling new data is often expensive and time-consuming.
Data Augmentation solves this by creating more training examples artificially from the existing data. It relies on the concept that semantic meaning is invariant to transformation:
- A flipped cat is still a cat.
Augmentation effectively increases the dataset size dramatically without collecting new images.
| Scenario | What happens? | Consequence |
|---|---|---|
| Without Augmentation | The model sees the same exact images every epoch. | It learns to memorize specific pixel patterns. High risk of overfitting. |
| With Augmentation | The model sees different versions of the image each epoch. | It is forced to learn robust features rather than specific pixels. Acts as a form of regularization. |
Different transformations simulate different real-world variations.
| Augmentation | Typical Range | Notes |
|---|---|---|
| Rotation | | Depends on object orientation (e.g., a tree vs. a ball). |
| Width/Height Shift | | Simulates the object moving position within the frame. |
| Horizontal Flip | On/Off | Great for animals/objects. Not for text or asymmetric objects. |
| Zoom | | Simulates camera distance. |
| Brightness | | Simulates different lighting conditions. |
A critical rule in deep learning pipelines:
- Apply augmentation ONLY during training.
  - Training: we want diversity and difficulty to learn robust features.
- NEVER augment validation or test data.
  - Validation/Testing: we want to evaluate performance on the "real" unmodified data to get accurate metrics.
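One common way to enforce this rule is with Keras preprocessing layers, which only transform images in training mode (a sketch with assumed ranges; tune them for your task):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Augmentation layers (the ranges here are assumptions; tune them for your task)
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.05),   # roughly +/- 18 degrees
    layers.RandomZoom(0.1),
])

inputs = tf.keras.Input(shape=(224, 224, 3))
# Active only in training mode: during evaluation and prediction these layers
# pass images through unchanged, so validation/test data stays unmodified.
x = data_augmentation(inputs)
# ... the convolutional base and classification head would follow here
```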
Do:
- Start with mild augmentation; increase the intensity if the model is still overfitting.
- Combine multiple augmentations (e.g., rotate AND change brightness).
- Choose transforms that are appropriate for the specific task.
Don't:
- Do not use transforms that change the label.
  - Example: Do not vertically flip the digit "6" (it becomes "9").
  - Example: Do not horizontally flip text recognition data.
- Do not over-augment (distort images so much that features are destroyed).
Transfer Learning is a technique that leverages knowledge learned by a model trained on a massive dataset and applies it to a new, often smaller, target task. It avoids the need to train a complex network from scratch.
The process follows two primary steps:
- Pre-training: A CNN is first trained on a very large dataset (e.g., ImageNet with 1M+ images). In this phase, the network learns general features such as edges, textures, shapes, and object parts.
- Adaptation: The pre-trained weights are transferred to your new model. You modify the final layers to match your specific classes and train only these new layers (or fine-tune the whole model) using your smaller dataset (often just a few hundred images).
Benefits:
- Works effectively with small datasets.
- Trains much faster (in minutes rather than days).
- Often achieves higher accuracy than training from scratch due to better feature initialization.
CNN layers learn features hierarchically. The deeper you go, the more abstract and specific the features become.
| Layer Stage | Features Learned | Transferability |
|---|---|---|
| Early | Edges, colors | Very general (Always useful) |
| Middle | Textures, patterns | Fairly general |
| Late | Object parts (e.g., eyes, wheels) | Somewhat task-specific |
| Final | Full objects (e.g., cats, cars) | Task-specific (Usually not reused) |
Because early and middle layers detect fundamental visual structures, they work well for almost any image task.
To apply transfer learning, follow this standard workflow:
- Load a pre-trained model: Load weights from a model trained on a large dataset like ImageNet.
- Remove original classification layers: Discard the dense layers at the top that were specific to the original task (e.g., the original 1000 ImageNet classes).
- Add new classification head: Build new layers for your task. This typically includes:
  - Global Average Pooling (to reduce spatial dimensions).
  - One or two Dense layers.
  - A final Softmax layer matching your number of classes.
- Train: Depending on the strategy (see below), freeze some layers and train the rest.
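A sketch of this workflow in Keras, using MobileNetV2 as an assumed base model and 5 target classes as an assumed example (feature-extraction strategy shown):

```python
import tensorflow as tf
from tensorflow.keras import layers

# 1. Load a pre-trained model (ImageNet weights) without its original classifier
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# 2. Freeze the pre-trained convolutional base (feature extraction strategy)
base.trainable = False

# 3. Add a new classification head (5 target classes assumed here)
inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(5, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 4. Train only the new head:
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```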
There are two main strategies for adapting a pre-trained model:
In this approach, the pre-trained layers act solely as fixed feature extractors.
- Action: Freeze the pre-trained convolutional base (set trainable = False).
- Training: Only train the new classifier head.
- Use Case: Best when your dataset is tiny and similar to ImageNet.
- Pros: Very fast to train; low risk of overfitting the base weights.
In this approach, we adjust the pre-trained weights to adapt them specifically to our data distribution.
- Action: Unfreeze some or all of the convolutional layers.
- Training: Train the entire network with a small learning rate (to make small adjustments to already good weights).
- Use Case: Better if you have more data, especially if your data is different from ImageNet.
- Why Small LR? Prevents destroying the generic features learned during pre-training.
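Continuing the feature-extraction sketch above, fine-tuning might look like this (the number of unfrozen layers and the learning rate are assumptions to tune):

```python
import tensorflow as tf

# Unfreeze the base (or only its top layers) and retrain gently
base.trainable = True
for layer in base.layers[:-30]:   # keep earlier, more generic layers frozen (assumed cutoff)
    layer.trainable = False

# Recompile with a much smaller learning rate so the pre-trained weights
# are only nudged, not destroyed
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```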
The choice between Feature Extraction and Fine-Tuning depends on your data volume and similarity to the source domain.
| Your Data | Amount | Recommended Approach |
|---|---|---|
| Similar to ImageNet | Small (100s) | Feature Extraction |
| Similar to ImageNet | Large (10K+) | Fine-tune Top Layers |
| Different from ImageNet | Small | Feature Extraction + Heavy Augmentation |
| Different from ImageNet | Large | Fine-tune Whole Network |
Guideline: Start with Feature Extraction. If performance plateaus, try Fine-Tuning.