AlexNet Profiling
Note: This page is under development.
Why bother with hardware acceleration?
It's well understood at this point that ML/AI algorithms benefit from hardware acceleration, which I would assert is driving the development of domain-specific architectures. These aren't new ideas; Google has been discussing them and working to deploy hardware solutions in its data centers since 2006 [1]. Although it's already well known that the most computationally intensive tasks in ML/AI algorithms are matrix operations, this project is a self-study of the AlexNet architecture [2] and a profiling exercise on inference-time image classification. The scripts discussed below focus on profiling inference time (i.e., throughput and latency), arithmetic intensity, and GPU usage statistics (via the PyTorch Profiler).
Lastly, the curious reader can apply this general approach to determine whether hardware acceleration is warranted for other workloads.
Background
The following sections give some background context and report my results from each focus area.
The overarching goal of this project is to create a hardware accelerator for inference tasks in CNNs. Before jumping into this development task, we need to answer the question of why we should bother with hardware acceleration at all. As an exercise, I will answer this question through a few experiments, even though it is well known why such hardware is worth developing. I'll use the Heilmeier questions to evaluate whether hardware acceleration for CNNs is worth pursuing, and I'll discuss my answers in the results section.
- What are you trying to do? Articulate your objectives using absolutely no jargon.
- How is it done today, and what are the limits of current practice?
- What is new in your approach and why do you think it will be successful?
- Who cares? If you are successful, what difference will it make?
- What are the risks?
- How much will it cost?
- How long will it take?
- What are the mid-term and final “exams” to check for success?
Experimental Setup and Results
Note: This section outlines the performance characteristics of CNNs and establishes a baseline for hardware acceleration development. I implemented three Python scripts, each capturing a different aspect of performance; they are discussed below. My system uses an NVIDIA RTX 3050 GPU, which was used for all inference tasks. The interested reader can check out the code and run it locally; please see this README for details.
1. Inference Performance Benchmarking
I implemented a benchmarking script for the AlexNet architecture. This benchmark measures inference performance with a focus on throughput, latency, and accuracy metrics.
Methodology
The benchmarking script was implemented using PyTorch, with the pre-trained AlexNet model (AlexNet_Weights.DEFAULT) performing inference on the ImageNet validation dataset. The system captures key performance metrics including:
- Total inference time
- Throughput (images processed per second)
- Batch inference latency (average, min, max, standard deviation)
- Inference accuracy
The benchmark was executed on NVIDIA GPU hardware with CUDA acceleration enabled, processing 50 batches with a batch size of 32 (1,600 images total).
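For reference, the sketch below shows the shape of the timing loop the benchmark follows. It is a minimal version assuming torchvision's pretrained AlexNet, with random tensors standing in for preprocessed images (the actual script iterates over the ImageNet validation set and also tracks accuracy), so its reported numbers will differ from the results shown next.

```python
import time

import torch
from torchvision.models import alexnet, AlexNet_Weights

# Load the pretrained model and move it to the GPU (fall back to CPU if CUDA is absent).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = alexnet(weights=AlexNet_Weights.DEFAULT).to(device).eval()

batch_size, num_batches = 32, 50
batch_times = []

with torch.no_grad():
    for _ in range(num_batches):
        # Stand-in input batch; the real script feeds preprocessed ImageNet validation images.
        images = torch.randn(batch_size, 3, 224, 224, device=device)
        if device.type == "cuda":
            torch.cuda.synchronize()  # ensure timing brackets only this batch's GPU work
        start = time.perf_counter()
        _ = model(images)
        if device.type == "cuda":
            torch.cuda.synchronize()
        batch_times.append(time.perf_counter() - start)

total = sum(batch_times)
print(f"Total images processed: {batch_size * num_batches}")
print(f"Throughput: {batch_size * num_batches / total:.2f} images/second")
print(f"Average batch inference time: {1000 * total / num_batches:.2f} ms")
```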
Results
==================================================
BENCHMARK RESULTS: alexnet on cuda
==================================================
Total images processed: 1600
Total inference time: 1.16 seconds
Throughput: 1375.37 images/second
Average batch inference time: 23.27 ms
Standard deviation of batch time: 72.24 ms
Min batch time: 12.00 ms
Max batch time: 528.88 ms
Average accuracy: 73.94%
==================================================
These results highlight several important observations:
- High Throughput Potential: With GPU acceleration, AlexNet achieves a throughput of approximately 1,375 images per second, demonstrating the effectiveness of parallel processing for CNN workloads. Ideally, a custom hardware accelerator would exceed this GPU performance.
- Latency Variability: The substantial difference between minimum (12.00 ms) and maximum (528.88 ms) batch times indicates performance inconsistency, likely due to initial GPU warm-up and memory transfer operations.
- Accuracy Preservation: The model maintained 73.94% accuracy on the ImageNet validation set, consistent with published benchmarks for AlexNet.
These performance measurements establish a baseline for comparison with potential custom hardware accelerators and highlight the computational demands of CNN inference. The high throughput achieved with GPU acceleration sets the bar for specialized hardware architectures, which should maintain or exceed this performance while potentially reducing power consumption, area, and/or cost.
2. Arithmetic Intensity Analysis
To further understand the computational requirements of AlexNet and identify opportunities for hardware acceleration, I analyzed the arithmetic intensity of each layer in the network. Arithmetic intensity is a critical metric that informs hardware design decisions by revealing whether a computation is compute-bound or memory-bound.
Background and Methodology
Arithmetic intensity is defined as the ratio of computational operations to memory operations for a given algorithm:
$$\text{Arithmetic Intensity} = \frac{\text{Number of Operations (FLOPs)}}{\text{Memory Accesses (Bytes)}}$$
This ratio helps identify whether a computation is limited by computational resources (high arithmetic intensity) or memory bandwidth (low arithmetic intensity). For convolutional layers, the FLOPs were calculated using:
$$\text{FLOPs}_{\text{conv}} = 2 \times H_{\text{out}} \times W_{\text{out}} \times C_{\text{out}} \times C_{\text{in}} \times K^2$$
Where:
- $H_{\text{out}}$ and $W_{\text{out}}$ are output height and width
- $C_{\text{out}}$ and $C_{\text{in}}$ are output and input channels
- $K$ is the kernel size
- The factor of 2 accounts for both multiplication and addition in each MAC operation
For fully connected or linear layers:
$$\text{FLOPs}_{\text{fc}} = 2 \times N_{\text{in}} \times N_{\text{out}}$$
Where $N_{\text{in}}$ and $N_{\text{out}}$ are the input and output feature dimensions.
Memory accesses were calculated considering both reads (input features, weights, biases) and writes (output features), with each element requiring 4 bytes (for 32-bit floating-point values):
$$\text{Memory}_{\text{conv}} = 4 \times (B \times H_{\text{in}} \times W_{\text{in}} \times C_{\text{in}} + C_{\text{out}} \times C_{\text{in}} \times K^2 + C_{\text{out}} + B \times H_{\text{out}} \times W_{\text{out}} \times C_{\text{out}})$$

$$\text{Memory}_{\text{fc}} = 4 \times (B \times N_{\text{in}} + N_{\text{in}} \times N_{\text{out}} + N_{\text{out}} + B \times N_{\text{out}})$$
Where $B$ is the batch size.
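As an illustration, the short sketch below turns these formulas into code. The example dimensions are those of AlexNet's first convolutional layer (3→64 channels, 11×11 kernel, 224×224 input, 55×55 output) at a batch size of 1, so the memory figure, and therefore the intensity, will differ from the table below, which was generated at the batch size used by the analysis script.

```python
BYTES_PER_ELEM = 4  # 32-bit floating-point values

def conv_flops(h_out, w_out, c_in, c_out, k):
    # Factor of 2 accounts for the multiply and the add in each MAC operation.
    return 2 * h_out * w_out * c_out * c_in * k ** 2

def conv_memory_bytes(h_in, w_in, h_out, w_out, c_in, c_out, k, b=1):
    # Reads (input features, weights, biases) plus writes (output features).
    elems = (b * h_in * w_in * c_in
             + c_out * c_in * k ** 2
             + c_out
             + b * h_out * w_out * c_out)
    return BYTES_PER_ELEM * elems

def fc_flops(n_in, n_out):
    return 2 * n_in * n_out

def fc_memory_bytes(n_in, n_out, b=1):
    elems = b * n_in + n_in * n_out + n_out + b * n_out
    return BYTES_PER_ELEM * elems

# Example: AlexNet's first convolutional layer at batch size 1.
flops = conv_flops(h_out=55, w_out=55, c_in=3, c_out=64, k=11)
mem = conv_memory_bytes(h_in=224, w_in=224, h_out=55, w_out=55, c_in=3, c_out=64, k=11)
print(f"FLOPs: {flops / 1e6:.2f} M, Memory: {mem / 1e6:.2f} MB, "
      f"Intensity: {flops / mem:.2f} FLOPs/Byte")
```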
Results
The arithmetic intensity analysis reveals significant differences between convolutional and fully connected layers in AlexNet:
===== ARITHMETIC INTENSITY ANALYSIS =====
Layer | FLOPs (M) | Memory (MB) | Arithmetic Intensity (FLOPs/Byte)
----------------------------------------------------------------------------------
Conv2d 0 | 140.55 | 13.54 | 10.38
Conv2d 3 | 447.90 | 1.98 | 226.66
Conv2d 6 | 224.28 | 3.05 | 73.65
Conv2d 8 | 299.04 | 3.97 | 75.28
Conv2d 10 | 199.36 | 2.71 | 73.66
Linear 0 | 75.50 | 151.06 | 0.50
Linear 1 | 33.55 | 67.16 | 0.50
Linear 2 | 8.19 | 16.41 | 0.50
These results highlight several key insights:
- Convolutional vs. Fully Connected Layers: Convolutional layers exhibit significantly higher arithmetic intensity (10.38-226.66 FLOPs/Byte) compared to fully connected layers (consistently 0.50 FLOPs/Byte). This stark difference explains why hardware accelerators often focus on optimizing convolutional operations.
- Layer-Specific Variations: The second convolutional layer (Conv2d 3) shows exceptionally high arithmetic intensity (226.66 FLOPs/Byte), indicating it is heavily compute-bound and would benefit most from computational acceleration.
- Memory Bottlenecks: Fully connected layers, with their low arithmetic intensity, are clearly memory-bound. This suggests that hardware accelerators should prioritize memory bandwidth and data reuse strategies for these layers.
This analysis explains why industry hardware accelerators like Google's TPU incorporate specialized matrix multiplication units with high data reuse for convolutional operations, while employing memory hierarchies and compression techniques to address the memory-bound nature of fully connected layers. The significant disparity in arithmetic intensity across different layers also motivates heterogeneous accelerator designs that can efficiently handle both compute-bound and memory-bound operations.
3. GPU Profiling Analysis
To gain deeper insights into the hardware utilization patterns during inference, I employed the PyTorch Profiler to analyze AlexNet's execution on GPU hardware. This profiling provides a detailed breakdown of time spent in various operations, helping identify potential bottlenecks and optimization opportunities.
Methodology
The profiling was performed on an NVIDIA GeForce RTX 3050 Laptop GPU using PyTorch's built-in profiling tools. The analysis captured both CPU and CUDA activities during a single inference pass with a batch size of 1. Prior to measurement, a warm-up pass was executed to ensure initialization overhead didn't affect the results.
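A minimal sketch of this setup is shown below. It assumes the torch.profiler API and a random tensor in place of a preprocessed image; the forward pass is wrapped in a record_function block labeled model_inference, which is the name that appears in the results table.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function
from torchvision.models import alexnet, AlexNet_Weights

device = torch.device("cuda")
model = alexnet(weights=AlexNet_Weights.DEFAULT).to(device).eval()
x = torch.randn(1, 3, 224, 224, device=device)  # batch size of 1

with torch.no_grad():
    model(x)  # warm-up pass so initialization overhead is excluded from the measurement

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with record_function("model_inference"):
            model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```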
Hardware Configuration
===== GPU DETECTION DIAGNOSTICS =====
PyTorch version: 2.6.0+cu124
CUDA available: True
Number of CUDA devices: 1
CUDA Device 0:
Name: NVIDIA GeForce RTX 3050 Laptop GPU
Capability: (8, 6)
Total memory: 4.00 GB
Results
The profiling results reveal the distribution of computation time across various operations:
===== PYTORCH PROFILER RESULTS =====
Profiling on device: cuda
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------------------ ------------ ------------ ------------ ------------ ------------ ------------
model_inference 20.92% 1.288ms 86.02% 5.298ms 5.298ms 1
aten::conv2d 0.71% 43.678us 54.48% 3.355ms 671.037us 5
aten::convolution 0.87% 53.355us 53.77% 3.312ms 662.302us 5
aten::_convolution 2.50% 153.839us 52.90% 3.258ms 651.631us 5
aten::cudnn_convolution 27.89% 1.717ms 47.16% 2.904ms 580.900us 5
cudaEventRecord 0.36% 22.047us 0.36% 22.047us 4.409us 5
cudaStreamIsCapturing 0.08% 5.219us 0.08% 5.219us 0.870us 6
cudaStreamGetPriority 0.06% 3.753us 0.06% 3.753us 0.751us 5
cudaDeviceGetStreamPriorityRange 0.05% 3.196us 0.05% 3.196us 0.639us 5
cudaLaunchKernel 9.01% 554.875us 9.01% 554.875us 18.496us 30
These results highlight several key insights:
- Convolution Dominance: Convolutional operations account for over 54% of the total CPU time, confirming their significance in the overall computation. The CUDA implementation of convolution (aten::cudnn_convolution) alone consumes 27.89% of the self CPU time.
- GPU Memory Operations: Memory allocation (cudaMalloc) takes a significant 12.59% of self CPU time, indicating that memory management is a non-trivial component of inference latency.
- Kernel Launch Overhead: The cudaLaunchKernel operation accounts for 9.01% of self CPU time across 30 calls, suggesting that kernel launch overhead is a considerable factor in overall performance.
- Layer Distribution: The CUDA profiling confirms the execution of 5 convolutional layers, which matches AlexNet's architecture as shown in the model exploration section.
These profiling results align with the arithmetic intensity analysis, confirming that convolutional operations dominate the computational landscape of AlexNet. The significant time spent in GPU memory operations and kernel launches also suggests that optimizing data movement and kernel scheduling could yield substantial performance improvements in hardware acceleration designs.
This detailed breakdown provides valuable insights for answering the Heilmeier questions related to current limitations and potential innovation opportunities in hardware acceleration for CNNs.
Summary of Results and Heilmeier Revisited
Based on my analysis of AlexNet, convolutional operations dominate the computational landscape with significantly higher arithmetic intensity (up to 226.66 FLOPs/Byte) compared to fully connected layers (0.50 FLOPs/Byte) and account for over 54% of total execution time. The compute-bound nature of convolutions, particularly in middle layers, presents the optimal target for hardware acceleration development. GPU profiling further confirms this assertion, revealing that CUDA convolution implementations consume nearly 28% of self CPU time. Therefore, my hardware accelerator development will focus on optimizing convolutional operations, which offer the greatest potential for performance improvements while addressing the most computationally intensive component of CNN inference.
Heilmeier Questions: Coming back to the questions originally proposed:
- What are you trying to do? I am developing a specialized hardware chip that can perform the mathematical calculations required for image recognition tasks much faster, and with less energy, than GPUs.
- How is it done today, and what are the limits of current practice? Today, these tasks are primarily run on GPUs, which, while effective, consume significant power and face computational bottlenecks when processing the convolutional operations that comprise the majority of CNN computation time.
- What is new in your approach, and why do you think it will be successful? My approach will create hardware optimized for convolutional operations, focusing on the exact mathematical patterns these networks use rather than general-purpose computing.
- Who cares? Companies deploying AI systems in data centers, edge devices, and mobile applications care deeply about reducing energy costs and increasing processing speed, which my accelerator would directly address.
- What are the risks? The primary risks include: (1) difficulty competing with highly optimized commercial GPUs that have undergone years of development specifically for AI workloads; (2) architectural limitations when scaling from simple designs to a complete, functional accelerator; (3) software compatibility challenges that could limit adoption even with better hardware performance; and (4) finding the balance between specialization for convolutional operations and enough flexibility to support evolving neural network architectures.
- How much will it cost?
- $0 for design and simulation using Verilator
- $100-$300 FPGA board for prototyping and validation
- ~$300 for physical manufacturing from Tiny Tapeout
- How long will it take? An initial proof of concept will be completed in about 7 weeks.
- Week 1-2: Architecture design and hardware specification; I'll draw on existing literature and designs for this.
- Week 3-5: HDL implementation of the core convolution accelerator components, likely consisting of a systolic array of parallel processing elements and a convolution-to-column transformation (see the sketch after this list).
- Week 6-7: Simulation, testing, and documentation of results.
- What are the mid-term and final "exams" to check for success?
- Short-term success metric: Valid design with working proof of concept
- Mid-term success metric: Simulation results showing an improvement in image-recognition throughput over GPUs
- Long-term success metric: Validation that the design can achieve better power efficiency than GPUs
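To clarify the convolution-to-column transformation mentioned in the schedule above, the NumPy sketch below illustrates the idea (it is an illustration of the concept, not the accelerator design itself): input patches are unrolled into the columns of a matrix so the convolution becomes a single matrix multiplication, which is exactly the kind of operation a systolic array of processing elements handles well.

```python
import numpy as np

def im2col(x, k, stride=1):
    """Unroll k x k patches of x (shape C x H x W) into the columns of a matrix."""
    c, h, w = x.shape
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    cols = np.empty((c * k * k, h_out * w_out), dtype=x.dtype)
    col = 0
    for i in range(0, h - k + 1, stride):
        for j in range(0, w - k + 1, stride):
            cols[:, col] = x[:, i:i + k, j:j + k].ravel()
            col += 1
    return cols

# The convolution then reduces to one matrix multiply (a GEMM).
x = np.random.rand(3, 8, 8).astype(np.float32)      # C_in = 3, 8x8 input
w = np.random.rand(4, 3, 3, 3).astype(np.float32)   # C_out = 4, C_in = 3, 3x3 kernels
out = w.reshape(4, -1) @ im2col(x, k=3)              # shape (C_out, H_out * W_out)
out = out.reshape(4, 6, 6)                           # fold back into the spatial layout
```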
References:
[1] - N. P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," in 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, June 26, 2017, pp. 1-12. [Online]. Available: https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf, Accessed on: Apr. 14, 2025.
[2] - A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, May 2017. doi:10.1145/3065386