GEMMUnitProject
This page describes my project goals and reports the results achieved through hardware synthesis. Physical hardware was not manufactured, so these are baseline measurements used to judge whether my design has the potential to beat software implementations. The results are therefore provided as a proof of concept rather than a full implementation. For comparison, I wrote a Python script that performs operations similar to my hardware design. All hardware metrics come from the OpenLane2 final synthesis report.
Project description: This project is part of a coding challenge to create an accelerator for an AI/ML workload or algorithm. I designed two prototypes of a parallel processing array: one for 8-bit floating-point operations and one for 8-bit integer operations. At this link is a description of the floating-point format I used. The workload I set out to accelerate is CNNs, but this project focuses solely on the processing array, under the assumption that an im2col transformation has already been applied to the input feature map.
Please see the git submodule for the 🔗source code
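For context, the sketch below is an illustrative im2col in Python (not part of the hardware design); it shows how the transformation flattens each patch of the feature map into a column so that convolution becomes the matrix multiply the processing array accelerates.

```python
# Minimal im2col sketch (illustrative only; not part of the hardware design).
import numpy as np

def im2col(fmap: np.ndarray, k: int) -> np.ndarray:
    """fmap: (H, W) single-channel feature map; k: square kernel size."""
    h, w = fmap.shape
    cols = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            cols.append(fmap[i:i + k, j:j + k].reshape(-1))
    return np.stack(cols, axis=1)            # shape: (k*k, num_patches)

# Convolution expressed as GEMM: flattened filter (row) times patch columns.
fmap = np.arange(16, dtype=np.int8).reshape(4, 4)
kernel = np.ones((3, 3), dtype=np.int8)
out = kernel.reshape(1, -1) @ im2col(fmap, 3)   # (1, num_patches)
```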
The synthesis results demonstrate distinct performance characteristics between the integer and floating-point processing arrays. The floating-point design achieves a higher maximum clock frequency (104 MHz vs 72 MHz) but consumes significantly more power and area resources.
Metric | Integer Array | FP Array | Ratio (FP/Int) |
---|---|---|---|
Power | 1.45 mW | 3.97 mW | 2.7x |
Max Clk Rate | 72 MHz | 104 MHz | 1.4x |
Area | 90 mm² | 127 mm² | 1.4x |
Gates | 4,898 | 7,948 | 1.6x |
Power per MHz | 20 µW/MHz | 38 µW/MHz | 1.9x |
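For reference, the Power per MHz row is simply the reported power divided by the maximum clock rate:

$$\frac{1.45\ \text{mW}}{72\ \text{MHz}} \approx 20\ \mu\text{W/MHz}, \qquad \frac{3.97\ \text{mW}}{104\ \text{MHz}} \approx 38\ \mu\text{W/MHz}$$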
Computational Throughput Analysis: Based on simulated timing results, the integer design completes operations in 1 cycle while the floating-point design requires 9 cycles on average. For comparison purposes, both throughput calculations assume a 3×3 processing array configuration:
- 🟢 Integer Array: 648 Mega OPs (72 MHz × 1 op/cycle × 9 PEs)
- 🔴 Floating-Point Array: 104 Mega FLOPs (104 MHz ÷ 9 cycles/op × 9 PEs)
Note: The actual floating-point implementation used a 2×2 array, but throughput is normalized to a 3×3 configuration for direct comparison with the integer design.
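The same normalization can be written out as a quick sanity check (a sketch using the figures above; the helper name is illustrative, and one multiply-accumulate counts as one operation):

```python
# Quick sanity check of the throughput figures reported above.
def throughput_mops(clock_mhz: float, cycles_per_op: float, num_pes: int) -> float:
    return clock_mhz / cycles_per_op * num_pes

print(throughput_mops(72, 1, 9))    # Integer array:        648.0 MOPS
print(throughput_mops(104, 9, 9))   # Floating-point array: 104.0 MFLOPS
```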
The results indicate that while the floating-point design offers higher precision for AI/ML workloads, it comes with substantial overhead in power consumption (2.7×) and computational latency (9× cycles per operation), making the integer design more suitable for power-constrained applications where 8-bit precision is sufficient.
⭐ Main Design Objective: My main goal for this project was to develop a hardware device that has the potential to beat software. To test this, I developed a Python benchmarking script that evaluates multiple CPU-based matrix multiplication approaches against the hardware target of 648 MOPS. The benchmark focuses specifically on 3×3 matrix operations to match the hardware processing array dimensions.
Benchmark Methodology: This Python script tests six different implementation approaches, each representing different levels of optimization commonly found in software:
- Pure Python (nested loops) - Basic triple-nested loop implementation representing unoptimized algorithmic approaches
- Unrolled Python - Manually unrolled loops eliminating loop overhead for small fixed-size matrices
- NumPy @ operator (float32) - Modern Python matrix multiplication using 32-bit floating point
- NumPy @ operator (int8) - Same operator but with 8-bit integers matching hardware data types
- NumPy dot function (float32) - Traditional NumPy dot product implementation
- NumPy with pre-allocated output - Optimized version eliminating memory allocation overhead during computation
Each benchmark separates data-access time from pure compute time to isolate computational performance from memory overhead. The script disables NumPy multithreading to ensure a fair single-core CPU comparison against the hardware design. Tests were run on a 14-core M4 Pro (Apple Silicon) with 24 GB of RAM.
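A minimal sketch of that timing approach for the pre-allocated int8 case is shown below (names and iteration counts are illustrative, not the actual script, and only the compute phase is timed):

```python
# Illustrative sketch of the benchmark timing approach (not the actual script).
# Thread limits must be set before importing NumPy to keep the run single-core.
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import time
import numpy as np

N = 3                       # 3x3 matrices, matching the processing array
MACS_PER_MATMUL = N ** 3    # one multiply-accumulate counted as one operation

def bench_preallocated_int8(iters: int = 100_000) -> float:
    a = np.random.randint(-128, 127, (N, N), dtype=np.int8)
    b = np.random.randint(-128, 127, (N, N), dtype=np.int8)
    out = np.empty((N, N), dtype=np.int8)    # pre-allocated output buffer
    start = time.perf_counter()
    for _ in range(iters):
        np.matmul(a, b, out=out)             # compute only; no allocation in the loop
    elapsed = time.perf_counter() - start
    return iters * MACS_PER_MATMUL / elapsed / 1e6   # MOPS

print(f"NumPy pre-allocated (int8): {bench_preallocated_int8():.1f} MOPS")
```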
The benchmark results demonstrate a clear performance hierarchy, with the best software implementation achieving 205.2 MOPS in pure compute performance:
Method | Total MOPS (including data access) | Compute MOPS | vs Hardware Ratio |
---|---|---|---|
Pure Python | 24.3 | 24.7 | 0.038x |
Unrolled Python | 49.4 | 51.0 | 0.079x |
NumPy @ (float32) | 117.3 | 126.5 | 0.195x |
NumPy @ (int8) | 130.6 | 141.1 | 0.218x |
NumPy dot (float32) | 162.6 | 180.8 | 0.279x |
NumPy pre-allocated | 181.2 | 205.2 | 0.317x |
Note: Manual loop unrolling provides a 2× improvement over basic Python loops, while leveraging optimized libraries (NumPy) yields an additional 4× improvement, demonstrating the value of both algorithmic and implementation optimization.
The integer processing array achieves 648 MOPS, representing a 3.2× speedup over the best software implementation (NumPy pre-allocated at 205.2 MOPS). ⭐ This demonstrates that the integer hardware design successfully meets its performance objective.
The floating-point array achieves 104 MFLOPS, which only beats the least-optimized software implementations. This clearly shows the cost of multi-cycle floating-point arithmetic in hardware. I knew shortly after synthesis that this would be the case, which is why I also developed an integer-based implementation.
Note: The theoretical throughput of 648 MOPS is highly dependent on optimized data paths and dedicated on-chip storage in the final design. During the short 10-week project I did not have time to design the entire chiplet, so these numbers are preliminary results, provided as a proof of concept.
The current GEMM processing array represents the computational core of a full CNN accelerator system, but several critical components must be developed to create a complete device.
Data Path Architecture: A complete implementation would require an integrated memory hierarchy with dedicated on-chip SRAM for weight storage (similar to SPOTS's 1 MB filter SRAM) and feature-map buffering. The system should include a hardware-based im2col transformation unit that streams input feature maps and generates patches on the fly, eliminating the redundant memory accesses inherent in software-based approaches.
Host Interface and Communication: The accelerator would interface with the host system through a PCIe connection, receiving preprocessed CNN model weights during initialization and streaming image/feature map data during inference. The device should implement local weight sorting and possibly compression for sparse data. A dedicated DMA controller would manage data transfers, while an on-chip control processor would coordinate layer-by-layer execution, handle different CNN topologies, and manage the dynamic reconfiguration of processing elements for varying layer dimensions.
The design approach for this future work draws significant inspiration from the SPOTS accelerator architecture presented in "An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix-matrix Multiplication" by Soltaniyeh et al., particularly their integrated Im2Col and GEMM pipeline design and sparse data handling techniques.