LLM-Assisted Hardware Design Portfolio
Welcome to my documentation wiki for exploring LLM-assisted hardware design. This portfolio demonstrates how modern AI tools can accelerate the development of specialized computing architectures, focusing on CNN acceleration through parallel processing arrays.
Disclaimer: This project extensively leverages LLM assistance for code generation, design exploration, and documentation.
Background
Core Project: Parallel Matrix Processing Array
My design process began with identifying performance bottlenecks in CNN workloads through benchmarking of the AlexNet architecture. As expected, matrix multiplication operations emerged as the primary computational constraint. This led to my core design goal: accelerate matrix operations in hardware to outperform software-based implementations.
To address this challenge, I developed an architecture for matrix operations:
Parallel Matrix Processing Array: A broadcast-based architecture where input vectors are distributed simultaneously to all processing elements. Each PE independently computes one element of the result matrix using dedicated MAC units. This design prioritizes simplicity and predictable timing over the data reuse advantages of systolic arrays.
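To make the broadcast structure concrete, below is a minimal sketch of one such processing element. It is not the repository's RTL; the module name, port names, and widths are illustrative assumptions. Each cycle, the PE multiplies the operand pair broadcast to it and adds the product into a dedicated accumulator that holds its element of the result matrix.

```systemverilog
// Minimal PE sketch (illustrative, not the project's RTL): both operands are
// broadcast to every PE, and each PE accumulates one element of the result
// matrix in its own MAC register.
module pe #(
  parameter int IN_W  = 8,   // 8-bit operands, per the precision choice below
  parameter int ACC_W = 24   // wider accumulator so the dot product cannot overflow
) (
  input  logic                    clk,
  input  logic                    rst_n,
  input  logic                    en,      // asserted when the broadcast operand pair is valid
  input  logic signed [IN_W-1:0]  a_in,    // broadcast element from the A operand
  input  logic signed [IN_W-1:0]  b_in,    // broadcast element from the B operand
  output logic signed [ACC_W-1:0] acc_out  // one element of the result matrix
);
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n)  acc_out <= '0;
    else if (en) acc_out <= acc_out + a_in * b_in;  // multiply-accumulate
  end
endmodule
```

A full array instantiates one PE per output element (nine for the 3×3 integer design) and steps the broadcast operands through the shared inner dimension, which is what gives the design its simple, predictable timing.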
Key Design Decisions
Precision Selection: 8-bit Arithmetic
I chose 8-bit precision based on industry trends toward lower numerical precision for deep learning inference. As Rodriguez et al. demonstrate in their Intel white paper, "Lower Numerical Precision Deep Learning Inference and Training" [1], 16-bit multipliers for training and 8-bit multipliers for inference achieve minimal to no loss in accuracy while providing significant benefits such as:
- Improved memory bandwidth utilization - Reduces bandwidth bottlenecks that limit performance
- Better cache efficiency - More data fits in limited cache resources
- Hardware efficiency gains - Smaller multipliers require less silicon area and power
This trend toward 8-bit inference has become standard across the industry, making it the natural choice for my accelerator design.
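One sizing consequence of the 8-bit choice (my own back-of-envelope, not a figure from the project): a signed 8×8-bit product fits in 16 bits, so accumulating a length-K dot product without overflow needs roughly

$$16 + \lceil \log_2 K \rceil \ \text{bits}$$

(about 18 bits for a length-3 dot product), which is why MAC accumulators are sized wider than the 8-bit operands, as in the PE sketch above.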
Architecture Selection: Integer vs. Floating-Point
I initially explored both floating-point and integer implementations to understand the performance trade-offs:
Floating-Point (E4M3) Results:
- 2×2 matrix array: Achieved 104 MFLOPS
- Performance: Actually slower than the software implementation
- Conclusion: Too much overhead for small matrix operations
Integer Implementation Results:
- 3×3 matrix array: Achieved 648 million operations per second (MOPS)
- Performance: 3.2x speedup over optimized software baseline
- Inspiration: Follows Google's TPU architecture philosophy of using 8-bit integer multipliers for high throughput (the first-generation TPU achieves ~92 TeraOPS [2])
The integer approach proved superior for this application scale, demonstrating that simpler arithmetic can achieve better performance for inference workloads.
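One way to relate the MOPS figure back to the array (again my own back-of-envelope, not a number reported by the project): a fully utilized array of $N_{\mathrm{PE}}$ MAC units, each performing a multiply and an add per cycle, sustains

$$\text{throughput} \approx N_{\mathrm{PE}} \times 2 \times f_{\mathrm{clk}},$$

so 648 MOPS from nine PEs corresponds to roughly 36 million MACs per second per PE; the measured number also absorbs data movement and any idle cycles, so the underlying clock may be higher.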
Key Achievements:
- ✅ Two synthesized implementations: 8-bit E4M3 floating-point (2×2) and 8-bit integer (3×3)
- ✅ Complete ASIC flow from RTL to GDSII using OpenLane 2 (for both designs)
- ✅ Custom testing framework with assertion-based verification and waveform generation (see the checker sketch below)
- ✅ Validation against software baselines, achieving a 3.2x speedup
View summary of results here.
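For flavor, here is the kind of assertion-based end-of-run check such a framework might perform. The actual framework's structure, signal names, and result widths are not documented on this page, so everything in the sketch is a hypothetical stand-in.

```systemverilog
// Hypothetical checker sketch: after a matrix multiply completes, compare
// every PE accumulator against a golden result computed in software.
// Names and widths are illustrative, not taken from the project.
module result_checker;
  logic clk, done;
  logic signed [23:0] acc_out  [3][3];  // DUT result matrix (one value per PE)
  logic signed [23:0] expected [3][3];  // golden model result

  always @(posedge clk) begin
    if (done) begin
      foreach (acc_out[r, c])
        assert (acc_out[r][c] == expected[r][c])
          else $error("PE[%0d][%0d]: got %0d, expected %0d",
                      r, c, acc_out[r][c], expected[r][c]);
    end
  end
endmodule
```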
Course Context
- Course: ECE 510: Hardware for AI/ML
- Institution: Portland State University
- Term: Spring 2025
Course Description (From Syllabus): "Hardware (HW) is the foundation upon which artificial intelligence (AI) and machine learning (ML) systems are built. It provides the necessary computational power, efficiency, and flexibility to drive innovation in these emerging fields. By using HW/SW co-design, students will learn how to use, design, simulate, optimize, and evaluate specialized HW, such as GPUs, TPUs, FPGAs, and neuromorphic chips, for modern AI/ML algorithms. The intersection of HW and AI/ML is a rapidly growing field with significant career opportunities for computer engineers."
Getting Started
If you are interested in learning more about the GEMM Unit I've developed, start here.
Current Projects
A list of all projects completed during this 10-week course:
- SpikingNeuronArray - Implementation of a spiking neuron array in SystemVerilog; this coding challenge was inspired by "Designing Silicon Brains using LLM: Leveraging ChatGPT for Automated Description of a Spiking Neuron Array".
- AlexNetProfiling - Benchmarks the AlexNet architecture on real-time inference tasks to identify software bottlenecks. As expected, the results showed that convolution operations were the most computationally intensive and the best candidates for specialized hardware.
- CNN-Accelerator - The main project of the course: a broadcast-based parallel processing array.
- E4M3Background - Explains NVIDIA's E4M3 floating-point format and describes how to implement an adder and multiplier for it in hardware (see the format sketch below).
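As a companion to the E4M3Background page, here is a minimal sketch of the format's bit layout. The field widths and bias follow the published FP8 definition; the typedef and function names are assumptions of mine, not identifiers from the project.

```systemverilog
// E4M3 layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
// E4M3 trades infinities away for range; the all-ones exponent and mantissa
// pattern encodes NaN.
typedef struct packed {
  logic       sign;      // bit 7
  logic [3:0] exponent;  // bits 6:3, biased by 7
  logic [2:0] mantissa;  // bits 2:0
} e4m3_t;

// Unpack a raw byte into fields. Normal values have magnitude
// 2^(exponent-7) * (1 + mantissa/8); exponent == 0 encodes subnormals
// with magnitude 2^(-6) * (mantissa/8).
function automatic e4m3_t e4m3_unpack(input logic [7:0] raw);
  e4m3_t f;
  f.sign     = raw[7];
  f.exponent = raw[6:3];
  f.mantissa = raw[2:0];
  return f;
endfunction
```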
Development Methodology
- Development-Process - Overview of my design and implementation workflow
- LLM-Assistance-Methodology - How I leverage LLMs in the hardware design process
References
[1] Rodriguez, A., Segal, E., Meiri, E., Fomenko, E., Kim, Y. J., Shen, H., & Ziv, B. (2018). Lower Numerical Precision Deep Learning Inference and Training. Intel Corporation White Paper.
[2] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., ... & Yoon, D. H. (2017). In-datacenter performance analysis of a tensor processing unit. Proceedings of the 44th Annual International Symposium on Computer Architecture, 1-12.