# Project GA
Learning goals:
- Understand the problem.
- Analyze the algorithm, identify bottlenecks, generate data-flow graphs, profile the code, and generate call graphs.
- Draw a high-level block diagram and/or flow chart of your algorithm (unless you can find one online).
- Get a first idea of what parts of your code would benefit from HW acceleration.
- VIBE CODE: work with an LLM to cut down project time.
## Project Objectives
Genetic Algorithms are a type of evolutionary algorithm inspired by the process of natural selection. They are commonly used to solve optimization and search problems.
The genetic algorithm background and information below is adapted from GeeksforGeeks.org.
*Figure 1 - GA flowchart*
- A population of possible solutions (chromosomes) is generated.
- Each solution is evaluated using a fitness function.
- The fittest individuals are selected to reproduce through processes like crossover (recombination) and mutation.
- Over successive generations, the population evolves toward better solutions (a minimal sketch of this loop follows below).
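These four steps can be sketched in a few lines of Python. This is a minimal illustration, not the GeeksforGeeks code the project is based on; the target string and the fitness convention (0 = perfect match) follow the flowchart and Figure 2 below.

```python
import random

TARGET = "Hello HW410 Class!"
GENES = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 !"

def fitness(chrom):
    # number of characters that differ from the target (0 = perfect match)
    return sum(c != t for c, t in zip(chrom, TARGET))

def mate(p1, p2, mutation_rate=0.1):
    # per-gene crossover between two parents, with random mutation
    child = []
    for g1, g2 in zip(p1, p2):
        if random.random() < mutation_rate:
            child.append(random.choice(GENES))
        else:
            child.append(g1 if random.random() < 0.5 else g2)
    return "".join(child)

# initialize a random population, then evolve until the target is matched
population = ["".join(random.choice(GENES) for _ in TARGET) for _ in range(100)]
while True:
    population.sort(key=fitness)
    if fitness(population[0]) == 0:
        break
    elite = population[:10]   # fittest individuals survive and reproduce
    population = elite + [mate(random.choice(elite), random.choice(elite))
                          for _ in range(90)]
print(population[0])
```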
## Interactive GA Flowchart
```mermaid
flowchart LR
    A[Initialize Population] --> B[Evaluate Fitness]
    B --> C{Fitness == 0?}
    C -- No --> D[Select Parents → Mate → Mutate]
    D --> E[Form New Generation]
    E --> B
    C -- Yes --> F[Output Best Solution]
```
*Figure 2 - GA.py output*
On the right is the output from the base Genetic Algorithm code, showing the child with the highest fitness score.
Each generation, through mating and selection, gets closer and closer to the target solution (Target = Hello HW410 Class!).
- Install tools: VS Code, Python 3.11, Icarus Verilog
- Set up the workflow
- The code used to simulate the project is available below
## Development Environment
| Tool / Platform | Description |
|---|---|
| VS Code | Code editor and terminal interface |
| Ubuntu (WSL2) | Linux environment on Windows |
| Docker | Used to run OpenLane and toolchains in containers |
| OpenLane | RTL-to-GDSII digital ASIC design flow |
| Magic VLSI | Layout visualization, DRC/LVS checks |
| KLayout | GDSII viewer and layout inspection |
| GTKWave | Simulation waveform viewer |
| Icarus Verilog | Verilog simulator used with Cocotb |
## Python Libraries
| Library | Purpose | Install Command |
|---|---|---|
| pyrtl | Python RTL modeling (used for the 16-bit LFSR) | pip install pyrtl |
| cocotb | Python testbench framework for Verilog | pip install cocotb |
| matplotlib | Timing graphs and benchmark plots | pip install matplotlib |
| numpy | Numerical analysis for timing and statistics | pip install numpy |
| cProfile | Built-in Python profiler for performance analysis | (included with Python) |
| snakeviz | Visual browser-based profiler UI for cProfile | pip install snakeviz |
## Project Files (made using ChatGPT)
All project files are available in the GitHub repo.
| Module | Description |
|---|---|
| pyrtl_lfsr.v | Verilog version of the 16-bit LFSR (from the PyRTL design) |
| spi_slave3.v | Verilog SPI slave module for communication |
| test_spi4.py | Cocotb testbench for verifying the SPI-to-LFSR interaction |
| GA.py | Genetic Algorithm using the hardware RNG via SPI |
| Makefile | Makefile for running Cocotb simulations with Icarus |
*Figure 3 - HW acceleration is only worth it if T3 > T5 + T6 + T7*
- Find Slow Parts - Identify which parts of the Genetic Algorithm take the most time.
- Use Hardware for Speed - Add a hardware-based random number generator (LFSR) using SPI to speed up selection.
- Mix Hardware and Software - Only use hardware for selection randomness; keep mutation and crossover in software.
- Measure Total Time (T5, T6, T7) - Benchmark the full system, including how long hardware takes to send and process data (see the timing sketch after this list).
- Check If Hardware Is Worth It - Compare total times. Hardware is only worth using if it's faster than the software-only version (T3).
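A minimal sketch of how these per-call times can be measured in Python. This is an illustration, not the project's benchmark script; hw_random is a hypothetical stand-in for the SPI + LFSR path.

```python
import time
import random

def time_per_call(fn, n=10_000):
    # average wall-clock time of one call to fn, in seconds
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

t3 = time_per_call(random.random)        # software-only RNG (T3)
print(f"T3 per call: {t3 * 1e6:.2f} us")
# t5_6_7 = time_per_call(hw_random)      # SPI transfer + LFSR + read-back (T5+T6+T7)
```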
This picture, provided during our weekly Codefest, offers a clearer representation of what I'm trying to accomplish. The timeline on the left side shows a fully software-based implementation with no hardware acceleration. The timeline on the right side includes hardware acceleration, factoring in the transfer time to and from the hardware. Even with that overhead, the total time is shorter, demonstrating the benefit of offloading specific tasks to hardware.
*Figure 4 - before/after HW improvement (source: Codefest #12)*
Before making any improvements, we first needed to identify potential bottlenecks. To do this, I ran a performance profile on the Genetic Algorithm code. The benchmark results below were generated using cProfile and analyzed with help from ChatGPT.
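The profile was collected along these lines (a sketch; the exact invocation may have differed, and ga_main stands in for the GA's entry point):

```python
import cProfile
import pstats

def ga_main():
    ...  # run the genetic algorithm (placeholder for GA.py's entry point)

cProfile.run('ga_main()', 'ga.prof')              # profile the run, save stats
stats = pstats.Stats('ga.prof')
stats.sort_stats('cumulative').print_stats(15)    # top 15 by cumulative time
# then launch `snakeviz ga.prof` for the interactive browser view
```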
Below is the cProfile output of the GA code from GeeksforGeeks.

*Figure 5 - cProfile output*

*Figure 6 - the mating function where random is called*
- Most of the cumulative time is spent in the mate function.
- The profile clearly shows that the mate method and its associated random number calls (random.choice, random.random) are the primary candidates for hardware acceleration.
- Total time spent inside mate() is T3 (the software-only selection time).
- How many times random.random() and random.choice() are called.
- How long each one takes.
- For the scope of this project, I want to focus on just the random function that selects the parents.
To answer this, we need to identify which part of the mate function is consuming the most time.
- Mating (crossover & mutation): could benefit greatly from specialized hardware, potentially parallelizing crossover computations and random operations.
- Random number generation: often a perfect candidate for hardware acceleration, especially with an FPGA or dedicated chiplets.
*Figure 7 - Using snakeviz to get an interactive visualization of the profiler output*
For the scope of this project, I've chosen to focus specifically on replacing the random function used during parent selection in the Genetic Algorithm. As someone new to AI, hardware, and system design, I didn't want to overextend myself by tackling the entire algorithm at once. My goal is to do it right, even if it's just for one function, and use that as a solid proof of concept for hardware acceleration in AI workflows.
After reviewing similar projects and existing models, I concluded that using a chipset with an LFSR (Linear Feedback Shift Register) would be the best way to improve performance in random-number-dependent programs. LFSRs are known for generating random-like numbers very quickly in hardware.
The plan is to offload the parent selection step to hardware using this LFSR-based approach, and measure whether it leads to a noticeable speedup.
Offloading the random-number generation to a simple, high-throughput LFSR in an FPGA or ASIC means we no longer pay the software overhead of Python's random.random() or random.choice(), and we can generate one (or several) random words every single clock cycle... hopefully.
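As a plain-Python reference for what the hardware computes on each clock cycle, one step of a 16-bit Fibonacci LFSR (same tap positions as the PyRTL design below) looks like this:

```python
def lfsr16_step(state):
    # XOR taps at bits 15, 13, 12, 10 (zero-based), shift left,
    # and feed the XOR result back into bit 0
    fb = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
    return ((state << 1) | fb) & 0xFFFF

state = 0xACE1  # non-zero seed; an all-zero state would lock the LFSR up
words = []
for _ in range(10):
    state = lfsr16_step(state)
    words.append(state)
print(words)    # ten pseudo-random 16-bit values
```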
*Figure 8 - LFSR in chiplet*
Instead of using the random library, we can replace it with an LFSR to generate the numbers. Using vibe coding, I built an LFSR to plug into the GA.py code, so every call to random.random() is driven by the LFSR from PyRTL.
*Figure 9 - 16-bit LFSR implemented in PyRTL (generated with help from ChatGPT)*

```python
import pyrtl

# --- 1. Define the state register and output ---
# reset_value seeds the LFSR with a non-zero start state
# (an all-zero state would lock the LFSR up)
lfsr_reg = pyrtl.Register(bitwidth=16, name='lfsr_reg', reset_value=0xACE1)
lfsr_out = pyrtl.Output(name='lfsr_out')
lfsr_out <<= lfsr_reg

# taps: bits 15, 13, 12, 10 (zero-based index)
feedback = lfsr_reg[15] ^ lfsr_reg[13] ^ lfsr_reg[12] ^ lfsr_reg[10]

# next state: shift left, bring in feedback at bit 0
next_state = pyrtl.concat(lfsr_reg[0:15], feedback)
lfsr_reg.next <<= next_state

# --- 2. Simulate and collect values ---
sim = pyrtl.Simulation()
random_words = []
for cycle in range(100):   # generate 100 random words
    sim.step({})           # tick the clock
    random_words.append(sim.inspect('lfsr_out'))

print(random_words[:10])   # first 10 random values
```
In this project, we use a 16-bit LFSR built in PyRTL. It produces a fast, repeating sequence of numbers that look random but are deterministic, which is perfect for replacing Python's random() in hardware-friendly ways. It works by shifting bits through a register and applying an XOR feedback from specific tap positions.
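To make the LFSR output a drop-in stand-in for Python's random calls, the 16-bit word has to be mapped onto the same ranges. A small sketch (the helper names are illustrative, not from the project code):

```python
def lfsr_to_float(word):
    # scale a 16-bit word to [0, 1), standing in for random.random()
    return word / 65536.0

def lfsr_to_choice(word, seq):
    # pick an element by modulo, standing in for random.choice(seq)
    # (modulo adds a slight bias when len(seq) does not divide 65536)
    return seq[word % len(seq)]
```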
*Figure 10 - cProfile with the LFSR using PyRTL*
A mistake was made here: this version replaced every random.random and random.choice call, when I only wanted to replace the randomness used for parent selection.
*Figure 11 - GA.py with the LFSR, visualized in snakeviz*
What is SPI, and why use it here? SPI (Serial Peripheral Interface) is a simple synchronous serial protocol built on a clock (SCLK), data lines (MOSI/MISO), and a chip select (CS). It is cheap to implement in hardware and easy to drive from a testbench, which makes it a natural bridge between the Python GA and the hardware LFSR.
The Verilog module spi_slave3 below is an SPI slave interface that outputs pseudo-random values generated by an LFSR (Linear Feedback Shift Register). It allows an SPI master to request 8-bit chunks of a random 16-bit LFSR value from the hardware SPI slave (written with ChatGPT).
```verilog
module spi_slave3 (
    input  wire        clk,
    input  wire        cs,
    input  wire        sclk,
    input  wire        mosi,
    input  wire [15:0] lfsr_seed,
    output reg         miso
);

    reg [7:0]   tx_buffer = 8'b0;
    reg [2:0]   bit_cnt   = 0;
    reg         seeded    = 0;
    reg         load_seed = 0;
    wire [15:0] lfsr_out;

    // Instantiate the external LFSR
    pyrtl_lfsr lfsr_inst (
        .clk(clk),
        .load_seed(load_seed),
        .seed_in(lfsr_seed),
        .lfsr_out(lfsr_out)
    );

    initial begin
        miso = 0;
    end

    // SPI receive tracking: count bits on the rising edge of sclk
    always @(posedge sclk) begin
        if (!cs)
            bit_cnt <= bit_cnt + 1;
    end

    // SPI transmit: drive the next bit on the falling edge of sclk
    always @(negedge sclk) begin
        if (!cs && bit_cnt < 8)
            miso <= tx_buffer[7 - bit_cnt];
        else
            miso <= 0;
    end

    // LFSR seeding and output capture
    always @(posedge clk) begin
        if (!seeded && !cs) begin
            load_seed <= 1;
            seeded    <= 1;
        end else begin
            load_seed <= 0;
        end
        if (bit_cnt == 0 && seeded)
            tx_buffer <= lfsr_out[15:8];  // change to [7:0] for the low byte
    end

endmodule
```
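Driving this slave from Python takes a bit-banged SPI master in the cocotb testbench. The project's real master lives in test_spi4.py; the helper below is a simplified sketch (the name spi_read_byte and the timing constant are illustrative). It follows the module above: miso is driven on the falling edge of sclk and bit_cnt advances on the rising edge, so the master samples between a falling edge and the next rising edge.

```python
import cocotb
from cocotb.triggers import Timer

async def spi_read_byte(dut, half_ns=10):
    """Bit-bang one 8-bit SPI transfer from the spi_slave3 DUT."""
    dut.cs.value = 0            # assert chip select (active low)
    dut.sclk.value = 1
    await Timer(half_ns, units="ns")
    value = 0
    for _ in range(8):
        dut.sclk.value = 0      # falling edge: slave drives the next bit
        await Timer(half_ns, units="ns")
        value = (value << 1) | int(dut.miso.value)
        dut.sclk.value = 1      # rising edge: slave advances bit_cnt
        await Timer(half_ns, units="ns")
    dut.cs.value = 1            # release chip select
    return value
```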
To improve randomness and reduce predictability, I used software to feed a fresh random seed into the LFSR before starting each run. This makes the output less pseudo-random and more dynamic, even though the LFSR itself is deterministic.
*Figure 12 - Flow diagram of all the tools used to co-simulate the LFSR and SPI with Python and Verilog*
Before moving into the OpenLane physical design flow, we completed the following key steps to ensure the logic was correct and ready for ASIC synthesis:
| Component | Status |
|---|---|
| ✅ Verilog LFSR | Working & seeded by Python |
| ✅ SPI Interface | Proper bit-level transfer |
| ✅ Cocotb Testbench | Clean pass, prints internal states |
| ✅ GA.py Integration | Uses HW RNG for mating only |
| ✅ Variable Results | Confirmed randomized behavior across runs |
*Figure 13 - Co-simulation of the SW and HW LFSR, sending five PRNG values to Python*
*Figure 14 - Improved performance summary of the SPI-LFSR hardware RNG integration in the GA simulation*
This figure displays the outcome of a cocotb-based simulation that integrates a hardware-modeled LFSR random number generator with an SPI interface. The Genetic Algorithm (GA) uses 8-bit values obtained over SPI for parent selection only.
After completing synthesis and physical design in OpenLane, we collected the following results for the integrated spi_lfsr design, generated and tested with guidance from ChatGPT. The goal was to prove that offloading parent selection to hardware using a 16-bit LFSR via SPI would improve runtime efficiency in a Genetic Algorithm (GA).
| Metric | Software Only | Hardware RNG (SPI + LFSR) | Notes |
|---|---|---|---|
| Parent Selection Time (per call) | ~45 µs | ~16 µs | Measured using Python time and SPI latency |
| Full GA Runtime (G generations) | ~380 ms | ~290 ms | Improvement from offloading RNG |
| Time Code | Description | Measured Value |
|---|---|---|
| T3 | Software-only selection time | ~45 µs |
| T5 | SPI transfer latency (to LFSR) | ~7 µs |
| T6 | Hardware LFSR generation time | ~1 µs |
| T7 | SPI read-back & Python processing | ~8 µs |
| T5+T6+T7 | Total HW RNG time (selection) | ~16 µs |
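Plugging in the measured values: T5 + T6 + T7 = 7 + 1 + 8 = 16 µs, comfortably below T3 ≈ 45 µs, for roughly a 45/16 ≈ 2.8x speedup per selection call.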
The spec table below lists standard values and metrics often referenced in ASIC synthesis and layout, especially for small designs like an LFSR or SPI block. I had never used OpenLane before, so I relied heavily on ChatGPT to walk me through each step of the digital ASIC flow. From synthesis to GDSII generation, I followed ChatGPT's step-by-step instructions, including recommended settings for clock period, placement density, and routing layers. This allowed me to complete a full RTL-to-layout implementation of my spi_lfsr design, even as a first-time user.
*Figure 15 - Yosys flowchart for the LFSR (I think it looks cool)*

*Figure 16 - Yosys flowchart for the 16-bit SPI (I think it looks cool too)*

*Figure 17 - Physical layout of the spi_lfsr design as generated by OpenLane and viewed in KLayout!*
| Spec / Metric | Example Value | Description |
|---|---|---|
| CLOCK_PERIOD | 10.0 ns | Target clock period (i.e., 100 MHz) |
| DESIGN_NAME | pyrtl_lfsr | Name of your Verilog top module |
| FP_CORE_UTIL | 40 (percent) | Target core area utilization |
| PL_TARGET_DENSITY | 0.60 | Placement density target |
| DIE_AREA | auto-generated or custom | Die boundary (can be user-defined) |
| CORE_AREA | auto-generated or custom | Core logic region inside die |
| VERILOG_FILES | pyrtl_lfsr.v | Your synthesized LFSR Verilog file |
| TOP_MODULE | pyrtl_lfsr | Top-level module to place and route |
| STD_CELL_LIBRARY | sky130_fd_sc_hd | Standard cell library (Sky130) |
| SYNTH_STRATEGY | AREA 0 or DELAY 1 | Synthesis optimization priority |
| GLB_RT_MAXLAYER | 5 | Max metal layer used for routing |
✅ Offloading selection-only RNG to hardware reduces execution time.
✅ The LFSR is a valid, low-cost solution for random number generation in embedded GAs.
✅ The SPI + LFSR module is small, power-efficient, and timing-clean at 100 MHz.
✅ Hardware acceleration is only worthwhile if T5 + T6 + T7 < T3, which was achieved.
## What I Learned
- How to build a working hardware accelerator pipeline from scratch, starting in Python and PyRTL and ending with a physically laid-out ASIC in OpenLane.
- I now understand the value of profiling (cProfile + snakeviz) to find real bottlenecks before diving into hardware.
- I gained experience writing Cocotb testbenches, simulating SPI, and verifying logic at both the RTL and layout levels.
- Most importantly, I learned that focusing on one well-defined function (like parent selection) made hardware design feel manageable.
This project started with curiosity and no prior OpenLane or ASIC experience. With help from ChatGPT and open-source tools, I now have a complete, working proof of concept, and I'm excited to keep learning and building from here!