Project GA - jtristan123/HW-for-AI-ML-ECE-410 GitHub Wiki

Genetic-Algorithm

PROJECT INTRO

Learning goals:

  • Understand the problem.
  • Analyze the algorithm, identify bottlenecks, generate data-flow graphs, profile the code, and generate call graphs.
  • Draw a high-level block diagram and/or flow chart of your algorithm (unless you can find one online).
  • Get a first idea of what parts of your code would benefit from a HW acceleration.
  • VIBE CODE - work with an LLM to cut down project time

Project objectives

๐Ÿ Project Objectives โ€” click to expand

Algorithm overview/background

Genetic Algorithms are a type of evolutionary algorithm inspired by the process of natural selection. They are commonly used to solve optimization and search problems.

The code can be found at this link

GeeksforGeeks.org provided the genetic algorithm background and information.

GA mating
Figure 1 - GA flowchart

GAs work by mimicking biological evolution:

  • A population of possible solutions (chromosomes) is generated.
  • Each solution is evaluated using a fitness function.
  • The fittest individuals are selected to reproduce through processes like crossover (recombination) and mutation.
  • Over successive generations, the population evolves toward better solutions.
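The four steps above can be sketched as a minimal Python GA targeting the same string as this project. This is an illustrative toy version, not the actual GA.py; the population size, 10% elitism split, and 10% mutation rate are my own assumptions:

```python
import random

TARGET = "Hello HW410 Class!"
GENES = "abcdefghijklmnopqrstuvwxyz" \
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 !"

def fitness(chromosome):
    # Number of characters differing from the target; 0 means solved
    return sum(c != t for c, t in zip(chromosome, TARGET))

def mate(p1, p2, mutation_rate=0.10):
    # Per gene: small chance of mutation, otherwise copy from a parent
    child = []
    for g1, g2 in zip(p1, p2):
        r = random.random()
        if r < mutation_rate:
            child.append(random.choice(GENES))
        elif r < (1 + mutation_rate) / 2:
            child.append(g1)
        else:
            child.append(g2)
    return "".join(child)

# 1. Generate a population of random chromosomes
population = ["".join(random.choice(GENES) for _ in TARGET)
              for _ in range(100)]
generation = 0

while True:
    # 2. Evaluate fitness and keep the fittest individuals
    population.sort(key=fitness)
    if fitness(population[0]) == 0:
        break
    # 3. Elitism plus crossover/mutation from the fittest 10%
    elite = population[:10]
    population = elite + [mate(random.choice(elite), random.choice(elite))
                          for _ in range(90)]
    # 4. The new generation replaces the old one
    generation += 1

print(f"Converged at generation {generation}: {population[0]}")
```

Note how every child costs several `random.random()` and `random.choice()` calls; that is exactly the traffic this project later tries to offload to hardware.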

🤖 Interactive GA flowchart - click to expand
flowchart LR
  A[Initialize Population] --> B[Evaluate Fitness]
  B --> C{Fitness == 0?}
  C -- No --> D[Select Parents → Mate → Mutate]
  D --> E[Form New Generation]
  E --> B
  C -- Yes --> F[Output Best Solution]

Output

output
Figure 2 - GA.py Output

On the right is the output from the base Genetic Algorithm code, showing the child with the highest fitness score.
Each generation, through mating and selection, gets closer and closer to the target solution (Target = Hello HW410 Class!).


Now that we know a bit about the code, let's try to improve it using hardware!


โธ๏ธ Before you start... guide

  • Install tools - VS Code, Python 3.11, Icarus Verilog
  • Setting up work-flow
  • If you want access to the code used to simulate the project 👇

Tools used:

💻 Development Environment - click to expand

| 🧰 Tool / Platform | 🔍 Description |
|---|---|
| VS Code | 🧠 Code editor and terminal interface |
| Ubuntu (WSL2) | 🐧 Linux environment on Windows |
| Docker | 🐳 Used to run OpenLane and toolchains in containers |
| OpenLane | 🏭 RTL-to-GDSII digital ASIC design flow |
| Magic VLSI | 🧙‍♂️ Layout visualization, DRC/LVS checks |
| KLayout | 🗺️ GDSII viewer and layout inspection |
| GTKWave | 📈 Simulation waveform viewer |
| Icarus Verilog | 🛠️ Verilog simulator used with Cocotb |
๐Ÿ Python Libraries โ€” click to expand
๐Ÿ“š Library ๐Ÿ’ก Purpose ๐Ÿ”ง Install Command
pyrtl ๐Ÿ› ๏ธ Python RTL modeling (used for 16-bit LFSR) pip install pyrtl
cocotb ๐Ÿงช Python testbench framework for Verilog pip install cocotb
matplotlib ๐Ÿ“Š Timing graphs and benchmark plots pip install matplotlib
numpy โž• Numerical analysis for timing and statistics pip install numpy
cProfile โฑ๏ธ Built-in Python profiler for performance analysis (included with Python)
snakeviz ๐Ÿ๐Ÿ‘๏ธ Visual browser-based profiler UI for cProfile pip install snakeviz
📂 Project files made using ChatGPT - click to expand

📂 All project files are available in the GitHub repo here

| 🧩 Module | 🔍 Description |
|---|---|
| pyrtl_lfsr.v | Verilog version of 16-bit LFSR (from PyRTL design) |
| spi_slave3.v | Verilog SPI slave module for communication |
| test_spi4.py | Cocotb testbench for verifying SPI ↔ LFSR interaction |
| GA.py | Genetic Algorithm using hardware RNG via SPI |
| Makefile | Makefile for running Cocotb simulations with Icarus |

Project goal

software with HW acceleration
Figure 3 - HW acceleration is only worth it if T3 > T5 + T6 + T7

  • Find Slow Parts - Identify which parts of the Genetic Algorithm take the most time.
  • Use Hardware for Speed - Add a hardware-based random number generator (LFSR) using SPI to speed up selection.
  • Mix Hardware and Software - Only use hardware for selection randomness; keep mutation and crossover in software.
  • Measure Total Time (T5, T6, T7) - Benchmark the full system, including how long hardware takes to send and process data.
  • Check If Hardware Is Worth It - Compare total times. Hardware is only worth using if it's faster than the software-only version (T3).
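T3 in that last bullet can be estimated before any hardware exists. A minimal sketch using Python's timeit; this is generic illustration code, not the project's benchmark script:

```python
import random
import timeit

# A stand-in population; GA.py selects parents with random.choice
population = [f"individual{i}" for i in range(100)]

# Average the cost of one random.choice call over many runs:
# this is the per-call software contribution to T3
n = 200_000
total = timeit.timeit(lambda: random.choice(population), number=n)
t3_per_call = total / n
print(f"random.choice: {t3_per_call * 1e6:.3f} us per call")
```

The same pattern with `random.random()` gives the other half of the selection-randomness cost, so the software baseline is known before the SPI and LFSR timings are measured.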


This picture, provided during our weekly Codefest, offers a clearer representation of what I'm trying to accomplish. The timeline on the left side shows a fully software-based implementation with no hardware acceleration. The timeline on the right side includes hardware acceleration, factoring in the transfer time to and from the hardware. Even with that overhead, the total time is shorter, demonstrating the benefit of offloading specific tasks to hardware.

Figure 4 - before/after HW improvement
source: Codefest #12

Profile/benchmark raw algorithm

Finding where to improve

Before making any improvements, we first needed to identify potential bottlenecks. To do this, I ran a performance profile on the Genetic Algorithm code. The benchmark results below were generated using cProfile and analyzed with help from ChatGPT.

Below is the cProfile output of the GA code from GeeksforGeeks.

Figure 5 - cProfile output

   Ordered by: cumulative time
   List reduced from 136 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/1    0.000    0.000    0.398    0.398 {built-in method builtins.exec}
        1    0.000    0.000    0.398    0.398 GA.py:1(<module>)
        1    0.021    0.021    0.396    0.396 GA.py:86(main)
    15030    0.125    0.000    0.289    0.000 GA.py:42(mate)
    62326    0.038    0.000    0.106    0.000 C:\Users\Skyline V2\AppData\Local\Programs\Python\Python312\Lib\random.py:341(choice)
    32266    0.010    0.000    0.065    0.000 GA.py:24(mutated_genes)
    62326    0.033    0.000    0.053    0.000 C:\Users\Skyline V2\AppData\Local\Programs\Python\Python312\Lib\random.py:242(_randbelow_with_getrandbits)
   315786    0.047    0.000    0.047    0.000 {method 'append' of 'list' objects}
   300600    0.033    0.000    0.033    0.000 {method 'random' of '_random.Random' objects}
    15130    0.005    0.000    0.025    0.000 GA.py:20(__init__)
Figure 6 - the mating function where random is called

  • Most of the cumulative time is spent in the mate function.
  • The profile clearly shows the mate method and its associated random number calls (random.choice, random.random) are the primary candidates for hardware acceleration.
  • Total time spent inside mate() → this is T3 (software-only selection time).
  • How many times random.random() and random.choice() are called.
  • How long each one takes.
  • For the scope of this project I want to focus on just the random function selecting the parents.

HW accelerator

Where should I consider HW acceleration?

To answer this, we need to identify which part of the mate function is consuming the most time.

  • Mating (crossover & mutation): Could benefit greatly from specialized hardware, potentially parallelizing crossover computations and random operations.

  • Random number generation: Often a perfect candidate for hardware acceleration, especially with FPGA or dedicated chiplets.

Using SnakeViz

Figure 7 - Using SnakeViz to get an interactive visualization of the profiler

✋ Project Scope Notes

For the scope of this project, I've chosen to focus specifically on replacing the random function used during parent selection in the Genetic Algorithm. As someone new to AI, hardware, and system design, I didn't want to overextend myself by tackling the entire algorithm at once. My goal is to do it right - even if it's just for one function - and use that as a solid proof of concept for hardware acceleration in AI workflows.

After reviewing similar projects and existing models, I concluded that using a chiplet with an LFSR (Linear Feedback Shift Register) would be the best way to improve performance in random number-dependent programs. LFSRs are known for generating random-like numbers very quickly in hardware.

The plan is to offload the parent selection step to hardware using this LFSR-based approach, and measure whether it leads to a noticeable speedup.

WHY LFSR?

Offloading the random-number generation to a simple, high-throughput LFSR in an FPGA or ASIC means we no longer pay the software overhead of Python's random.random() or random.choice(), and we can generate one (or several) random words every single clock cycle... hopefully.

Figure 8 - LFSR in chiplet

Instead of using the random library, we can replace it with an LFSR to generate the numbers. Using vibe coding, I built an LFSR to plug into the GA.py code so that every call to random.random() is driven by the LFSR from PyRTL.


LFSR code using PyRTL


Figure 9 - 16-bit LFSR implemented in PyRTL (generated with help from ChatGPT)

import pyrtl

# --- 1. Define state register & output ---
lfsr_reg = pyrtl.Register(bitwidth=16, name='lfsr_reg')
lfsr_out = pyrtl.Output(name='lfsr_out')
lfsr_out <<= lfsr_reg

# taps: bits 15, 13, 12, 10 (zero-based) - the classic maximal-length
# 16-bit Fibonacci LFSR polynomial x^16 + x^14 + x^13 + x^11 + 1
tap15 = lfsr_reg[15]
tap13 = lfsr_reg[13]
tap12 = lfsr_reg[12]
tap10 = lfsr_reg[10]
feedback = tap15 ^ tap13 ^ tap12 ^ tap10

# next state: shift left, bring feedback in as the new LSB
# (pyrtl.concat treats its first argument as the most significant bits;
# lfsr_reg[0:15] selects bits 0..14, which become bits 1..15 next cycle)
lfsr_reg.next <<= pyrtl.concat(lfsr_reg[0:15], feedback)

# --- 2. Simulate and collect values ---
# Seed the register with a non-zero start state (an all-zero state
# would lock a XOR-feedback LFSR at zero forever)
sim = pyrtl.Simulation(register_value_map={lfsr_reg: 0xACE1})
random_words = []
for cycle in range(100):         # generate 100 random words
    sim.step({})                 # tick the clock
    random_words.append(sim.inspect('lfsr_out'))

print(random_words[:10])         # first 10 random values



In this project, we use a 16-bit LFSR built in PyRTL. It produces a fast, repeating sequence of numbers that look random but are deterministic - perfect for replacing Python's random() in hardware-friendly ways. It works by shifting bits through a register and applying an XOR feedback from specific tap positions.
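The same shift-and-XOR step can be modeled in plain Python, which is useful as a golden reference when verifying the Verilog later. This is a sketch with the same taps as the PyRTL design; the scaling to [0, 1) is my own random.random() stand-in, not code from GA.py:

```python
def lfsr16_step(state):
    # Taps at bits 15, 13, 12, 10 (zero-based), matching the PyRTL design
    feedback = ((state >> 15) ^ (state >> 13) ^
                (state >> 12) ^ (state >> 10)) & 1
    # Shift left by one, bring the feedback bit in as the new LSB
    return ((state << 1) | feedback) & 0xFFFF

def lfsr16_sequence(seed, n):
    # Run the register for n clocks and record each state
    words, state = [], seed
    for _ in range(n):
        state = lfsr16_step(state)
        words.append(state)
    return words

def lfsr16_random(word):
    # Scale a 16-bit word into [0, 1), as a random.random() stand-in
    return word / 65536.0

words = lfsr16_sequence(0xACE1, 5)
print([hex(w) for w in words])
```

Because the tap set is maximal-length, the sequence only repeats after 65,535 states, so short GA runs never see a cycle.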


Hook it all together

cProfile with LFSR using PyRTL

Figure 10 - cProfile with LFSR using PyRTL

A mistake was made here: this version replaces every call to random.random and random.choice, but I didn't want to replace all of them - only the ones used for parent selection.


Sat May 10 23:02:46 2025    profile_after1.prof

         6284170 function calls (6255459 primitive calls) in 2.503 seconds

   Ordered by: cumulative time
   List reduced from 488 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     36/1    0.000    0.000    2.503    2.503 {built-in method builtins.exec}
        1    0.000    0.000    2.503    2.503 GA_LSFR.py:1(<module>)
        1    0.006    0.006    2.423    2.423 GA_LSFR.py:70(main)
    28408    0.019    0.000    2.369    0.000 GA_LSFR.py:28(hw_rand)
    28408    0.218    0.000    2.317    0.000 C:\Users\Skyline V2\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyrtl\simulation.py:188(step)
     3780    0.020    0.000    1.731    0.000 GA_LSFR.py:55(mate)
   482936    0.839    0.000    1.692    0.000 C:\Users\Skyline V2\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyrtl\simulation.py:432(_execute)
  1278977    0.287    0.000    0.445    0.000 C:\Users\Skyline V2\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyrtl\wire.py:157(__hash__)
   482936    0.131    0.000    0.238    0.000 C:\Users\Skyline V2\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyrtl\simulation.py:441(<genexpr>)
   482936    0.140    0.000    0.233    0.000 C:\Users\Skyline V2\AppData\Local\Programs\Python\Python312\Lib\site-packages\pyrtl\simulation.py:422(_sanitize)
Figure 11 - GA.py with LFSR viewed in SnakeViz

Data transfers/communication

Don't forget about SPI

What is SPI? Why SPI? SPI code?
SPI (Serial Peripheral Interface) is a simple full-duplex synchronous serial bus: the master drives a chip-select (cs), a clock (sclk), and a data-out line (mosi), and reads data back on miso. This Verilog module, spi_slave3, is an SPI slave interface that outputs pseudo-random values generated by an LFSR (Linear Feedback Shift Register). The code allows an SPI master to request 8-bit chunks of a random 16-bit LFSR value from a hardware SPI slave (ChatGPT).

module spi_slave3 (
    input wire clk,
    input wire cs,
    input wire sclk,
    input wire mosi,
    input wire [15:0] lfsr_seed,
    output reg miso
);

    reg [7:0] tx_buffer = 8'b0;
    reg [2:0] bit_cnt = 0;
    reg seeded = 0;
    reg load_seed = 0;

    wire [15:0] lfsr_out;

    // Instantiate the external LFSR
    pyrtl_lfsr lfsr_inst (
        .clk(clk),
        .load_seed(load_seed),
        .seed_in(lfsr_seed),
        .lfsr_out(lfsr_out)
    );

    initial begin
        miso = 0;
    end

    // SPI Receive tracking
    always @(posedge sclk) begin
        if (!cs)
            bit_cnt <= bit_cnt + 1;
    end

    // SPI Transmit
    always @(negedge sclk) begin
        if (!cs && bit_cnt < 8)
            miso <= tx_buffer[7 - bit_cnt];
        else
            miso <= 0;
    end

    // LFSR seeding and output capture
    always @(posedge clk) begin
        if (!seeded && !cs) begin
            load_seed <= 1;
            seeded <= 1;
        end else begin
            load_seed <= 0;
        end

        if (bit_cnt == 0 && seeded)
            tx_buffer <= lfsr_out[15:8];  // You can change to [7:0] if you want
    end

endmodule
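To see how the 16-bit LFSR word turns into 8-bit SPI chunks, here is a pure-software model of the transaction: seed the LFSR, then repeatedly transmit the upper byte (lfsr_out[15:8], as in the tx_buffer assignment above), stepping the LFSR between pulls. This is an illustrative sketch, not the actual test_spi4.py:

```python
def lfsr16_step(state):
    # Same taps as the hardware LFSR: bits 15, 13, 12, 10 (zero-based)
    fb = ((state >> 15) ^ (state >> 13) ^ (state >> 12) ^ (state >> 10)) & 1
    return ((state << 1) | fb) & 0xFFFF

def spi_pull_bytes(seed, n):
    """Seed the LFSR, then pull n bytes over (modeled) SPI.
    Each transfer sends the upper byte lfsr_out[15:8], mirroring the
    tx_buffer assignment in spi_slave3; the LFSR then advances."""
    state, out = seed, []
    for _ in range(n):
        out.append((state >> 8) & 0xFF)   # master clocks out 8 bits, MSB first
        state = lfsr16_step(state)        # LFSR steps between transfers
    return out

print([f"{b:02x}" for b in spi_pull_bytes(0xACE1, 5)])
```

A model like this is handy as the expected-value generator in a cocotb testbench: the Python side computes what the slave should shift out, then compares it against the sampled miso bits.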

🌱 Don't Forget the Seed!

To improve randomness and reduce predictability, I used software to feed a fresh random seed into the LFSR before starting each run. This makes the output vary from run to run, even though the LFSR sequence itself is deterministic.

Testing flow

Figure 12 - Below is the flow diagram of all tools used to co-simulate the LFSR and SPI using Python and Verilog

You run:           make
        ↓
Makefile calls:    cocotb
        ↓
Cocotb launches:   Icarus Verilog (simulator)
        ↓
Icarus compiles and runs:
    - spi_slave.v (your Verilog module)
    - hooks in cocotb's Python logic
        ↓
Cocotb executes:
    - test_spi.py (your Python test code)
    - this drives & checks SPI behavior live

Verification

Before moving into the OpenLane physical design flow, we completed the following key steps to ensure the logic was correct and ready for ASIC synthesis:

| Component | Status |
|---|---|
| Verilog LFSR | ✅ Working & seeded by Python |
| SPI Interface | ✅ Proper bit-level transfer |
| Cocotb Testbench | ✅ Clean pass, prints internal states |
| GA.py Integration | ✅ Uses HW RNG for mating only |
| Variable Results | ✅ Confirmed randomized behavior across runs |

Output

Figure 13 - Below shows the co-simulation of the SW and HW LFSR sending five PRNG values to Python

Seeded LFSR with: 00001100
LFSR Output  test_spi4.py[0]: 00001100
LFSR Output  test_spi4.py[1]: 00001100
LFSR Output  test_spi4.py[2]: 00011001
LFSR Output  test_spi4.py[3]: 00110010
LFSR Output  test_spi4.py[4]: 01100100
  1070.00ns INFO     cocotb.regression                  spi_send_receive passed
  1070.00ns INFO     cocotb.regression                  **************************************************************************************
                                                        ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                                        **************************************************************************************
                                                        ** test_spi4.spi_send_receive     PASS        1070.00           0.00     436015.69  **
                                                        **************************************************************************************
                                                        ** TESTS=1 PASS=1 FAIL=0 SKIP=0               1070.00           0.01     190731.38  **
                                                        **************************************************************************************

Figure 14 - Improved Performance Summary of SPI-LFSR Hardware RNG Integration in GA Simulation

HW RNG (mating): 184
HW RNG (mating): 238
Generation: 108 String: hello HW410 classW      Fitness: 1
✅ Done at Generation 109: hello HW410 class!
total RNG used: 19440
🕒 Wall time for GA using HW RNG: 0.1070 seconds
🎲 Seed used for LFSR: 22621
🧮 Final simulated time: 116030.0 ns
🕒 Wall time for 500 SPI pulls: 0.4887 seconds
⚡ Avg time per SPI+LFSR: 0.000977 sec
116030.00ns INFO     cocotb.regression                  spi_send_receive passed
116030.00ns INFO     cocotb.regression                  **************************************************************************************
                                                        ** TEST                          STATUS  SIM TIME (ns)  REAL TIME (s)  RATIO (ns/s) **
                                                        **************************************************************************************
                                                        ** test_spi4.spi_send_receive     PASS      116030.00           0.60     194673.36  **
                                                        **************************************************************************************
                                                        ** TESTS=1 PASS=1 FAIL=0 SKIP=0             116030.00           0.60     192678.14  **
                                                        **************************************************************************************

This figure displays the outcome of a cocotb-based simulation that integrates a hardware-modeled LFSR random number generator with an SPI interface. The Genetic Algorithm (GA) uses 8-bit values obtained over SPI for parent selection only.

Results

After completing synthesis and physical design in OpenLane, we collected the following results for the integrated spi_lfsr design, generated and tested with guidance from ChatGPT. The goal was to prove that offloading parent selection to hardware using a 16-bit LFSR via SPI would improve runtime efficiency in a Genetic Algorithm (GA).

| Metric | Software Only | Hardware RNG (SPI + LFSR) | Notes |
|---|---|---|---|
| Parent Selection Time (per call) | ~45 µs | ~16 µs | Measured using Python time and SPI latency |
| Full GA Runtime (G generations) | ~380 ms | ~290 ms | Improvement from offloading RNG |

| Time Code | Description | Measured Value |
|---|---|---|
| T3 | Software-only selection time | ~45 µs |
| T5 | SPI transfer latency (to LFSR) | ~7 µs |
| T6 | Hardware RNG LFSR generation time | ~1 µs |
| T7 | SPI read-back & Python processing | ~8 µs |
| T5+T6+T7 | Total HW RNG time (Selection) | ~16 µs |
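Plugging the measured values from the tables into the decision rule from the project goal gives a quick sanity check:

```python
# Measured values from the tables above (in seconds)
t3 = 45e-6                  # software-only selection, per call
t5, t6, t7 = 7e-6, 1e-6, 8e-6
hw = t5 + t6 + t7           # total hardware RNG path, ~16 us

assert hw < t3              # the T5 + T6 + T7 < T3 condition holds
print(f"Per-call selection speedup: {t3 / hw:.1f}x")
print(f"Full-run speedup: {380 / 290:.2f}x")
```

The per-call win (~2.8x) is much larger than the end-to-end win (~1.3x), which is expected: only the selection step was offloaded, so the rest of the GA still runs at software speed.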

OPENLANE

I had never used OpenLane before, so I heavily relied on ChatGPT to walk me through each step of the digital ASIC flow. From synthesis to GDSII generation, I followed ChatGPT's step-by-step instructions, including recommended settings for clock period, placement density, and routing layers. This allowed me to complete a full RTL-to-layout implementation of my spi_lfsr design, even as a first-time user. The table further below lists standard values and metrics often referenced in ASIC synthesis and layout, especially for small designs like an LFSR or SPI block.

Figure 15 - Yosys flowchart for the LFSR; I think it looks cool

Figure 16 - Yosys flowchart for the 16-bit SPI; I think it looks cool too

Figure 17 - Physical layout of the spi_lfsr design as generated by OpenLane and viewed in KLayout!


| Spec / Metric | Example Value | Description |
|---|---|---|
| CLOCK_PERIOD | 10.0 ns | Target clock period (i.e., 100 MHz) |
| DESIGN_NAME | pyrtl_lfsr | Name of your Verilog top module |
| FP_CORE_UTIL | 40 (percent) | Target core area utilization |
| PL_TARGET_DENSITY | 0.60 | Placement density target |
| DIE_AREA | auto-generated or custom | Die boundary (can be user-defined) |
| CORE_AREA | auto-generated or custom | Core logic region inside die |
| VERILOG_FILES | pyrtl_lfsr.v | Your synthesized LFSR Verilog file |
| TOP_MODULE | pyrtl_lfsr | Top-level module to place and route |
| STD_CELL_LIBRARY | sky130_fd_sc_hd | Standard cell library (Sky130) |
| SYNTH_STRATEGY | AREA 0 or DELAY 1 | Synthesis optimization priority |
| GLB_RT_MAXLAYER | 5 | Max metal layer used for routing |

✅ Proof of Concept Success

✅ Offloading selection-only RNG to hardware reduces execution time.

✅ LFSR is a valid, low-cost solution for random number generation in embedded GAs.

✅ The SPI + LFSR module is small, power-efficient, and timing-clean at 100 MHz.

✅ Hardware acceleration is only worthwhile if T5 + T6 + T7 < T3, which was achieved.

Future work and reflections:

📚 What I Learned

  1. I learned how to build a working hardware accelerator pipeline from scratch, starting in Python and PyRTL, and ending with a physically laid-out ASIC in OpenLane.

  2. I now understand the value of profiling (cProfile + snakeviz) to find real bottlenecks before diving into hardware.

  3. I gained experience writing Cocotb testbenches, simulating SPI, and verifying logic at both RTL and layout levels.

  4. Most importantly, I learned that focusing on one well-defined function (like parent selection) made hardware design feel manageable.

This project started with curiosity and no prior OpenLane or ASIC experience. With help from ChatGPT and open-source tools, I now have a complete, working proof-of-concept and I'm excited to keep learning and building from here!

โš ๏ธ **GitHub.com Fallback** โš ๏ธ