HDL Design
My choice of HDL is SystemVerilog (SV), since I am accustomed to SV from prior classes. To begin development, I needed to fully understand how this algorithm works. From searching online, I found a very useful website by dmarman that visualizes the SHA-256 algorithm.
Website Link: https://sha256algorithm.com/
General Algorithm
The basis of SHA-256 is that you take an input message and load it into a message block. The message block is padded to a length that is a multiple of 512 bits: a single 1 bit is appended after the message, followed by zeros, and the last 64 bits of the block hold the total length of the original message as a big-endian integer. The padded message is then broken up into chunks of 512 bits (16 32-bit words). Each chunk is loaded into a message schedule that holds 64 words, and an extension algorithm uses the first 16 words to produce the remaining 48. Once the extension is done, the algorithm goes word-by-word through the message schedule and uses a variety of functions to update a set of working variables initialized with the current hash values. Once all 64 words have been processed, the working variables are added to the current hash values. If there are any other chunks in the queue, those chunks are processed in the same way as the first, with the working variables added into the same running hash values at the end of each chunk. After all chunks have been computed, the eight hash values are concatenated together to produce the 256-bit hash.
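For reference, the extension step mentioned above boils down to a few bit-mixing helpers. Below is a minimal SystemVerilog sketch of the standard algorithm's functions (illustrative only, not the repository's code):

```systemverilog
// Rotate-right, the building block for all SHA-256 mixing functions.
function automatic logic [31:0] rotr(input logic [31:0] x, input int n);
    return (x >> n) | (x << (32 - n));
endfunction

// Schedule extension:
//   w[i] = w[i-16] + s0(w[i-15]) + w[i-7] + s1(w[i-2])
function automatic logic [31:0] s0(input logic [31:0] x);
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3);
endfunction

function automatic logic [31:0] s1(input logic [31:0] x);
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10);
endfunction
```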
Algorithm to HDL Description
Based on this description of SHA-256, I know that a top module is needed to serve as a container for all the other modules. Inside this top module, I need a controller to prep the message block and some kind of loop to iterate through each of its 512-bit chunks.
Iteration #1: Initial Design
Before I settled on a good final design for the HDL description, I first made two modules, the message controller and the compression loop, using the following prompt:
Generate me a full implementation of SHA-256 in SystemVerilog using multiple submodules. Create a memory buffer in the top module to contain the message buffer. Create the following modules:
A message controller that takes in 8-bit chunk inputs to save to the memory buffer per clock until the end of file is reached. Keep count of the number of bits saved using a counter. Once the end of file is reached, append a 1 to the buffer, pad the memory buffer, and save the counter value to the end of the message buffer as a 64-bit big-endian integer to extend the block to a multiple of 512. Compute the number of blocks present in the memory module and output an 8-bit signal indicating the number of blocks present. When the compression loop requests memory from the memory buffer with an address (word addressable), output 32 bits per cycle (read 4 bytes at a time).
A compression loop that loads in 16 32-bit words from the memory buffer to a new memory module formatted for 64 words total (32 bits per word). Iterate the loop per block indicated by the message controller. Calculate the rest of the message schedule based on the SHA256 algorithm. For every line in the message schedule, update the working variables using k constants and helper functions. Once all 64 words are iterated through, check if there are any other blocks available from the message controller. Iterate through the loop again with those values. Once all blocks have been processed, append the working hash variables into one 256-bit value and return this to the top module.
As for the testbench, I asked the LLM the following prompt:
Generate me a testbench to test my SHA256 module in SystemVerilog that takes an example string and passes each character of the string to the SHA256 dut every clock with the data valid signal asserted. Once the end of the string is reached, signal the end of file signal high. Keep cycling until the hash_valid signal is asserted high and save the data outputted from the dut.
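A testbench along these lines might look like the sketch below. The module name sha256_top and the port names (data_in, data_valid, eof, hash_valid, hash_out) are assumptions about the interface, not taken from the repository:

```systemverilog
module tb_sha256;
    logic         clk = 0, rst = 1;
    logic [7:0]   data_in;
    logic         data_valid = 0, eof = 0;
    logic         hash_valid;
    logic [255:0] hash_out;

    sha256_top dut (.*);   // assumed top-level module and port names

    always #5 clk = ~clk;

    initial begin
        string msg = "abc";
        repeat (2) @(posedge clk);
        rst = 0;
        // Stream one character per clock with data_valid asserted.
        for (int i = 0; i < msg.len(); i++) begin
            @(posedge clk);
            data_in    <= msg[i];
            data_valid <= 1;
        end
        @(posedge clk);
        data_valid <= 0;
        eof        <= 1;        // signal end of file
        @(posedge clk);
        eof        <= 0;
        // Keep cycling until hash_valid asserts, then save the result.
        wait (hash_valid);
        @(posedge clk);
        $display("hash = %h", hash_out);
        $finish;
    end
endmodule
```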
However, I found some problems with this implementation once I moved to simulation, which is detailed in my log: Link.
After a lot of manual troubleshooting with the assistance of an LLM, I was able to get a design that could produce the expected hash value.
HDL Design Layout for Iteration #1
The design is laid out as a top module that instantiates the message controller and the compression loop.
In the message controller, there is a message buffer register that holds the entire message. The controller module is responsible for taking characters into the msg_buf register, appending a 1 to the message, filling the remainder with 0s, and leaving a section at the end for the 64-bit integer that holds the total message length. The end result is a message block with a length that is a multiple of 512.
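As a rough illustration of the padding bookkeeping (signal names here are illustrative, not the repository's):

```systemverilog
logic [63:0] msg_bits;    // running count of message bits received
logic [7:0]  num_blocks;  // number of 512-bit blocks after padding

// A 1 bit plus a 64-bit length field must fit after the message, so
// round (msg_bits + 65) up to the next multiple of 512.
assign num_blocks = 8'((msg_bits + 65 + 511) >> 9);

// On end-of-file: write 8'h80 after the last byte, zero-fill the rest,
// then store msg_bits big-endian in the final 8 bytes of the last block.
```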
The compression loop waits until the message controller is ready, then requests data by address to load a chunk of 16 32-bit values (512 bits) into a message register array. It then uses those 16 values to extend the schedule to a total of 64 32-bit values, using helper functions to compute the new values and complete the schedule. The compression loop then initializes the working variables with the initial hash values and, for each entry in the schedule, fetches the corresponding k-value stored in the k_rom register array, computes some temporary values, and updates the working variables. At the end of the schedule, the working variables are added to the initial hash values, and if any other blocks need processing, the cycle starts again on the remaining blocks. Once all blocks are finished, the final output is produced by concatenating the hash values.
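One round of the working-variable update looks roughly like the sketch below, reusing rotr from the earlier sketch (module context and clock are omitted; the names a..h, k_rom, w, and round are illustrative rather than the repository's):

```systemverilog
logic [31:0] a, b, c, d, e, f, g, h;   // working variables
logic [31:0] w [0:63];                 // message schedule
logic [31:0] k_rom [0:63];             // round constants
logic [5:0]  round;                    // current round, 0..63
logic [31:0] t1, t2;

always_comb begin
    t1 = h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))   // S1(e)
           + ((e & f) ^ (~e & g))                       // Ch(e,f,g)
           + k_rom[round] + w[round];
    t2 =     (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))   // S0(a)
           + ((a & b) ^ (a & c) ^ (b & c));             // Maj(a,b,c)
end

always_ff @(posedge clk) begin
    h <= g;  g <= f;  f <= e;  e <= d + t1;
    d <= c;  c <= b;  b <= a;  a <= t1 + t2;
end
```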
Iteration #2: Fixing Array Lengths
After attempting synthesis with this design to see if I could get some useful metrics, I encountered issues with routing congestion. OpenLane seemed to struggle a lot at this stage of the flow, and I needed to find out why this was happening. I asked the LLM the following prompt in my investigation:
I am attempting to synthesize my design using openlane2, but it fails due to high routing congestion. Is there something in my design that I should optimize? Or what openlane2 options should I add to optimize the process?
The LLM suggested several possible optimization steps.
I was interested in the macro path, so I asked the LLM:
Does Openlane2 have support to create block RAMs or something like a ROM?
I did some online searching for OpenRAM, and it seems to be a tool for creating memory macros that I could use in OpenLane. I expand on this more in my HDL Synthesis section, which can be found here. As a quick summary of the events described in that section: I could not get OpenRAM to generate macros for custom-sized configurations (others have had this issue as well), and its ROM creation is still under development.
Instead of looking for options on the synthesis side, I wanted to see if I could optimize parts of my design. Going back to my code, I realized that I used a lot of large register arrays to hold things such as my message buffer, message schedule, and k-value ROM. I suspected that synthesizing the logic needed to control these large arrays was too much for OpenLane.
To fix this, I changed these large arrays into separate clocked RAM modules and used a case statement for the ROM. To implement this, I had to change the way my program used these resources, so that the logic is delayed by a cycle when retrieving data from the arrays. Most of the troubleshooting was done by hand with the assistance of an LLM, which is described here. By implementing these changes, I was able to synthesize my design with a minimum clock period of 15 ns.
HDL Design for Iteration #2
The major change in Iteration #2 is that all large register arrays are replaced with clocked RAM/ROM modules to help with the congestion issues. The message buffer registers are replaced by msg_buf, a 1W/1R 8x1024 RAM module; the message schedule registers are replaced with a 1W/1R 32x64 RAM module; and the k-value registers are replaced with a lookup table implemented as a case statement.
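To give a sense of the change, here are hedged sketches of the replacement memories. Module and signal names are illustrative, with widths following the text above; this is not the repository's exact code:

```systemverilog
// 1W/1R synchronous RAM: one write port, one registered read port.
module ram_1w1r #(parameter int DW = 8, parameter int AW = 10) (
    input  logic          clk,
    input  logic          we,
    input  logic [AW-1:0] waddr,
    input  logic [AW-1:0] raddr,
    input  logic [DW-1:0] wdata,
    output logic [DW-1:0] rdata
);
    logic [DW-1:0] mem [0:(1<<AW)-1];
    always_ff @(posedge clk) begin
        if (we) mem[waddr] <= wdata;
        rdata <= mem[raddr];   // data arrives one cycle after the address
    end
endmodule

// k-constant lookup as a case statement (first entries shown; 64 total).
logic [5:0]  k_addr;
logic [31:0] k_val;
always_comb begin
    case (k_addr)
        6'd0:    k_val = 32'h428a2f98;
        6'd1:    k_val = 32'h71374491;
        6'd2:    k_val = 32'hb5c0fbcf;
        6'd3:    k_val = 32'he9b5dba5;
        // ... remaining 60 constants elided ...
        default: k_val = '0;
    endcase
end
```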
Iteration #3: Even/Odd RAM, Dual Read Ports, and Clocked Hash Output
For my third iteration, the part of my simulation that took the most time was the extension of the schedule in the compression loop. This section requires iterating 48 times to complete the 64-item array for the next stage of compression. In the algorithm, calculating the next item in the schedule requires 4 items from previous rows: w[i-16], w[i-15], w[i-7], and w[i-2]. The issue is that the w_ram module has only 1 read port, so I needed 4 read cycles to store the operands into registers, plus 1 cycle to compute and write the next item in the schedule. To solve this, I implemented two things: splitting the RAM into even/odd halves, and dual read ports. I noticed that the indices needed for extension consist of 2 even and 2 odd offsets from the current item, so I could split the memory into two smaller w_rams of half size, one holding the even items and the other holding the odd items of the schedule. The split effectively gives me a second read port to do the reads in parallel, which reduces the required number of clocks from 5 down to 3.
To reduce the required clocks to 2 per iteration, I also added a second read port to each RAM module. This allows 4 reads of the message schedule in one clock cycle, which reduces the total to 2 clocks per iteration (1 read clock and 1 compute clock).
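A hedged sketch of the resulting two-cycle extension is shown below for even i (for odd i the two halves simply swap roles). It reuses s0/s1 from the earlier sketch, and all names are illustrative:

```systemverilog
logic [6:0]  i;                 // schedule index, counts 16..63
logic [31:0] w_even [0:31];     // holds w[0], w[2], ..., w[62]
logic [31:0] w_odd  [0:31];     // holds w[1], w[3], ..., w[63]
logic [31:0] rd_a, rd_b, rd_c, rd_d;

// Cycle 1: four parallel reads, two ports on each half.
always_ff @(posedge clk) begin
    rd_a <= w_even[(i - 16) >> 1];   // w[i-16]
    rd_b <= w_even[(i - 2)  >> 1];   // w[i-2]
    rd_c <= w_odd [(i - 15) >> 1];   // w[i-15]
    rd_d <= w_odd [(i - 7)  >> 1];   // w[i-7]
end

// Cycle 2: w[i] = w[i-16] + s0(w[i-15]) + w[i-7] + s1(w[i-2])
always_ff @(posedge clk)
    w_even[i >> 1] <= rd_a + s0(rd_c) + rd_d + s1(rd_b);
```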
As for the clocked hash output, I had to change my output to be segmented, since a 256-bit bus is somewhat difficult to route. The logic primarily required adding hash valid and acknowledge signals from my compression loop, so the top module knows when the hash is ready to be registered, at which point it begins sending out the data in 32-bit chunks. The reason for this was mainly compatibility with SPI, which typically uses smaller bus widths such as 8, 16, and 32 bits.
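The word selection itself can be as simple as the following fragment (names illustrative):

```systemverilog
logic [255:0] hash_reg;    // full hash, registered when hash valid asserts
logic [2:0]   word_sel;    // which 32-bit word is currently driven out
logic [31:0]  hash_word;

// Select the word_sel-th 32-bit chunk, most significant word first.
assign hash_word = hash_reg[255 - 32*word_sel -: 32];
```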
Iteration #4: Multi-Compression Loops
With the previous iteration, I assumed that my hardware was in an okay position. But after finding the flaw in my profiling strategy, I saw that I needed to make my hardware nearly 5 times faster (considering my current 15 ns clock period on a 1015 message test). To do this, I needed to look into using multiple compression loops. One property of the algorithm that should allow this is that the addition of the initial hash values with the working variables should be a commutative operation. Thus, in theory, by adding more compression loops to compute all the blocks in parallel, I can achieve a much faster computation of the hash through parallelization. To do this, I asked the LLM to look at my current codebase with the following prompt:
Create a controller that can task each compression loop with a block of 512 characters from the message controller's message buffer. The message controller should provide data to each loop, and continue to provide blocks to empty loops until all blocks have been processed. Make a control layer between the start logic for the compression loops so that when the message controller is done, it provides data with the appropriate start signal to the corresponding loop. When the hash value is done from a compression loop and is outputted, capture the addition of the output to the h values segmented evenly between each value. Change any logic elements as needed to implement the control logic.
This generated a new FSM in my top module, which I used as a starting point. My first goal was to understand how these compression loops needed to be connected. Currently, I only have a hash output and 2 signals that provide validity and acknowledgement from each compression loop. I saw that I needed to use these signals to obtain the hash value when a loop is done and add it to the initial hash. The general scheme of the algorithm provided by the LLM was a round-robin, where it assigns tasks to each loop and wraps back to the front to wait for an output. I used a one-hot select signal to point to one loop in the system, as sketched below. As for the number of compression loops, I started with 4, since adding more loops may not speed things up due to every loop depending on the single message controller. In theory, this should work, but I found that the hash output was not aligning with the expected hash.
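The one-hot round-robin pointer can be pictured as follows (a hedged fragment; signal names and the loop count are illustrative):

```systemverilog
localparam int N_LOOPS = 4;
logic [N_LOOPS-1:0] select;   // one-hot pointer to the active loop
logic [N_LOOPS-1:0] start;
logic               advance, block_ready;

always_ff @(posedge clk) begin
    if (rst)          select <= 'b1;                     // start at loop 0
    else if (advance) select <= {select[N_LOOPS-2:0],    // rotate left,
                                 select[N_LOOPS-1]};     // wrapping around
end

// Pulse start on the selected loop while the controller streams its block.
assign start = block_ready ? select : '0;
```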
At this point, I found that my assumption about the algorithm was wrong. Using the SHA256 visualizer, I realized that the hash values need to be maintained across each iteration of the message loop: when the working variables are added to the hash values, the output of this operation needs to be the initial values for the next block. After considering some options, I figured that I could keep the current round-robin method, but instead of retrieving the hash value into the top module after each block, I would initialize the hash into the first loop and have each loop's output hash go straight to the next loop. Once the compression loops are done iterating, the loop after the one selected by the one-hot signal is the one from which the top module retrieves the final hash. After many attempts to get the HDL to run properly, I was surprisingly able to get an output hash that matched the expected value.
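The chaining itself amounts to wiring each loop's running hash into the next loop's initial hash, wrapping around; a hedged fragment (names illustrative, initialization of the very first block's hash constants omitted):

```systemverilog
localparam int N_LOOPS = 4;
logic [255:0] h_init [N_LOOPS];   // initial hash fed into each loop
logic [255:0] h_out  [N_LOOPS];   // running hash produced by each loop

// Each loop starts from the hash produced by the loop before it,
// wrapping from the last loop back to the first.
for (genvar n = 0; n < N_LOOPS; n++) begin : g_chain
    assign h_init[n] = h_out[(n + N_LOOPS - 1) % N_LOOPS];
end
```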
There was one odd thing about this implementation: it could not compute a single block. Testing this in the testbench resulted in odd behavior where the loop finalized the hash result right away. To fix this, I added a special state to my FSM called SINGLE for this particular case.
When I was synthesizing this design, I encountered a circular (combinational loop) error, as described in this section. My use of the read address was not ideal: I wanted the design to return data in the same cycle an address is applied. To fix this, I had to register the req_word flag from the target compression loop in the message controller. Once the message controller sees this registered flag, it fetches the word at the provided address. I then added a delay flag so the compression loop lines up with the data fetched from the message controller.
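The fix amounts to a one-cycle registered handshake, roughly like this sketch (req_word follows the text above; mem and the widths are illustrative):

```systemverilog
logic        req_word_q;    // registered request flag from the loop
logic        data_ready;    // delay flag back to the compression loop
logic [9:0]  raddr_q;
logic [31:0] rdata;

always_ff @(posedge clk) begin
    req_word_q <= req_word;        // capture the loop's request
    raddr_q    <= raddr;           // and the address it supplied
    if (req_word_q)
        rdata <= mem[raddr_q];     // serve the read one cycle later
    data_ready <= req_word_q;      // tells the loop when rdata is valid
end
```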