Project: Final Results and Performance Metrics - tnl3pdx/ece510-HwAIML GitHub Wiki

Metrics to Focus On

I will focus mainly on the signing stage of the SPHINCS+ program, because it is the longest of the 3 stages. Generating the public/private keys and verifying that a signature is valid take much less time by comparison, so the results breakdown concentrates on signing. These are the metrics needed to evaluate the results of my implementation:

Software

Total execution time of the hash function for a single signing (how long the hash function takes during execution)
Average time for a hash call (on average, how long it takes for a call to finish)
Number of hash calls per signing (total number of calls to the hash function)
List of unique message lengths passed to the hash function (used to determine how many blocks need to be allocated for SHA-256)
Distribution of call counts by unique message length (how many times a 64 or 80 character message has been passed through)

Hardware (HDL)

Maximum clock frequency based on synthesis via OpenLane
Execution times for each unique message length (different numbers of message blocks will affect the execution time of a single call)
Communication cost between hardware and software layers

Results for Software

Based on my prior profiling of the project, I found that the hash function took around 14 seconds in total when signing the test.txt file located in Challenge 9's folder. On average, each call to the hash function took 8.6 us, and the function was called 1734375 times.
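As a quick sanity check, the total time can be recomputed from the call count and the average time per call. The sketch below uses only the two figures quoted above; the function and constant names are illustrative and do not come from the project code.

```python
# Figures quoted in the profiling results above.
HASH_CALLS = 1_734_375      # hash calls per signing
AVG_CALL_TIME_US = 8.6      # average time per hash call, in microseconds


def total_hash_time_s(calls: int, avg_us: float) -> float:
    """Total time spent inside the hash function, in seconds."""
    return calls * avg_us * 1e-6


total_s = total_hash_time_s(HASH_CALLS, AVG_CALL_TIME_US)
print(f"total hash time: {total_s:.2f} s")
```

The product comes out to roughly 14.9 seconds, which is in line with the profiled total of around 14 seconds.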

The SPHINCS+ algorithm only uses 4 different message lengths: 64, 80, 208, and 608. The majority of the hash calls are concentrated in the 64 and 80 message lengths, each of which takes up 2 blocks in the SHA-256 computation.
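The block counts follow from standard SHA-256 padding: one 0x80 byte and an 8-byte length field are appended, and the result is padded up to a multiple of 64 bytes. A minimal sketch, assuming the four message lengths are given in bytes (the helper name is hypothetical):

```python
def sha256_blocks(msg_len_bytes: int) -> int:
    """Number of 64-byte blocks SHA-256 processes for a message of this length."""
    # Message + 1 byte (0x80 marker) + 8 bytes (message length in bits).
    padded_min = msg_len_bytes + 1 + 8
    # Round up to the next multiple of the 64-byte block size.
    return (padded_min + 63) // 64


for length in (64, 80, 208, 608):
    print(f"{length:>3} bytes -> {sha256_blocks(length)} blocks")
```

Under that assumption, the 64 and 80 byte messages both need 2 blocks, while the 208 and 608 byte messages need 4 and 10 blocks respectively.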

Results for Hardware

Clock Speed of HDL

Using the results from my synthesis via OpenLane in Challenge 21, the minimum clock period of my design is 15 ns, which corresponds to a maximum clock frequency of about 66.7 MHz (1 / 15 ns).

Execution times for Different Block Sizes

The individual execution times for each message length were simulated using QuestaSim. Since this is hardware, the timing does not vary between runs, so only one iteration per message length is needed to confirm these findings. For the 64 and 80 character tests, the execution time for each is 5.621 us, which is about 35% lower than the software's average execution time of 8.6 us per call.

Percentage Differences between Iterations

For a 1015 message test, the execution times for each iteration of this project are as follows.

From Iteration #1/#2 to #3, this was a 36.75% reduction in execution time. From Iteration #3 to #4, this was a 43.09% reduction in execution time. From Iteration #1/#2 to #4, this was a 64.01% reduction in execution time.
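The overall reduction can be cross-checked by compounding the two step reductions, since each one applies to the time remaining after the previous step. A minimal sketch using only the percentages quoted above (function names are illustrative):

```python
def combined_reduction(*reductions_pct: float) -> float:
    """Compound successive percentage reductions into one overall reduction."""
    remaining = 1.0
    for pct in reductions_pct:
        remaining *= 1.0 - pct / 100.0   # fraction of time left after this step
    return (1.0 - remaining) * 100.0


def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a percentage reduction in time into a speedup factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


overall = combined_reduction(36.75, 43.09)
print(f"overall reduction: {overall:.2f}%")
print(f"equivalent speedup: {speedup_from_reduction(64.01):.2f}x")
```

Compounding the rounded percentages gives about 64.00%, which matches the reported 64.01% to within rounding.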

Communication Time

The communication time between HW and SW was modeled using an SPI bus, which sends 8 bits from the controller to the peripheral and 32 bits from the peripheral to the controller. This was tested using cocotb, which can be found in Challenge 25. The measured latency and throughput of this SPI setup were 0.705 us and 22695.04 kbps, respectively.

For a rough estimate of the total additional communication cost between SW and HW, two parts of each SHA-256 HDL call are affected: the transmission of the entire message and the latency to start a transaction.

For throughput, the added time is calculated by taking the number of bytes processed per signing and dividing it by the throughput of the SPI bus. This results in a total of 41.37 seconds, which is nearly 3 times the execution time of the hash function on SW.

As for latency, assuming that each hash call needs 1 call to start a transaction and 1 call to receive the output hash, the latency adds up to 165.51 seconds for a single signing.

The total communication time per signing is 206.89 seconds.
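The throughput estimate and the final sum can be sketched as below. This is an illustration only: it assumes 1 kbps means 1000 bits per second, and because the number of bytes per signing is not stated here, the sketch back-calculates the byte count implied by the reported 41.37 seconds rather than using a measured value. All names are hypothetical.

```python
SPI_THROUGHPUT_KBPS = 22_695.04   # reported SPI throughput
THROUGHPUT_TIME_S = 41.37         # reported transfer time per signing
LATENCY_TIME_S = 165.51           # reported latency cost per signing


def transfer_time_s(total_bytes: float, throughput_kbps: float) -> float:
    """Time to move total_bytes over the bus (assumes 1 kbps = 1000 bit/s)."""
    return total_bytes * 8 / (throughput_kbps * 1e3)


def implied_bytes(transfer_s: float, throughput_kbps: float) -> float:
    """Inverse of transfer_time_s: bytes implied by a given transfer time."""
    return transfer_s * throughput_kbps * 1e3 / 8


# Back-calculated, not measured: bytes per signing implied by the 41.37 s figure.
bytes_per_signing = implied_bytes(THROUGHPUT_TIME_S, SPI_THROUGHPUT_KBPS)
total_comm_s = THROUGHPUT_TIME_S + LATENCY_TIME_S

print(f"implied bytes per signing: {bytes_per_signing:,.0f}")
print(f"total communication time: {total_comm_s:.2f} s")
```

Summing the two rounded components gives 206.88 seconds; the 0.01 second difference from the reported total is consistent with rounding of the individual terms.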

Final Results

Using the data provided in the previous 2 sections, the speedup between the software and hardware implementations of SHA-256 without considering communication time is:

With communication time included, the speedup is:
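As a rough cross-check, both speedups can be estimated from the figures quoted in the earlier sections. The sketch below is an approximation under stated assumptions, not the measured result: it assumes every hash call costs the 5.621 us simulated for the 64 and 80 character messages (the 208 and 608 messages would take longer), and it uses the 8.6 us software average and the 206.89 second communication total as given. Variable names are illustrative.

```python
HASH_CALLS = 1_734_375     # hash calls per signing
SW_AVG_CALL_US = 8.6       # software average time per call
HW_CALL_US = 5.621         # simulated hardware time (64/80 character messages)
COMM_TIME_S = 206.89       # total communication time per signing

# Total hash time per signing for each implementation, in seconds.
sw_total_s = HASH_CALLS * SW_AVG_CALL_US * 1e-6
# Assumption: all calls take the 64/80 character time in hardware.
hw_total_s = HASH_CALLS * HW_CALL_US * 1e-6

# Speedup = software time / hardware time (values above 1 favor hardware).
speedup_no_comm = sw_total_s / hw_total_s
speedup_with_comm = sw_total_s / (hw_total_s + COMM_TIME_S)

print(f"speedup without communication: {speedup_no_comm:.2f}x")
print(f"speedup with communication:    {speedup_with_comm:.3f}x")
```

Under these assumptions, the hardware is roughly 1.5 times faster when communication is ignored, while the ratio drops well below 1 once the SPI communication time is added, meaning the communication cost outweighs the gain from the faster hash core.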