# Project: Software Implementation and Benchmarking
## Software Implementation of SPHINCS+
For my initial approach, I wanted to generate a SPHINCS+ class using an LLM for my software implementation. The prompt that I used was:
> Implement a full-featured SPHINCS+ encryption algorithm that can be passed an input file to be encrypted/decrypted based on a flag argument. Utilize a class structure for the algorithm and its functions, and do not use outside libraries that already implement the SPHINCS+ algorithm prepackaged. Use libraries to make the configuration and control of the Merkle trees easier.
In this prompt, I address a few things to make sure that the generated code is something that I can work with.
- Full-featured: I wanted to make sure that the generated code would not abstract the algorithm too much, so that I could break the functionality down into more understandable sections.
- No outside libraries: I wanted to keep library use to a minimum, since it is much more difficult to navigate and edit the functions of a library than a pure implementation of the algorithm.
- Configuration and control of Merkle trees: I was hoping that there were libraries that could instantiate the trees used in the SPHINCS+ algorithm, but the LLM decided to make its own implementation of these tree algorithms. Since I did not want to spend more time waiting for code generation, this part of the prompt can be disregarded.
I spent some time trying to get the algorithm to work, and I was able to get my WOTS+ function working. However, my FORS implementation had trouble producing the correct hash.
Since I did not have much time left after two failed attempts at other project ideas, I looked for an open-source SPHINCS+ repository. I found a Python version of SPHINCS+ from tottifi, which can be found here: https://github.com/tottifi/sphincs-python. I used this code as my basis so that I could identify the main bottleneck of the algorithm in software.
## Profiling the Bottleneck
Using my knowledge from prior challenges and past iterations on my LLM-based SPHINCS+ implementation, I found that the best way to profile was a profiler class that uses decorators to instrument selected functions in a codebase. Conveniently, I had created this class while working on my prior SPHINCS+ implementation, so I transferred it over to use with tottifi's code. I attached decorators to the sign and verify functions of the SPHINCS+ class. Running these functions generated a call graph and profiling data showing which functions were called most often and which consumed the most execution time.
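A minimal sketch of that decorator-based profiler, with illustrative names rather than the exact ones from my repository, looks like this:

```python
import cProfile
import functools
import pstats

class FunctionProfiler:
    """Accumulates cProfile data for any function wrapped with profile()."""

    def __init__(self):
        self.profiler = cProfile.Profile()
        self._depth = 0  # guard so nested decorated calls don't toggle twice

    def profile(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self._depth += 1
            if self._depth == 1:
                self.profiler.enable()
            try:
                return func(*args, **kwargs)
            finally:
                self._depth -= 1
                if self._depth == 0:
                    self.profiler.disable()
        return wrapper

    def dump(self, path="sphincs.prof"):
        """Write accumulated stats to a .prof file and print a short summary."""
        self.profiler.dump_stats(path)
        pstats.Stats(self.profiler).sort_stats("cumulative").print_stats(10)
```

Wrapping sign and verify this way and calling dump() after a run produces the .prof data used to build the call graph.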
Note: this image can be opened in a new tab to see its finer details.
These results show that the hash function consumed the most time and was called more than any other function in the SPHINCS+ class; the signing stage alone takes up around 46.40% of the total time. The SPHINCS+ class uses a SHA-256 hashing function to hash messages, and I decided that this was the component I wanted to accelerate with an HDL implementation.
### Additional Profiling Metrics
There were some more details that I wanted to profile in order to guide my HDL design:
- The maximum message length that the SPHINCS+ algorithm generates and passes to the hash function.
- The execution time of a single hash call on a 1024-character message.
For the max message length, I signed both a small file and a large file. This metric is important to know, since it determines how large a message buffer I need in my HDL design.
For the short file which was around 1 KB, the longest input message to the hash function was 608 characters long, which was consistent for the signing and verifying functions.
For the long file which was around 1.023 GB, the longest input message to the hash function was 608 characters long, which was consistent for the signing and verifying functions.
Surprisingly, the SPHINCS+ algorithm's use of SHA-256 appears to have a fixed maximum message length of 608 characters. This means I can make an array of 1024 items to hold each 8-bit character, which is plenty for the algorithm's needs. I chose 1024 items to keep the convention of sizing memory arrays in powers of 2, which makes the array easier to address with a binary address.
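As a quick sanity check on that sizing choice (my own worked example, not project code):

```python
import math

longest_msg = 608                         # longest observed hash input (chars)
buffer_size = 1 << math.ceil(math.log2(longest_msg))  # next power of 2
addr_bits = int(math.log2(buffer_size))

print(buffer_size, addr_bits)             # 1024 10 -> a 10-bit address suffices
```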
As for the execution time of the hash function, I tested the hash function from Python's hashlib package to see how long a 1024-character message takes to process. The source code for this can be found here: Source Code
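The linked source is the authoritative version; a rough sketch of the measurement, assuming a simple timeit loop over hashlib's sha256, is:

```python
import hashlib
import timeit

message = b"a" * 1024                     # fixed-length test message
n = 100_000

total = timeit.timeit(lambda: hashlib.sha256(message).digest(), number=n)
print(f"average per hash: {total / n * 1e6:.2f} us")
```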
In my later designs, I found that, due to the 64-bit length value that gets appended to the final message block, the real maximum message length that fills the message buffer is actually 1015 characters, since 9 of the 1024 bytes are reserved for the algorithm's use. So for testing, I used a 1015-character message to hash and generated a call graph from the run.
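The 9 reserved bytes come from SHA-256's padding rules: the final block must carry a mandatory 0x80 padding byte plus a 64-bit (8-byte) length field. A small model of that block arithmetic (my own illustration, not the project's code):

```python
def sha256_blocks(msg_len: int) -> int:
    """Number of 64-byte (512-bit) blocks a msg_len-byte message consumes."""
    return (msg_len + 9 + 63) // 64       # +9 = 0x80 pad byte + 8-byte length

assert sha256_blocks(1015) == 16          # exactly fills the 1024-byte buffer
assert sha256_blocks(1016) == 17          # one byte more spills into block 17
```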
### Problem with Profiling Approach
In my prior use of profiling, I used call graphs to determine the execution time of certain functions, as seen in Part 2 of this section. I did not think much about the validity of these metrics, but after revisiting my profiling method, I found a major flaw: the numbers in these call graphs appear inflated compared to the actual cProfile .prof file. Using SnakeViz to view the .prof file, I see that the actual runtime of a single call to the hash function is around 11 µs, roughly 10 times faster than the supposedly correct average of 102.2 µs for a 1015-character message hash.
My solution was to break down the message lengths per call to the hash function. By doing so, I can better visualize exactly what is going into the hash function. I asked the LLM to produce a message-tracker class that keeps a running total of the number of times the function is called, along with a histogram of calls per message-length category:
> Add a container to keep track of message length values
>
> Can you edit this class to save the results in a file?
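A minimal sketch of the kind of class these prompts produced (names and structure here are illustrative; the actual generated class is linked below):

```python
import json
from collections import Counter

class MessageTracker:
    """Counts hash calls and bins the length of every message hashed."""

    def __init__(self):
        self.call_count = 0
        self.length_histogram = Counter()

    def track(self, hash_func):
        """Wrap hash_func so each call records the input message length."""
        def wrapper(message, *args, **kwargs):
            self.call_count += 1
            self.length_histogram[len(message)] += 1
            return hash_func(message, *args, **kwargs)
        return wrapper

    def save(self, path="message_stats.json"):
        """Dump the call count and length histogram to a JSON file."""
        with open(path, "w") as f:
            json.dump({"calls": self.call_count,
                       "histogram": dict(self.length_histogram)},
                      f, indent=2)
```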
These two prompts provided a class that gives me the details of the hash function I needed, which can be found here. With that, I used the test.txt file to generate some data.
In addition, I tested this with a much larger file that I had on hand, which is 1 GB in size. This produced a very similar count to that of the small test.txt file:
What was quite surprising was that a majority of the calls are for much smaller 64- and 80-character messages. With padding, these each occupy two 512-bit message blocks, and two blocks are much faster to compute than the maximum of 16.
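Applying the same block-count arithmetic as earlier confirms this (again, my own check):

```python
def sha256_blocks(msg_len: int) -> int:
    return (msg_len + 9 + 63) // 64

for length in (64, 80, 1015):
    print(length, "->", sha256_blocks(length), "blocks")
# 64 -> 2, 80 -> 2, 1015 -> 16
```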