Challenge #21
Overview
The objective of Challenge #21 was to revisit and refine the hardware/software co-design developed across Challenges #9 through #18 by performing a full design iteration. The general idea was to reassess the hardware/software partition, re-evaluate the bottlenecks, and then either expand or scale down the project scope based on the results.
Software Refinements
The second iteration of the software algorithm, denoted IB2.py, adds capabilities such as format information extraction, mask pattern reversal, and bit-accurate codeword reading. The initial version (IB.py) lacked these capabilities because I was not specific with Copilot and GPT-4o about what exactly I was looking for, so they were omitted from the original algorithm the versions were based on.
IB2.py now implements extract_format_information() and validate_format_information(), which parse the 15-bit format field from the QR grid using fixed spatial offsets and apply the BCH error correction polynomial to validate it; IB.py had no such functionality. unmask_data(), is_function_pattern(), and is_masked() were also added to undo the XOR-based masking applied to the data modules. read_codewords() handles the codeword extraction logic and now performs a proper zigzag scan, starting from the bottom-right of the grid while skipping function patterns, whereas the original algorithm (IB.py) relied on flat region slicing with no regard for function module locations or QR traversal order. IB2.py also adds Numeric, Alphanumeric, Byte, and Kanji modes for bit-level payload decoding.
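To make the unmasking and traversal concrete, here is a minimal sketch of a read_codewords()-style zigzag read for a version-1 (21×21) grid. The function names mirror the text, but the bodies are illustrative assumptions rather than the actual IB2.py/IB3.py code; the mask conditions and traversal order follow the QR specification.

```python
# Illustrative sketch of version-1 (21x21) codeword reading; function names
# mirror IB2.py/IB3.py but the implementations here are assumptions.
N = 21  # modules per side for QR version 1

def is_function_pattern(r, c):
    """True for modules reserved by finders, separators, format info, timing."""
    if r <= 8 and c <= 8:          # top-left finder + separator + format strip
        return True
    if r <= 8 and c >= N - 8:      # top-right finder + format strip
        return True
    if r >= N - 8 and c <= 8:      # bottom-left finder + format strip + dark module
        return True
    return r == 6 or c == 6        # horizontal / vertical timing patterns

# The eight standard mask conditions; a module is flipped where its mask is true.
MASKS = {
    0: lambda r, c: (r + c) % 2 == 0,
    1: lambda r, c: r % 2 == 0,
    2: lambda r, c: c % 3 == 0,
    3: lambda r, c: (r + c) % 3 == 0,
    4: lambda r, c: (r // 2 + c // 3) % 2 == 0,
    5: lambda r, c: (r * c) % 2 + (r * c) % 3 == 0,
    6: lambda r, c: ((r * c) % 2 + (r * c) % 3) % 2 == 0,
    7: lambda r, c: ((r + c) % 2 + (r * c) % 3) % 2 == 0,
}

def read_codewords(grid, mask_id):
    """Zigzag from the bottom-right, two columns at a time, skipping column 6."""
    bits, col, upward = [], N - 1, True
    while col > 0:
        if col == 6:               # the vertical timing column is skipped entirely
            col -= 1
        rows = range(N - 1, -1, -1) if upward else range(N)
        for r in rows:
            for c in (col, col - 1):       # right column of each pair first
                if not is_function_pattern(r, c):
                    # undo the XOR mask while reading the data bit
                    bits.append(grid[r][c] ^ int(MASKS[mask_id](r, c)))
        upward = not upward
        col -= 2
    # pack the 208 data-region bits of version 1 into 26 eight-bit codewords
    return [int("".join(map(str, bits[i:i + 8])), 2)
            for i in range(0, len(bits) - 7, 8)]
```

For version 1 this visits exactly 208 data modules (26 codewords), which is why flat region slicing without the spec's traversal order cannot recover valid codewords.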
However, for all of these additions, IB2.py still lacks support for QR versions beyond version 1, alignment patterns, and timing pattern refinement, and it offers only a basic level of error correction.
Thus, IB3.py, the most up-to-date iteration of the software algorithm, resolves these issues.
Profiling of New Iteration
Figure 1 below shows the snakeviz flame graph of IB3.py profiled with testingstuff.py as the workload.
Figure 1: Snakeviz flamegraph of IB3.py and testingstuff.py
The analysis of Figure 1 still highlights warp_image() and correct_errors() as two of the largest computation hotspots that could be offloaded into hardware; both remain long-running and heavily invoked throughout execution. warp_image() performs the per-pixel matrix transformations for perspective correction, which involve inverse matrix multiplication and bounds checking, operations well suited to pipelined hardware acceleration. correct_errors() repeatedly invokes GF(256) field operations, which are ALU-intensive and ideal for fixed-function hardware, so both functions remain strong candidates for acceleration. The profile also suggests read_codewords() as a candidate, since its zigzag traversal and bit-slicing logic could benefit from offload, though its computational footprint is not critical.
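To illustrate why warp_image() suits a pipeline, here is a minimal sketch of the per-pixel inverse mapping such a warp performs. H_inv (a precomputed inverse homography), the nearest-neighbor sampling, and all names are assumptions, not the actual IB3.py implementation.

```python
import numpy as np

def warp_image(src, H_inv, out_h, out_w):
    """Inverse-map every output pixel through a 3x3 homography (sketch)."""
    out = np.zeros((out_h, out_w), dtype=src.dtype)
    for y in range(out_h):
        for x in range(out_w):
            # one matrix-vector multiply per pixel: the pipelined-hardware candidate
            sx, sy, w = H_inv @ np.array([x, y, 1.0])
            if w == 0:
                continue
            sx, sy = int(round(sx / w)), int(round(sy / w))
            # bounds check before sampling the source image
            if 0 <= sy < src.shape[0] and 0 <= sx < src.shape[1]:
                out[y, x] = src[sy, sx]
    return out
```

Every pixel runs the same independent multiply-divide-compare sequence with no data dependencies between iterations, so a hardware pipeline can stream roughly one pixel per cycle.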
Figure 2 below shows the snakeviz flame graph of IB3.py itself.
Figure 2: Snakeviz flamegraph of just IB3.py itself.
Although Figure 2 shows IB3.py profiled on its own, with no test workload running, correct_errors() still appears as the central bottleneck due to the GF(256) operations invoked indirectly through poly_eval(), mul(), and div(). More interestingly, warp_image() is far less prominent here, likely because no actual image is being analyzed during this profiling run.
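For context on why those calls dominate, below is a minimal table-based sketch of GF(256) arithmetic of the kind mul(), div(), and poly_eval() perform, assuming the standard QR field polynomial 0x11D; this is an assumed implementation, not the exact IB3.py code.

```python
# Assumed table-based GF(256) helpers (QR field polynomial 0x11D), sketching
# the mul()/div()/poly_eval() hot path seen in the profile.
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D            # reduce modulo x^8 + x^4 + x^3 + x^2 + 1
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]    # doubled table avoids a mod inside mul()

def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def div(a, b):
    if b == 0:
        raise ZeroDivisionError("GF(256) division by zero")
    return 0 if a == 0 else EXP[LOG[a] - LOG[b] + 255]

def poly_eval(poly, x):
    """Horner's rule over GF(256): one mul() and one XOR per coefficient."""
    y = poly[0]
    for coeff in poly[1:]:
        y = mul(y, x) ^ coeff
    return y
```

Each poly_eval() call is a tight loop of table lookups and XORs, which is exactly the fixed-function arithmetic a hardware GF(256) ALU collapses to a cycle per coefficient.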
Benchmark of New Iteration
Figure 3 below shows the benchmarking of IB3.py in terms of QR codes/sec, calls/sec, and warps/sec. These metrics correspond to the hardware-accelerated components identified in Challenge #9 and reaffirmed here; the benchmarks that were missing from Challenge #9 are now provided. The results were 17,311.36 calls/sec for poly_eval(), 11.64 warps/sec for warp_image(), and 11.64 QR codes/sec overall. To compare throughput against the OpenLane analysis in the HW design section, I assumed 200 bytes per QR code (just for a general calculation), which gives: 200 bytes/QR × 11.64 QR/sec = 2,328 bytes/sec, and 2,328 / 1,048,576 ≈ 0.00222 MB/s. The hardware-accelerated design from OpenLane achieves 14.3 MB/s, roughly 6,400× the pure-software throughput.
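The same back-of-the-envelope comparison as a quick script; the 200 bytes/QR figure is the assumption stated above, and the rates come from Figure 3 and the OpenLane analysis.

```python
# Reproduces the throughput estimate; 200 bytes/QR is an assumed payload size.
BYTES_PER_QR = 200
QR_PER_SEC = 11.64          # measured software rate (Figure 3)
HW_MBPS = 14.3              # OpenLane estimate from the HW design section

sw_mbps = BYTES_PER_QR * QR_PER_SEC / 1_048_576   # 2,328 B/s ~= 0.00222 MB/s
print(f"software: {sw_mbps:.5f} MB/s, hardware speedup: {HW_MBPS / sw_mbps:,.0f}x")
# software: 0.00222 MB/s, hardware speedup: 6,441x
```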
Figure 3: Benchmark results of specific hardware accelerated components.
Issues
One of the main issues with testing the new design iteration was that, when testing the alphanumeric portion of the software with a test case whose bits were set to represent "AB", the decoder recognized them as "K9". The fix was to handle the pair encoding correctly: in alphanumeric mode, a character pair is encoded as a single 11-bit value using value = 45 * first_char_index + second_char_index, as sketched below.
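A minimal sketch of that pair encoding and its inverse; the lookup table is the standard QR alphanumeric character set, and the function names are illustrative.

```python
# Standard QR alphanumeric table: digits, A-Z, then space $ % * + - . / :
ALNUM = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"

def encode_pair(a, b):
    """Two characters become one 11-bit value: 45 * first + second."""
    return 45 * ALNUM.index(a) + ALNUM.index(b)

def decode_pair(value):
    """Invert the encoding with one divmod over the same table."""
    first, second = divmod(value, 45)
    return ALNUM[first] + ALNUM[second]

assert encode_pair("A", "B") == 461   # 45 * 10 + 11
assert encode_pair("K", "9") == 909   # "K9" encodes differently, so decoding
assert decode_pair(461) == "AB"       # the whole 11-bit value as a pair is correct
```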
LLM Inquiry Log
Figure 4: First LLM inquiry regarding Challenge #21.
Figure 5: Second LLM inquiry regarding Challenge #21.
Figure 6: Third LLM inquiry regarding Challenge #21.
Figure 7: Fourth LLM inquiry regarding Challenge #21.
Figure 8: Fifth LLM inquiry regarding Challenge #21.
Figure 9: Sixth LLM inquiry regarding Challenge #21.