Challenge #23 - ajainPSU/ECE410-510-Codefests GitHub Wiki
Overview
Week 7's challenge was to reevaluate the project and, if applicable, scale it back. In my analysis I found that I did not in fact need to scale back but rather scale up: my software and hardware components were missing elements and needed further testing, since Challenge #9 and Challenge #15 left out pieces I needed to fully test this design as a chiplet. Challenge #21 was used to redefine the Python elements, and this week's Challenge #23 was used to redefine and expand the Verilog elements and put them through another OpenLane iteration to determine their new maximum clock frequency, number of transistors, and power consumption.
Improvements to the Verilog file
The new version of the Verilog that handles the hardware portions of the warp_image and correct_errors modules has extensive additions and improvements. In qr_hardware_accelerators_v2.v (refer to the file version in the OpenLane results folder), the signal arrays were flattened manually using managed bus widths for the error_magnitude_flat signal, as opposed to the indexed signal-array format used in V1. This change was required for Verilator and synthesis compatibility, since arrays of vectors are not supported as module ports; it also allows easier mapping to hardware ports and more predictable synthesis behavior. A minimal sketch of the flattening pattern follows.
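The sketch below is a hypothetical illustration of that flattening, not the actual V2 code: the module name, parameter, and port names are assumptions, and only the flat-bus indexing idiom is the point.

```verilog
// V1 style used an unpacked array port, e.g. output [7:0] error_magnitude [0:NUM_ERRORS-1],
// which Verilator and most synthesis flows reject as a plain Verilog port.
// V2 style: one flat bus, addressed with indexed part-selects.
module error_magnitude_bus #(
    parameter NUM_ERRORS = 4
) (
    input  wire                    clk,
    input  wire [8*NUM_ERRORS-1:0] magnitude_in_flat,
    output reg  [8*NUM_ERRORS-1:0] error_magnitude_flat
);
    integer i;
    always @(posedge clk) begin
        for (i = 0; i < NUM_ERRORS; i = i + 1)
            // byte i occupies bits [8*i +: 8] of the flat bus
            error_magnitude_flat[8*i +: 8] <= magnitude_in_flat[8*i +: 8];
    end
endmodule
```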
Intermediate signals (lambda_reg, omega_reg) were renamed or registered to decouple computation from input latching, improving timing closure and pipelining potential (a hedged sketch of this register stage appears below). The V2 file also reorganizes the signal flow through the pipeline modules (bm_core, chien_search, forney_algorithm) into more discrete stages. Generate blocks and array-style module instantiations were removed by hand, so each module is now explicitly declared and connected.
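The port names, widths, and the bm_done handshake in this sketch are assumptions rather than the actual V2 interface; it only illustrates the decoupling idea described above.

```verilog
// Latch the Berlekamp-Massey outputs once they are valid, so the Chien search
// and Forney stages see stable, registered polynomials instead of sitting at
// the end of a long combinational GF(256) path.
module bm_output_stage #(
    parameter T = 4                       // maximum number of correctable errors
) (
    input  wire               clk,
    input  wire               bm_done,    // pulses when bm_core has converged
    input  wire [8*(T+1)-1:0] lambda_in,  // error-locator polynomial coefficients
    input  wire [8*T-1:0]     omega_in,   // error-evaluator polynomial coefficients
    output reg  [8*(T+1)-1:0] lambda_reg,
    output reg  [8*T-1:0]     omega_reg
);
    always @(posedge clk) begin
        if (bm_done) begin
            lambda_reg <= lambda_in;
            omega_reg  <= omega_in;
        end
    end
endmodule
```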
The error_magnitude_flat signal is now driven exclusively by the forney_algorithm module, eliminating the multi-driver conflict in V1, where the signal was assigned both inside and outside the instantiated modules.
In version 2 the accelerator instantiates the following helper modules (a minimal sketch of the GF(256) arithmetic follows the list):
- gf_mult to perform GF(256) multiplication using log/antilog tables.
- gf_inv to compute the inverse by subtracting a log value from 255 and mapping back.
- poly_eval to evaluate the omega and lambda polynomials at a provided point.
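The module below is a hedged sketch of the GF(256) multiply these helpers rely on, not the exact V2 code: the table-generation style and port names are illustrative assumptions. The field polynomial 0x11d (x^8 + x^4 + x^3 + x^2 + 1) is the one used by QR-code Reed-Solomon coding.

```verilog
// Multiply two GF(256) elements via log/antilog tables, as described above.
module gf_mult (
    input  wire [7:0] a,
    input  wire [7:0] b,
    output wire [7:0] product
);
    reg [7:0] gf_log     [0:255];
    reg [7:0] gf_antilog [0:254];

    // Build the tables at time zero (simulation-oriented; a synthesis flow
    // would normally use a ROM or a generated case statement instead).
    integer i;
    reg [8:0] x;
    initial begin
        x = 9'd1;
        for (i = 0; i < 255; i = i + 1) begin
            gf_antilog[i]  = x[7:0];
            gf_log[x[7:0]] = i;
            x = {x[7:0], 1'b0};           // multiply by alpha
            if (x[8]) x = x ^ 9'h11d;     // reduce by the field polynomial
        end
        gf_log[0] = 8'd0;                 // log(0) is undefined; guarded below
    end

    // product = antilog((log a + log b) mod 255), with 0 * anything = 0
    wire [8:0] log_sum = gf_log[a] + gf_log[b];
    wire [8:0] log_mod = (log_sum >= 9'd255) ? (log_sum - 9'd255) : log_sum;
    assign product = (a == 8'd0 || b == 8'd0) ? 8'd0 : gf_antilog[log_mod[7:0]];
endmodule
```

gf_inv follows the same table idea described above: for a ≠ 0, the inverse is gf_antilog[(255 - gf_log[a]) % 255], and poly_eval would apply gf_mult repeatedly (for example, Horner-style) to evaluate the omega and lambda polynomials at a point.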
In addition, the Chien search, Berlekamp-Massey, and Forney algorithm modules, which were only skeletal in Version 1, were expanded into functional implementations.
Results of Verilog Iteration #2
qr_hw_accelerators_v2.v was tested in a Verilog simulation using the testbench qr_hw_accelerator_tbv2.v. Compilation produced several minor errors attributed to signal declarations and unused ports, but there were no functional errors during the simulation.
The simulation produced corrected output values placed at indices 0 through 101, spanning sequentially from hexadecimal 0x00 to 0x65, with a few intermediate values such as 0x0b and 0x35 intentionally injected as errors and then recovered, showing that the module performs error correction accurately.
Figure 1 below shows the simulation output: all values are monotonically increasing, which confirms the decoder successfully recovered byte-aligned payload values and validates the polynomial evaluation, syndrome processing, and error-location logic.
Figure 1: Verilog simulator results.
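To make that monotonicity argument concrete, here is an illustrative-only checker in the spirit of the validation above; the interface and signal names are hypothetical and do not come from qr_hw_accelerator_tbv2.v.

```verilog
// Scan the 102 corrected bytes and flag any value that fails to increase.
module monotonic_check #(
    parameter LEN = 102                // indices 0 through 101
) (
    input wire             done,       // asserted when correction finishes
    input wire [8*LEN-1:0] corrected_flat
);
    integer idx;
    reg [7:0] cur, prev;
    always @(posedge done) begin
        prev = 8'h00;
        for (idx = 0; idx < LEN; idx = idx + 1) begin
            cur = corrected_flat[8*idx +: 8];
            if (idx > 0 && cur <= prev)
                $display("FAIL: index %0d value 0x%h does not increase", idx, cur);
            prev = cur;
        end
        $display("Monotonicity check finished; last value 0x%h", prev);
    end
endmodule
```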
OpenLane Iteration #2 Results
Figure 2 below shows the OpenLane2 flow output for the second iteration of the Verilog design.
Figure 2: Final Report of OpenLane2 design with iteration 2 of the HW design.
In the results of the flow, the Layout vs. Schematic (LVS) check passed successfully. This indicates that the synthesized netlist was faithfully mapped into layout and that the physical connectivity matches the intended RTL description. Despite the increased structural complexity from the additional modules and the expansion of existing ones, no functional wiring errors or instance mismatches were introduced during placement and routing. The Design Rule Check (DRC) also passed with no violations.
However, the flow did report Max Slew violations, which are primarily linked to how the expanded datapath modules increased both signal path lengths and capacitances. In particular:
- Modules such as read_codewords and unmask_data introduce additional control signals, grid address decoding, and nested iteration logic.
- The Reed-Solomon polynomial evaluation (gf256_poly_eval) and division units (gf256_div) contain combinational multipliers and feedback loops that create longer paths with increased signal capacitance.
- The increased fanout and load on shared buses contribute to higher slew times, as the drivers are required to toggle larger capacitive loads within each clock cycle.
The presence of violations across multiple PVT corners (fast, slow, nominal) indicates that both path length and drive strength are reaching limits where buffer insertion or clock pipelining would be beneficial.
Max Cap violations arise for similar reasons:
- The QR correction pipeline shares buses across multiple submodules (e.g., locator_fd, syndromes_fd, and output buses such as corrected_codewords).
- Combinational stages that feed multiple parallel units add fanout burden on shared data wires.
- These shared lines accumulate load capacitance that exceeds OpenLane's safe design margin at several corners.
The problem seems to be amplified by Reed-Solomon's inherently parallel operations, where syndrome evaluations and locator updates access multiple memory cells simultaneously.
The antenna rule failures stem from the fact that multiple internal buses, address decoders, and syndrome storage blocks use long metal routes that feed large arrays of registers (reg [7:0] corrected_codewords [0:CODE_LEN-1]). Since the design lacks antenna diode protection on output nets, these violations are flagged. With more time I would work on resolving the remaining edge cases and the antenna issue to make this flow an actually manufacturable design. I did try to resolve the antenna issue by using standard diode cells, which alleviated part of the problem but not all of it, as seen in the configuration excerpt below (the uploaded config.json does not reflect this updated change).
Additionally, I did not include the GDSII output, since not much has changed except that, when viewing it, an extreme layer of diodes covers the full printout, preventing the user from seeing what else is included on that layer. Again, further revisions would fix the antenna issue and the warnings printed by the flow.
"run_antenna_check": true, "run_antenna_drc": true, "run_antenna_fix": true, "antenna_cell_name": "sky130_fd_sc_hd__diode_2", // Standard diode cell "tap_cell_antenna_fix_strategy": 2, "run_openroad_antenna_check": true
Metrics of Iteration 2
In the file "Metric Notes.txt" I calculated the maximum clock frequency, the power consumption, and the estimated number of transistors from the reports that were given from metrics.json, these were the following results:
-
Maximum Clock Frequency: 1 / (setup_slack) = 1 / (13.2469e-9) ≈ 75.49 MHz
-
Data Power Consumption: power__internal__total": 0.006555 W "power__switching__total": 0.005355 W "power__leakage__total": 0.000000414 W "power__total": 0.011911 W
Internal: 6.56 mW Switching: 5.36 mW Leakage: 0.0004 mW
-
Number of Transistors: The estimate was ~6 transistors per cell. "design__instance__count__stdcell": 81518 Estimated Transistor Count = 81518 × 6 ≈ 489,108
-
Throughput: Throughput would be (1 byte/4 cycles) x Max Clock Frequency which is: (1 / 4) x (75.49 MHz) = 18.8725 MB/s
However if we were under the assumption of 1 symbol (8 bits) per clock cycle then throughput would be: 8 bits × 1 × 75.5e6 = 604 Mbps = 75.5 MB/s.
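Consolidating the arithmetic above in one place (all numbers taken directly from the values listed):

$$
\begin{aligned}
f_{\max} &\approx \frac{1}{13.2469\ \text{ns}} \approx 75.49\ \text{MHz}\\
P_{\text{total}} &\approx 6.56 + 5.36 + 0.0004\ \text{mW} \approx 11.91\ \text{mW}\\
N_{\text{transistors}} &\approx 81{,}518 \times 6 \approx 489{,}108\\
\text{Throughput}_{1\ \text{byte}/4\ \text{cycles}} &= \tfrac{1}{4} \times 75.49\ \text{MHz} \approx 18.87\ \text{MB/s}\\
\text{Throughput}_{1\ \text{byte}/\text{cycle}} &= 8 \times 75.5 \times 10^{6}\ \text{b/s} \approx 604\ \text{Mbps} \approx 75.5\ \text{MB/s}
\end{aligned}
$$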
Comparison between Iteration 1 and Iteration 2
- Maximum Clock Frequency: Iteration 1: 57.23 MHz; Iteration 2: 75.49 MHz. Iteration 2 has a higher maximum clock frequency by 18.26 MHz.
- Throughput: Iteration 1: 14.3 MB/s (114.46 Mbps); Iteration 2: 18.8725 MB/s (150.98 Mbps) under the same 1-byte-per-4-cycles assumption (604 Mbps, i.e. 75.5 MB/s, under the 1-symbol-per-cycle assumption). Iteration 2 has 4.5725 MB/s more throughput.
LLM Inquiries
The following figures show all of the LLM inquiries made. Please note that where GPT said "download this file," that content is incorporated into the file qr_hw_accelerators_v2.v.
Figure 3: First inquiry made to expand the hardware accelerators.
Figure 4: Second inquiry made to expand the hardware accelerators.
Figure 5: Third inquiry made to expand the hardware accelerators.
Figure 6: Fourth inquiry made to expand the hardware accelerators.
Figure 7: Fifth inquiry made to expand the hardware accelerators.
Figure 8: Sixth inquiry made to expand the hardware accelerators.
Figure 9: Seventh inquiry made to expand the hardware accelerators.
Figure 10: Eighth inquiry made to expand the hardware accelerators.
Figure 11: Ninth inquiry made to expand the hardware accelerators.
Figure 12: Tenth inquiry made to expand the hardware accelerators.
Figure 13: Eleventh inquiry made to expand the hardware accelerators.
Figure 14: Twelfth inquiry made to expand the hardware accelerators.
Figure 15: Thirteenth inquiry made to expand the hardware accelerators.
Figure 16: Fourteenth inquiry made to expand the hardware accelerators.
Figure 17: Fifteenth inquiry to expand the hardware accelerators.
Figure 18: Sixteenth inquiry to expand the hardware accelerators.