Challenge #9 - ajainPSU/ECE410-510-Codefests GitHub Wiki

Overview

The objective of this project is to lay the foundation for a modular, expandable framework that accelerates parts of a QR code analyzer algorithm through hardware-software co-design. Following the broader goals set for the ECE 410/510 course project, this work prepares for the eventual design, testing, and benchmarking of a co-processor chiplet specifically tailored for efficient QR code decoding.

In line with the project guidelines, the design starts from a blank slate without relying on current accelerators (e.g., GPUs) or traditional CPUs (beyond basic execution support). Instead, the focus is on clean, modular code capable of being profiled, analyzed, and partitioned between hardware and software components to optimize performance. Special emphasis is placed on understanding dataflow, control flow, and memory access patterns to identify parts of the algorithm suitable for hardware acceleration.

The starting point for this framework builds upon the QR Recognizer code available at https://github.com/christofteuscher/hw4ai-sp25, which provides an initial implementation for QR code decoding. The capabilities of the provided code include:

  • Format Information Extraction: Reads the 15-bit format information that encodes error correction level and mask pattern.
  • Mask Pattern Application: Removes the mask applied to the QR code data, supporting mask patterns 0 through 7.
  • Codeword Reading: Correctly extracts codewords following the zigzag pattern while respecting function patterns.
  • Data Decoding: Supports all four QR code encoding modes:
  • Numeric mode (digits 0–9)
  • Alphanumeric mode (digits, uppercase letters, and select symbols)
  • Byte mode (arbitrary binary data, typically UTF-8)
  • Kanji mode (Japanese characters via Shift JIS encoding)

The implementation handles low-level bit operations, correctly interprets variable-length character count fields depending on the QR version and mode, and manages byte boundary crossing. However, several limitations remain:

  • Primarily designed for Version 1 QR codes (21×21 modules).
  • Reed-Solomon error correction is not fully implemented.
  • Perspective transformation is simplified, potentially impacting skewed QR codes.
  • Some edge cases in decoding might not be handled perfectly.

Building upon this baseline, the project will progressively extend, modularize, and analyze the system for deeper hardware-software integration. The objectives for Challenge #9 are:

  • Deep understanding of the algorithmic problem and answering the Heilmeier questions.
  • Profiling the code using tools like cProfile, memory_profiler, and call graph generators to expose performance bottlenecks.
  • Generating dataflow graphs and high-level block diagrams to visualize and optimize the algorithm.
  • Establishing a baseline software benchmark to evaluate future hardware improvements against.
  • Preparing for the eventual partitioning of the system, identifying computational hotspots for hardware acceleration candidates.

Lastly, I would like to acknowledge the tools used, particularly the LLMs: ChatGPT-4o helped with the initial framework and design of the software, and GitHub Copilot supported debugging.

Initial Software

The initial software framework was developed as an extension to the baseline QR code recognizer provided in the course repository. Building upon the foundational capabilities, this version introduces critical enhancements aimed at extending functionality, improving accuracy, and preparing the codebase for future hardware-accelerated optimization. This file can be found in the Challenge #9 folder and is designated "IB.py". The inquiries made to ChatGPT asked it to implement these solutions and further develop the original software.

Compared to the original implementation, the following major additions and improvements have been made:

  • Reed-Solomon Error Correction Support A complete Reed-Solomon decoder was implemented, based on GF(256) arithmetic. The decoder calculates syndromes, detects errors, and attempts to correct erroneous codewords, significantly improving the reliability of decoded data even in the presence of minor image distortions or noise. This prepares the project for full QR code standard compliance, which requires error correction.

  • Real Perspective Transformation A perspective correction module was developed using explicit homography matrix computation and warping. This replaces the simplified transformations from the original framework, allowing the system to properly handle skewed or rotated QR codes. This significantly enhances the robustness of grid extraction under non-ideal image conditions.

  • Support for Higher QR Versions The extraction logic was rewritten to generalize for higher QR code versions (beyond just Version 1). The grid sampling now dynamically adapts based on the specified version number, paving the way for decoding larger QR codes with more embedded data.

  • Hardware-Oriented Preparation Throughout the code, trace points were embedded to log dataflow events (self.trace(label, data)) whenever key operations occur. This explicit trace infrastructure prepares the codebase for profiling, memory analysis, and eventual hardware/software partitioning. It enables tracking internal states and computational steps, facilitating later acceleration decisions and dataflow graph generation.
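
Below is a minimal sketch of what such a trace hook can look like; the mixin class name, the trace_log attribute, and the summary format are assumptions for illustration and need not match IB.py exactly.

```python
# Minimal sketch of the trace hook described above (hypothetical names).
# Each call records a labeled event so dataflow can be reconstructed later.
import time

class TraceMixin:
    def __init__(self):
        self.trace_log = []          # list of (timestamp, label, summary)

    def trace(self, label, data):
        # Store a lightweight summary rather than the full array, keeping
        # the overhead negligible during profiling runs.
        summary = getattr(data, "shape", None) or repr(data)[:80]
        self.trace_log.append((time.perf_counter(), label, summary))
```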

Main Components of the Code include:

  • GF256 Class Implements Galois Field arithmetic necessary for Reed-Solomon operations, including efficient multiplication and division (sketched briefly after this list).

  • ReedSolomonDecoder Class Implements syndrome calculation, error detection, and error correction for codewords using GF(256) math.

  • PerspectiveTransformer Class Provides methods for computing the homography matrix and warping an input image to correct for perspective distortions.

  • QRCodeRecognizer Class Manages the full recognition pipeline, including:

  • Image loading and thresholding

  • Finder pattern detection

  • Grid extraction with perspective correction

  • Codeword extraction and error correction

  • Payload decoding based on QR encoding modes
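
To make the roles of the GF256 and ReedSolomonDecoder classes concrete, here is a minimal sketch of GF(256) exp/log table arithmetic and syndrome calculation, assuming the QR-standard primitive polynomial 0x11D. The table and function names are illustrative rather than the exact IB.py code.

```python
# Sketch of GF(256) exp/log tables and syndrome calculation (illustrative).
GF_EXP = [0] * 512
GF_LOG = [0] * 256

x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:               # reduce modulo the primitive polynomial 0x11D
        x ^= 0x11D
for i in range(255, 512):       # duplicate so multiplication needs no modulo
    GF_EXP[i] = GF_EXP[i - 255]

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]

def calc_syndromes(codewords, num_ecc):
    # Evaluate the received polynomial at alpha^0 .. alpha^(num_ecc-1);
    # all-zero syndromes mean no detectable errors.
    synd = []
    for i in range(num_ecc):
        s = 0
        for c in codewords:     # Horner evaluation, highest-order codeword first
            s = gf_mul(s, GF_EXP[i]) ^ c
        synd.append(s)
    return synd
```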

However, the software has some current limitations:

  • Payload decoding is fully implemented only for byte mode; other QR modes (numeric, alphanumeric, kanji) are placeholders for future expansion.
  • Grid extraction assumes clear finder patterns; robustness to extreme image degradation is not yet implemented.
  • Hardware acceleration has not yet been applied; the trace data and profiling hooks are preparatory steps for future optimization.

In operation the software is designed to work as follows: First, the system loads a grayscale version of the input image and applies an adaptive thresholding operation to produce a clean binary (black and white) representation of the QR code. It then detects the three primary finder patterns by labeling connected components and calculating their centers of mass, ensuring correct localization even if the image is slightly distorted. Using these finder patterns, the system computes a full perspective transformation (homography) to rectify any skew and accurately extract a square QR grid corresponding to the specified QR version.
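
A minimal sketch of the thresholding and finder-pattern localization steps is shown below. It assumes a grayscale NumPy array as input, substitutes a simple global-mean threshold for the adaptive thresholding described above, and selects finder-pattern candidates by component size, which is a simplification; function names are illustrative.

```python
# Sketch of thresholding and finder-pattern localization (illustrative names).
import numpy as np
from scipy import ndimage

def threshold_image(gray):
    # Global-mean threshold: dark modules become True, light become False.
    return gray < gray.mean()

def find_finder_candidates(binary):
    # Label connected dark regions and compute each region's center of mass.
    labeled, num = ndimage.label(binary)
    indices = range(1, num + 1)
    centers = ndimage.center_of_mass(binary, labeled, indices)
    sizes = ndimage.sum(binary, labeled, indices)
    # Keep the three largest dark components as finder-pattern candidates.
    order = np.argsort(sizes)[::-1][:3]
    return [centers[i] for i in order]
```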

Once the grid is extracted, the system reads and reconstructs the raw codewords by traversing the modules in the appropriate pattern. It applies Reed-Solomon error correction to the extracted codewords to repair any corrupted data. The system then identifies the encoding mode of the payload (such as byte mode) and properly decodes the message based on QR specification standards.
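
As a concrete reference for the final decoding step, the sketch below walks the corrected codewords as a bitstream and decodes a byte-mode payload (4-bit mode indicator, 8-bit character count for small QR versions). The helper names are illustrative, and the sketch handles byte mode only, matching the limitation noted earlier.

```python
# Sketch of byte-mode payload decoding from corrected codewords (illustrative).
def bits_from_codewords(codewords):
    bits = []
    for cw in codewords:
        bits.extend((cw >> (7 - i)) & 1 for i in range(8))   # MSB first
    return bits

def read_int(bits, pos, n):
    value = 0
    for b in bits[pos:pos + n]:
        value = (value << 1) | b
    return value, pos + n

def decode_byte_mode(codewords):
    bits = bits_from_codewords(codewords)
    mode, pos = read_int(bits, 0, 4)
    if mode != 0b0100:                      # 0100 = byte mode indicator
        raise ValueError("only byte mode handled in this sketch")
    count, pos = read_int(bits, pos, 8)     # 8-bit count for versions 1-9
    data = bytearray()
    for _ in range(count):
        byte, pos = read_int(bits, pos, 8)
        data.append(byte)
    return data.decode("utf-8", errors="replace")
```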

Profiling

To evaluate the initial performance of the QR code analyzer and prepare for hardware-oriented optimization, profiling was conducted using Python's cProfile module and visualized with the Snakeviz tool. The profiler captured the execution of the full QR recognition pipeline on a Version 1 QR code image.
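
The profiling run can be reproduced with a setup along these lines; the recognizer.run(...) entry point and the output filename are placeholders for the actual IB.py driver.

```python
# Sketch of the cProfile/Snakeviz setup (entry point name is a placeholder).
import cProfile, pstats

cProfile.run("recognizer.run('qr_version1.png')", "ib_profile.out")
stats = pstats.Stats("ib_profile.out")
stats.sort_stats("cumulative").print_stats(20)   # top 20 by cumulative time
# The same output file can then be explored interactively:
#   snakeviz ib_profile.out
```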

High-Level Performance Observations:

  • Total Execution Time: ~0.445 seconds for a full end-to-end decode.
  • Top-Level Node: The dominant execution node was the top-level Python execution environment (builtins.exec), as expected for script-level invocation.
  • Branching Behavior: The profiling flame graph revealed numerous tight inner loops and heavy branching, suggesting compute-intensive hotspots inside both the matrix transformation routines and the error correction mechanisms.

Identified Bottlenecks

Through flame graph analysis and inspection of cumulative execution times, the following primary bottlenecks were identified:

  • Perspective Warp (Homography Computation and Warping)
  • The pixel-wise warping process inside the warp_image function operates over the entire output grid, making it a strong candidate for optimization via loop unrolling, vectorization (e.g., SIMD), or fixed-point computation (see the sketch after this list).
  • Reed-Solomon Error Correction
  • Polynomial multiplication, syndrome calculations, and error correction loops contribute significantly to runtime. These operations involve many branches and GF(256) arithmetic, suggesting acceleration opportunities via custom finite field ALUs or LUT-based evaluation.
  • Bitstream Decoding (Payload Parsing)
  • Bit slicing, shifting, and mode decoding within the payload interpretation stage present a third bottleneck. Transitioning these operations into a finite-state machine (FSM) structure or custom logic block would streamline execution.
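
To illustrate why the warp dominates, the sketch below shows the kind of per-pixel inverse-mapping loop being profiled: every output module is mapped back through the homography and sampled individually. The function and variable names are assumptions, not the exact IB.py code.

```python
# Illustrative per-pixel warp of the kind profiled as a hotspot.
import numpy as np

def warp_image_naive(src, H, out_size):
    h_src, w_src = src.shape
    out = np.zeros((out_size, out_size), dtype=src.dtype)
    Hinv = np.linalg.inv(H)
    for y in range(out_size):
        for x in range(out_size):
            # Map the output pixel back into the source image.
            sx, sy, sw = Hinv @ np.array([x, y, 1.0])
            sx, sy = sx / sw, sy / sw
            xi, yi = int(round(sx)), int(round(sy))
            if 0 <= xi < w_src and 0 <= yi < h_src:
                out[y, x] = src[yi, xi]     # nearest-neighbor sampling
    return out
```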

Hardware-Oriented Optimization Opportunities

Based on the profiling results, three major optimization strategies have been identified:

  • Loop Unrolling and Vectorization: Especially targeting the pixel traversal loops in warp_image and Galois Field operations (poly_eval, correct_errors); a vectorized warp sketch follows this list.
  • Finite-State Machine (FSM) Redesign: For QR mode decoding and payload reading, separate FSM stages for mode identification, length extraction, payload parsing, and terminator handling are proposed.
  • Memory Trace Integration: The self.trace(...) logging infrastructure already embedded in the code will enable detailed memory access analysis, module activation tracking, and per-kernel execution profiling, providing valuable data for future hardware mapping.
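
As an example of the vectorization strategy, the sketch below replaces the per-pixel Python loop from the earlier warp sketch with a single batched homography application in NumPy; names mirror that sketch and remain illustrative.

```python
# Sketch of a vectorized warp: all output coordinates are mapped in one step.
import numpy as np

def warp_image_vectorized(src, H, out_size):
    h_src, w_src = src.shape
    ys, xs = np.mgrid[0:out_size, 0:out_size]
    ones = np.ones_like(xs)
    coords = np.stack([xs, ys, ones], axis=0).reshape(3, -1).astype(float)
    mapped = np.linalg.inv(H) @ coords            # map every output pixel at once
    sx = np.rint(mapped[0] / mapped[2]).astype(int)
    sy = np.rint(mapped[1] / mapped[2]).astype(int)
    valid = (sx >= 0) & (sx < w_src) & (sy >= 0) & (sy < h_src)
    out = np.zeros(out_size * out_size, dtype=src.dtype)
    out[valid] = src[sy[valid], sx[valid]]        # gather only in-bounds samples
    return out.reshape(out_size, out_size)
```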

Call-Graph

Figure 1: Snakeviz call-graph/flame graph of profiling IB.py

Memory Usage

Memory profiling of the QR code analyzer was conducted using memory_profiler during execution. The goal was to identify memory usage patterns, assess static versus dynamic allocation, and detect any possible memory bottlenecks or leaks.
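
The memory profiling setup was along the following lines; the run_pipeline wrapper and the QRCodeRecognizer.run(...) call are placeholders for the actual IB.py entry point.

```python
# Sketch of the memory_profiler setup used for the memory analysis.
from memory_profiler import profile
# from IB import QRCodeRecognizer   # assuming IB.py exposes this class

@profile
def run_pipeline(path):
    recognizer = QRCodeRecognizer()   # hypothetical driver object
    return recognizer.run(path)

# Line-by-line report:   python -m memory_profiler ib_mem_test.py
# Time-sampled curve:    mprof run ib_mem_test.py && mprof plot
```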

Memory Growth Curve

The memory usage curve shows an initial rapid growth phase, reaching approximately 71–72 MiB within the first 0.5 seconds of execution. After this point, the memory usage stabilizes around 74–75 MiB and remains constant through the rest of the run. Importantly, there were no signs of uncontrolled growth or memory leaks, indicating a clean and stable memory footprint.

Memory Hotspots and Observations

Profiling revealed several key hotspots:

  • GF256 Table Initialization The largest memory reading (~71.7 MiB of total process usage) is reported at the creation of the exp and log lookup tables for Galois Field arithmetic inside GF256._init_tables(). Since these tables are static and constant, future optimization should move them to an immutable ROM block or synthesized LUT for hardware mapping.

  • Image Loading (load_image) A spike of +1.4 MiB is observed when loading and converting the input image into a NumPy array. This is a temporary allocation and can be optimized in hardware via DMA memory copy or zero-copy techniques.

  • Thresholding (threshold_image) Minimal impact on memory (~+0.1 MiB). Efficient, vectorized operations suitable for hardware vectorization.

  • Finder Pattern Detection (find_finder_patterns) Moderate memory usage (~+0.8 MiB) due to intermediate structures like labeled images and center-of-mass arrays. Possible optimization paths include compressing or offloading this step into a fixed-point FSM if hardware acceleration is needed.

  • Perspective Correction (compute_homography + warp_image) Combined memory footprint around +0.4 MiB. Involves dynamic allocation for the warped output grid; this can be optimized with line buffers or matrix tiling.

  • Reed-Solomon Error Correction (correct_errors) Shows very little memory use (~0.0 MiB increment), confirming that it is computationally expensive but not memory-heavy.

Overall, the system demonstrates good memory stability with a single large static footprint primarily from GF(256) tables, and moderate dynamic memory use during image processing and grid extraction. No memory leaks were observed. Memory optimization efforts will focus on moving static tables to hardware LUTs, reducing intermediate allocations during pattern detection, and exploring buffer-efficient techniques for image warping.

Memory Usage Plot

Figure 2: Memory usage plot of IB.py

Block Diagram

The following block diagram illustrates the high-level architecture of the QR code analyzer pipeline. It models the complete flow from image input to payload extraction, structured for easy mapping to a hardware-software co-design approach.

IB's Block Diagram

Figure 3: Block Diagram

System Overview

The pipeline consists of multiple modular stages, each responsible for a specific transformation or extraction step. Components are designed with hardware mapping in mind, allowing critical operations to be accelerated using custom FSMs, streaming dataflow models, or hardware-optimized blocks.

Key stages are:

  • Image Input: Receives QR code images from an external source. Its hardware interface can be a DMA buffer, RAM controller, or AXI-Stream input.

  • Grayscale Conversion: Converts input images from RGB to single-channel grayscale using weighted averaging. Its hardware mapping can be fixed-function logic, or the stage can be bypassed if the input is already grayscale.

  • Thresholding Logic: Binarizes the grayscale image into black/white utilizing mean or adaptive thresholds. Hardware mapping can be done via SIMD compare units or line-buffered pixel pipelines.

  • Finder Pattern Locator: Locates the three primary finder patterns on the QR code. Hardware mapping can use FSM-based pattern scanning, optionally with CPU assistance.

  • Homography Matrix Calculator: Computes a transformation matrix that rectifies distorted QR grids into straight squares, typically via matrix inversion or SVD (a brief sketch follows this list).

  • Perspective Warp Logic: Applies the computed homography to remap the input image onto a perfect square grid. It can be implemented via a streaming coordinate remapper or an image accelerator.

  • QR Module Sampler: Samples the warped grid to extract individual QR code modules (bits). Hardware can use a tile-based pixel sampler with majority-vote aggregation.

  • Reed-Solomon Error Correction: Corrects errors in codewords using polynomial syndrome evaluation and solving. Its subcomponents are the Syndrome Calculator, Berlekamp-Massey solver, Chien Search, and Forney evaluator. It can use a dedicated GF(256) unit with LUTs and pipelined operations.

  • Mode Detector & Payload Decoder: Identifies the payload encoding mode (numeric, alphanumeric, byte, Kanji) and decodes the payload bits into final data. Hardware mapping can be done via an FSM parser or parallel decoder logic.

  • Output Payload: Delivers the decoded information (e.g., text, URL) to the host system. Can be implemented with a UART, SPI, shared memory, or CPU memory-mapped interface.
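
For reference, the sketch below shows the standard direct linear transform (DLT) formulation of the homography calculation mentioned above: an 8×8 linear system built from four point correspondences, solved here with NumPy (a hardware version would map to a small fixed-point solver or SVD block). Names are illustrative.

```python
# Sketch of homography estimation via the direct linear transform (DLT).
import numpy as np

def compute_homography(src_pts, dst_pts):
    # src_pts, dst_pts: four (x, y) pairs each, in corresponding order.
    A, b = [], []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)       # h33 fixed to 1
```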

However, there are some design considerations:

  • Streaming Focus: Data flows continuously between stages wherever possible to minimize buffering overhead.
  • Modularity: Each block is independently replaceable, allowing incremental hardware optimization without major pipeline redesign.
  • Hardware-Software Partitioning: Early compute-heavy stages (image transformation, thresholding) are excellent candidates for hardware acceleration, while more decision-heavy tasks (finder pattern detection) can remain flexible in software initially.

Initial Suggested Hardware

Based on profiling and memory analysis of the QR code analyzer, several key hardware design strategies are recommended to optimize performance, area, and power efficiency. First, the Reed-Solomon error correction module—comprising syndrome calculation, Berlekamp-Massey solving, and Chien Search—should be implemented as a dedicated GF(256) accelerator with parallel multipliers and precomputed lookup tables. These operations are performance-critical but memory-light, making them ideal candidates for pipelined or FSM-based hardware blocks without heavy buffering overhead.

The homography matrix calculation and pixel warping stages are also prime acceleration targets. Both involve dense per-pixel computations over large grids and benefit from systolic matrix engines or 2D coordinate remappers with interpolation. Implementing line-buffered pipelines, loop unrolling, and fixed-point approximations will significantly reduce both latency and hardware resource demands. Static structures like the GF(256) tables (at whose initialization the ~71.7 MiB memory reading was recorded) should be moved into immutable ROM blocks or LUT arrays to minimize dynamic memory use. Control-heavy tasks, such as finder pattern detection and payload decoding, are best initially managed with software-driven finite-state machines (FSMs) or microcoded control blocks, offering flexibility without committing excessive silicon area prematurely.

Overall, a mixed architecture of custom hardware accelerators (for math-intensive and matrix-heavy stages) combined with software-manageable FSMs (for decision-based logic) provides the best balance for early prototyping, performance scaling, and future high-level synthesis (HLS) or RTL generation.

LLM Inquiries & Results

Attached below are my inquiries to ChatGPT-4o used to create the initial algorithm. See the file "LLMC9-result.py" for the software it developed.

Q9-1a Q9-1b Q9-1c-result

Figure 4: ChatGPT inquiry for analyzing the initial code and suggesting improvements for acceleration.

Q9-2

Figure 5: Inquiry to ChatGPT for implementing said improvements to the algorithm.

Some things did not work out: some of the profiling tools (graphviz/pycall) could not be installed because their subdependencies failed to install. Also, the LLM gutted parts of the original software, removing the other QR encoding modes, zigzag pattern recognition, format information extraction, mask pattern application, alignment handling, and some bit-level operations.