HLS - alex-aleyan/xilinx GitHub Wiki

  • Academy - High-Level Synthesis with the Vitis Unified IDE
    • Labs
      • Lab 2 GUI: skip Step 4 and go to Step 5; finish it, then come back to Step 4.
    • Explore another flavor of this tool not covered in this HLS training - Vitis Unified IDE
      • Vitis Kernel Flow (HLS is Vivado IP Flow) for AI Engine graph applications in Heterogeneous Compute Systems
      • UG1394 Vitis Kernel Flow.
      • UG1393 Vitis Unified Software Platform Documentation: Application Acceleration Development.
      • Academy - Accelerating Applications with the Vitis Unified Software Environment
      • Academy - Embedded Heterogeneous Design
      • Academy - Designing with Versal AI Engine: Architecture and Design Flow - 1
      • Academy - Designing with Versal AI Engine: Graph Programming with AI Engine Kernels - 2
      • Academy - Designing with Versal AI Engine: Kernel Programming and Optimization - 3
      • Academy - Designing with Versal AI Engine: DSP Applications
    • Constructs such as dynamic memory allocation, file I/O, and recursive functions are not supported.
    • HLS directives allow design-space exploration in a very short time.
    • The generated RTL can be used in Vivado IP Integrator, Vitis Model Composer, and the Vitis Development Environment.
    • Use a self-checking C test bench (return 0 if it passes, NON-ZERO if it fails); see the sketch below.
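      • A minimal self-checking test bench sketch (the dut/golden names, the check loop, and the I/O are assumptions for illustration, not the lab sources):
        #include <cstdio>

        // Placeholder for the synthesized HLS top function (assumed prototype).
        int dut(int a) { return a + 1; }
        // Software golden/reference model that the DUT is checked against (assumed).
        int golden(int a) { return a + 1; }

        int main() {
            int errors = 0;
            for (int i = 0; i < 16; i++)
                if (dut(i) != golden(i)) errors++;
            printf("%s\n", errors ? "Test FAILED" : "Test PASSED");
            return errors; // 0 = pass, NON-ZERO = fail, which is what Vitis HLS checks
        }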
    • C Simulation vs C Simulation with Vitis HLS Code Analyzer.
    • HLS Viewers terminology:
      • Scheduling and Binding:
        • Scheduling - schedules operations with respect to clock cycles; this is what pipelines your C code.
        • Binding (mapping) - determines which hardware resource implements each scheduled operation.
      • Control Logic Extraction - turns loops into FSM-driven RTL.
    • Viewers:
      • Analyzing Synthesis results:
        • Schedule Viewer: shows the RTL operations/control and the clock cycle(s) in which each executes;
        • Function Call Graph Viewer: shows throughput (latency, initiation interval) and bottlenecks. Finds potential stalls and deadlocks.
        • Dataflow Viewer: shows the impact of channel depth. Great for performance debugging of deadlocks and stalls. The DATAFLOW pragma/directive must be applied for this viewer to be populated with results.
      • Analyzing C/RTL Cosimulation:
        • Timeline Trace Viewer: shows the run-time profile of the functions. This is like a bird's-eye view of your RTL simulation applied to C functions, telling you how data is scheduled through and moves through your design with respect to clock cycles.
    • Abstract-level Parallelism (task-channel, dataflow optimization) vs Instruction-level Parallelism.
    • TLP(Task-Level Parallelism): Data-driven Model vs Control-driven Model
      • Data-driven Model (TLP-D)
        • Data is NOT interdependent - a bunch of combinational logic that can be pipelined via registers (must not involve any control feedback such as the sink's READY signal?)
        • Does not require interaction with the outside hierarchy/memory.
        • The task behaves like a FIFO: once data is read from the task, the same data cannot be read again (it's not a memory - it's a FIFO).
        • Specified as hls_thread_local hls::stream<int> <vars>;
        • C++ class: hls::task
        • Feedback Design (Cyclical path between Tasks - bursts?), Dynamic Multi-rate Models.
        • Example:
          #include "hls_task.h"    // hls::task
          #include "hls_stream.h"  // hls::stream

          // Each task function's arguments must match the streams passed to its hls::task.
          void f1(hls::stream<int> &in, hls::stream<int> &out, hls::stream<int> &fb) { /* ... */ }
          void f2(hls::stream<int> &in, hls::stream<int> &out)                       { /* ... */ }
          void f3(hls::stream<int> &in, hls::stream<int> &fb, hls::stream<int> &out) { /* ... */ }

          void dut(int *in, int *out, int n){
              hls_thread_local hls::stream<int> s0, s1, s2, s3, s4;
              // order of KPN tasks does NOT matter
              hls_thread_local hls::task task1(f1, s0, s1, s3); // data-driven
              hls_thread_local hls::task task2(f2, s1, s2);     // data-driven
              hls_thread_local hls::task task3(f3, s2, s3, s4); // data-driven
          }
          
      • Control-driven Model (TLP-C; a.k.a. Dataflow Optimization)
        • Data is interdependent:
          • Different data arrives for different TCP connections and should not be mixed between the TCP clients.
          • Data arrives in parallel in bursts, and the parallel bursts are related and must not be mixed. Think of TCP control messages in parallel with their data; data for one control message must not be confused with data for another control message.
          • Think of TDM (Time Division Multiplexing): you canNOT swap packet positions.
        • Turns a series of sequential functions into a pipelined architecture.
        • Example:
          #include "hls_stream.h"

          void diamond(data_t vecIn[N], data_t vecOut[N])
          {
          #pragma HLS DATAFLOW
              // Depth-N streams connect the tasks; DATAFLOW lets them run concurrently.
              hls::stream<data_t, N> c0, c1, c2, c3, c4, c5;
              Load(vecIn, c0);
              Compute_A(c0, c1, c2);
              Compute_B(c1, c3);
              Compute_C(c2, c4);
              Compute_D(c3, c4, c5);
              Store(c5, vecOut);
          }
          
      • Mixing Data-driven and Control-driven:
        #include "hls_task.h"
        #include "hls_stream.h"

        void f1(hls::stream<int> &in, hls::stream<int> &out, hls::stream<int> &fb) { /* ... */ }
        void f2(hls::stream<int> &in, hls::stream<int> &out)                       { /* ... */ }
        void f3(hls::stream<int> &in, hls::stream<int> &fb, hls::stream<int> &out) { /* ... */ }
        void read_in(int *in, int n, hls::stream<int> &s)    { /* control-driven */ }
        void write_out(hls::stream<int> &s, int *out, int n) { /* control-driven */ }

        void dut(int *in, int *out, int n){
        #pragma HLS dataflow
            hls_thread_local hls::stream<int> s0, s1, s2, s3, s4;
            read_in(in, n, s0);    // control-driven task
            // s3 is a feedback stream; it would not easily work in csim
            // order of KPN tasks does NOT matter
            hls_thread_local hls::task task1(f1, s0, s1, s3); // data-driven
            hls_thread_local hls::task task2(f2, s1, s2);     // data-driven
            hls_thread_local hls::task task3(f3, s2, s3, s4); // data-driven
            write_out(s4, out, n); // control-driven task
        }
        
    • HLS Directives - Pragmas:
      • Default (# of cycles = # of iterations × iteration latency; TRAIN) - loops are left rolled.
      • UNROLL (as little as one iteration's latency; STACK) - fully or partially unrolls FOR loops so iterations execute in parallel.
      • PIPELINE (# of cycles ≈ # of iterations + iteration latency; STEP) - allows overlapped execution of the operations of a single loop. Just like a software pipeline or the classic car-wash example - the loop does not wait for one iteration to complete before loading the next data.
      • DATAFLOW - parallel execution of multiple loops/functions, connected by FIFOs and ping-pong buffers (BRAM). See the sketch below.
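      • A minimal sketch of PIPELINE and UNROLL placement (the functions, loop labels, and trip counts are assumptions for illustration, not from the lab sources):
        // Hypothetical kernels used only to show where the pragmas go.
        void scale_vec(const int in[32], int out[32], int coeff) {
        SCALE_LOOP:
            for (int i = 0; i < 32; i++) {
        #pragma HLS PIPELINE II=1   // overlap iterations: accept a new input every clock cycle
                out[i] = in[i] * coeff;
            }
        }

        void add4(const int a[4], const int b[4], int res[4]) {
        ADD_LOOP:
            for (int i = 0; i < 4; i++) {
        #pragma HLS UNROLL          // replicate the hardware: all 4 additions happen in parallel
                res[i] = a[i] + b[i];
            }
        }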
    • Performance Metrics:
      • Initiation Interval (II) - # of clock cycles between accepting new input samples. The most critical performance metric for throughput.
      • Loop latency - # of clock cycles required to execute all the iterations of the loop.
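      • Worked example (numbers assumed for illustration): a loop with 10 iterations and an iteration latency of 3 cycles takes about 10 × 3 = 30 cycles rolled; pipelined with II = 1 it takes about (10 − 1) × 1 + 3 = 12 cycles, and a new input sample is accepted every cycle.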
    • Arrays
      • Arrays map to memory (off-chip memory or on-chip BRAM); the interface of the module is synthesized to work with that memory.

      • BIND_STORAGE specifies which storage implements the array, e.g. single-port BRAM (the default) or dual-port BRAM.

      • Array partitioning (ARRAY_PARTITION directive) - splits the array into smaller arrays or individual registers so more elements can be accessed per cycle (note: a single BRAM word is at most 72 bits wide).

        • Block partitioning: the original array is split into equally sized blocks of consecutive elements of the original array.
        • Cyclic partitioning: the original array is split into equally sized blocks, interleaving the elements of the original array.
        • Complete partitioning (the default type): the array is split into its individual elements. This corresponds to resolving the memory into registers.
      • Array reshaping (ARRAY_RESHAPE directive) - combines array elements into wider containers, allowing more data to be accessed in a single clock cycle. So, for example, an array of 8-bit elements packed into a 512x72 BRAM lets us access nine 8-bit words at once? See the sketch below.
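      • A minimal sketch of these array directives in pragma form (buffer names, sizes, and factors are assumptions for illustration):
        // Hypothetical kernel used only to show the array pragmas.
        void array_demo(const int in[64], int out[64]) {
            int bufA[64];
        #pragma HLS ARRAY_PARTITION variable=bufA type=cyclic factor=4   // 4 smaller RAMs -> 4 accesses per cycle
            unsigned char bufB[64];
        #pragma HLS ARRAY_RESHAPE   variable=bufB type=block  factor=8   // pack 8 bytes into one wide word
            int bufC[64];
        #pragma HLS BIND_STORAGE    variable=bufC type=ram_2p impl=bram  // force a dual-port BRAM
            for (int i = 0; i < 64; i++) {
                bufA[i] = in[i];
                bufB[i] = (unsigned char)in[i];
                bufC[i] = in[i];
            }
            for (int i = 0; i < 64; i++)
                out[i] = bufA[i] + bufB[i] + bufC[i];
        }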

    • v++ (synthesis) and vitis-run (simulation/cosimulation/implementation/export IP|XO) commands.
      • source settings64.sh
      • vitis, vitis -g, or vitis -w
      • v++ (UG1393)
        • v++ --compile (launches the HLS or AI Engine compiler modes)

        • v++ --link (links PL or AI Engine kernel files to create binary files)

        • v++ --package (packages all binary files and boot files into an SD card image)

        • v++ --config arg (specifies a configuration file).

        • v++ --work_dir (specifies the working dir).

        • v++ -c --mode hls (used to synthesize the HLS component).

        • v++ -c --mode hls --config hls_config.cfg --work_dir dct

        • Config File:

          • part option specifies a target device for the HLS component. Note that if you use the platform option instead of the part option, then you must also specify a freqhz option instead of a clock option to change the default clock frequency of the platform.
          • clock option specifies the clock period in ns or the clock frequency in MHz (ns is the default unit). If no period is specified, a default period of 10 ns is used.
          • flow_target option sets the flow target. Set the value to vitis to synthesize as a Vitis™ kernel (.xo file) or set it to vivado to synthesize as Vivado™ IP.
          • syn.file and tb.file options specify the file path and name of source files and test bench source files, respectively.
          • clock_uncertainty option specifies how much of the clock period is used as a margin by HLS. It is defined in ns or as a percentage of the clock period. The clock uncertainty defaults to 27% of the clock period. A config-file sketch using these options follows below.
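          • A sketch of an hls_config.cfg using these options (the part number, file names, and top-level name are placeholder assumptions; syn.top and the [hls] section follow the UG1399 examples):
            # General options
            part=xcvu9p-flga2104-2-i          # placeholder target device

            [hls]
            flow_target=vivado                # or "vitis" to produce a .xo kernel
            clock=10ns                        # target clock period
            clock_uncertainty=27%             # default margin, shown explicitly
            syn.file=dct.cpp                  # design source file (placeholder name)
            syn.top=dct                       # top-level function (placeholder)
            tb.file=dct_test.cpp              # test bench source (placeholder name)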
      • vitis-run - used to enable C simulation, C/RTL cosimulation, and Vivado™ implementation of an HLS component.
        • vitis-run --mode hls
        • vitis-run --csim (run C simulation on the HLS component).
        • vitis-run --cosim (run C/RTL cosimulation on the HLS component).
        • vitis-run --impl (run Vivado implementation out-of-context (OOC) on the HLS component).
        • vitis-run --tcl (run Vitis HLS with a Tcl script, similar to the legacy batch mode).
        • vitis-run --work_dir (specifies the work dir for --cosim and --impl; the specified dir must contain a compiled HLS component).
        • vitis-run --config arg (see v++)
      • Here are some HLS component development steps:
        • Run C simulation using the vitis-run command: vitis-run --mode hls --csim --config hls_config.cfg --work_dir dct

        • Run C synthesis using the v++ command: v++ -c --mode hls --config hls_config.cfg --work_dir dct

        • Run C/RTL cosimulation using the vitis-run command: vitis-run --mode hls --cosim --config hls_config.cfg --work_dir dct

        • Run implementation using the vitis-run command: vitis-run --mode hls --impl --config hls_config.cfg --work_dir dct

        • Export the HLS component output using the vitis-run command: vitis-run --mode hls --package --config hls_config.cfg --work_dir dct

          • package.output.format=<xo|ip_catalog|syn_dcp|sysgen|rtl>
    • Directives:
      • Using directives is an alternative to placing pragmas in the source code.
      • Add them either through the GUI or from the command line with syn.directive.<name>= entries written in the configuration file.
        • syn.directive.pipeline=dct2d II=4
          • DIRECTIVE: PIPELINE,
          • LOCATION: FUNCTION, LOOP, REGION, VARIABLE.
          • ARG: depends on the DIRECTIVE. Arguments let you customize the synthesis results for the same source code across multiple implementations (similar to Strategies in Vivado).
    # Lab 2 notes (vitis -w $TRAINING_PATH/hls_command_line/lab)

    # Change dir to the "hls_command_line/lab/dct"
    cd dsp-hls-2023.2-rev1-lab_files/training/hls_command_line/lab/dct/

    # Set up the config file:
    vim hls_config.cfg 
    
    vitis-run --mode hls --csim    --config hls_config.cfg --work_dir dct   # Run C simulation
    v++ -c    --mode hls           --config hls_config.cfg --work_dir dct   # Synthesize the C code
    vitis-run --mode hls --cosim   --config hls_config.cfg --work_dir dct   # Run C/RTL cosimulation
    vitis-run --mode hls --package --config hls_config.cfg --work_dir dct   # Package the IP
    vitis-run --mode hls --impl    --config hls_config.cfg --work_dir dct   # Run implementation (place & route)