HLS - alex-aleyan/xilinx GitHub Wiki

  • Academy - High-Level Synthesis with the Vitis Unified IDE
    • Labs
      • Lab 2 GUI: skip Step 4 and go to Step 5; finish it, then come back to Step 4.
    • Explore another flavor of this tool not covered in this HLS training - Vitis Unified IDE
      • Vitis Kernel Flow (HLS is Vivado IP Flow) for AI Engine graph applications in Heterogeneous Compute Systems
      • UG1394 Vitis Kernel Flow.
      • UG1393 Vitis Unified Software Platform Documentation: Application Acceleration Development.
      • Academy - Accelerating Applications with the Vitis Unified Software Environment
      • Academy - Embedded Heterogeneous Design
      • Academy - Designing with Versal AI Engine: Architecture and Design Flow - 1
      • Academy - Designing with Versal AI Engine: Graph Programming with AI Engine Kernels - 2
      • Academy - Designing with Versal AI Engine: Kernel Programming and Optimization - 3
      • Academy - Designing with Versal AI Engine: DSP Applications
    • Constructs such as dynamic memory allocation, file I/O, and recursive functions are not supported.
    • HLS directives allow design-space exploration in a very short time.
    • The generated RTL can be used in Vivado IP Integrator, Vitis Model Composer, and the Vitis Development Environment.
    • Use a self-checking C test bench (return 0 if it passes, NON-ZERO if it fails); see the sketch below.
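      • A minimal self-checking test bench sketch (the dut/golden names, the check loop, and the I/O are assumptions for illustration, not the lab sources):
        #include <cstdio>

        // Placeholder for the synthesized HLS top function (assumed prototype).
        int dut(int a) { return a + 1; }
        // Software golden/reference model that the DUT is checked against (assumed).
        int golden(int a) { return a + 1; }

        int main() {
            int errors = 0;
            for (int i = 0; i < 16; i++)
                if (dut(i) != golden(i)) errors++;
            printf("%s\n", errors ? "Test FAILED" : "Test PASSED");
            return errors; // 0 = pass, NON-ZERO = fail, which is what Vitis HLS checks
        }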
    • C Simulation vs C Simulation with Vitis HLS Code Analyzer.
    • HLS Viewers terminology:
      • Scheduling and Binding:
        • Scheduling - schedules operations with respect to clock cycles; this is what pipelines your C code.
        • Binding (mapping) - determines which hardware resource implements each scheduled operation.
      • Control Logic Extraction - turns loops into FSM-driven RTL.
    • Viewers:
      • Analyzing Synthesis results:
        • Schedule Viewer: shows the RTL operations/control and the clock cycle(s) in which each executes;
        • Function Call Graph Viewer: shows throughput (latency, initiation interval) and bottlenecks. Finds potential stalls and deadlocks.
        • Dataflow Viewer: shows the impact of channel depth. Great for performance debugging of deadlocks and stalls. The DATAFLOW pragma/directive must be applied for this viewer to be populated with results.
      • Analyzing C/RTL Cosimulation:
        • Timeline Trace Viewer: shows the run-time profile of the functions. This is like a bird's-eye view of your RTL simulation applied to C functions, telling you how data is scheduled through and moves through your design with respect to clock cycles.
    • Abstract-level Parallelism (task-channel, dataflow optimization) vs Instruction-level Parallelism.
    • TLP(Task-Level Parallelism): Data-driven Model vs Control-driven Model
      • Data-driven Model (TLP-D)
        • Data is NOT interdependent - a bunch of combinational logic that can be pipelined via registers (must not involve any control feedback such as the sink's READY signal?)
        • Does not require interaction with the outside hierarchy/memory.
        • The task behaves like a FIFO: once data is read from the task, the same data cannot be read again (it's not a memory - it's a FIFO).
        • Specified as hls_thread_local hls::stream<int> <vars>;
        • C++ class: hls::task
        • Feedback Design (Cyclical path between Tasks - bursts?), Dynamic Multi-rate Models.
        • Example:
          #include "hls_task.h"    // hls::task
          #include "hls_stream.h"  // hls::stream

          // Each task function's arguments must match the streams passed to its hls::task.
          void f1(hls::stream<int> &in, hls::stream<int> &out, hls::stream<int> &fb) { /* ... */ }
          void f2(hls::stream<int> &in, hls::stream<int> &out)                       { /* ... */ }
          void f3(hls::stream<int> &in, hls::stream<int> &fb, hls::stream<int> &out) { /* ... */ }

          void dut(int *in, int *out, int n){
              hls_thread_local hls::stream<int> s0, s1, s2, s3, s4;
              // order of KPN tasks does NOT matter
              hls_thread_local hls::task task1(f1, s0, s1, s3); // data-driven
              hls_thread_local hls::task task2(f2, s1, s2);     // data-driven
              hls_thread_local hls::task task3(f3, s2, s3, s4); // data-driven
          }
          
      • Control-driven Model (TLP-C; a.k.a. Dataflow Optimization)
        • Data is interdependent:
          • Different data arrives for different TCP connections and should not be mixed between the TCP clients.
          • Data arrives in parallel in bursts, and the parallel bursts are related and must not be mixed. Think of TCP control messages in parallel with their data; data for one control message must not be confused with data for another control message.
          • Think of TDM (Time Division Multiplexing): you canNOT swap packet positions.
        • Turns a series of sequential functions into a pipelined architecture.
        • Example:
          #include "hls_stream.h"

          void diamond(data_t vecIn[N], data_t vecOut[N])
          {
          #pragma HLS DATAFLOW
              // Depth-N streams connect the tasks; DATAFLOW lets them run concurrently.
              hls::stream<data_t, N> c0, c1, c2, c3, c4, c5;
              Load(vecIn, c0);
              Compute_A(c0, c1, c2);
              Compute_B(c1, c3);
              Compute_C(c2, c4);
              Compute_D(c3, c4, c5);
              Store(c5, vecOut);
          }
          
      • Mixing Data-driven and Control-driven:
        #include "hls_task.h"
        #include "hls_stream.h"

        void f1(hls::stream<int> &in, hls::stream<int> &out, hls::stream<int> &fb) { /* ... */ }
        void f2(hls::stream<int> &in, hls::stream<int> &out)                       { /* ... */ }
        void f3(hls::stream<int> &in, hls::stream<int> &fb, hls::stream<int> &out) { /* ... */ }
        void read_in(int *in, int n, hls::stream<int> &s)    { /* control-driven */ }
        void write_out(hls::stream<int> &s, int *out, int n) { /* control-driven */ }

        void dut(int *in, int *out, int n){
        #pragma HLS dataflow
            hls_thread_local hls::stream<int> s0, s1, s2, s3, s4;
            read_in(in, n, s0);    // control-driven task
            // s3 is a feedback stream; it would not easily work in csim
            // order of KPN tasks does NOT matter
            hls_thread_local hls::task task1(f1, s0, s1, s3); // data-driven
            hls_thread_local hls::task task2(f2, s1, s2);     // data-driven
            hls_thread_local hls::task task3(f3, s2, s3, s4); // data-driven
            write_out(s4, out, n); // control-driven task
        }
        
    • HLS Directives - Pragmas:
      • Default (# of cycles = # of iterations × iteration latency; TRAIN) - loops are left rolled.
      • UNROLL (as little as one iteration's latency; STACK) - fully or partially unrolls FOR loops so iterations execute in parallel.
      • PIPELINE (# of cycles ≈ # of iterations + iteration latency; STEP) - allows overlapped execution of the operations of a single loop. Just like a software pipeline or the classic car-wash example - the loop does not wait for one iteration to complete before loading the next data.
      • DATAFLOW - parallel execution of multiple loops/functions, connected by FIFOs and ping-pong buffers (BRAM). See the sketch below.
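      • A minimal sketch of PIPELINE and UNROLL placement (the functions, loop labels, and trip counts are assumptions for illustration, not from the lab sources):
        // Hypothetical kernels used only to show where the pragmas go.
        void scale_vec(const int in[32], int out[32], int coeff) {
        SCALE_LOOP:
            for (int i = 0; i < 32; i++) {
        #pragma HLS PIPELINE II=1   // overlap iterations: accept a new input every clock cycle
                out[i] = in[i] * coeff;
            }
        }

        void add4(const int a[4], const int b[4], int res[4]) {
        ADD_LOOP:
            for (int i = 0; i < 4; i++) {
        #pragma HLS UNROLL          // replicate the hardware: all 4 additions happen in parallel
                res[i] = a[i] + b[i];
            }
        }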
    • Performance Metrics:
      • Initiation Interval (II) - # of clock cycles between accepting new input samples. The most critical performance metric for throughput.
      • Loop latency - # of clock cycles required to execute all the iterations of the loop.
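      • Worked example (numbers assumed for illustration): a loop with 10 iterations and an iteration latency of 3 cycles takes about 10 × 3 = 30 cycles rolled; pipelined with II = 1 it takes about (10 − 1) × 1 + 3 = 12 cycles, and a new input sample is accepted every cycle.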
    • Arrays
      • Arrays map to memory (off-chip memory or on-chip BRAM); the interface of the module is synthesized to work with that memory.

      • BIND_STORAGE specifies which storage implements the array, e.g. single-port BRAM (the default) or dual-port BRAM.

      • Array partitioning (ARRAY_PARTITION directive) - splits the array into smaller arrays or individual registers so more elements can be accessed per cycle (note: a single BRAM word is at most 72 bits wide).

        • Block partitioning: the original array is split into equally sized blocks of consecutive elements of the original array.
        • Cyclic partitioning: the original array is split into equally sized blocks, interleaving the elements of the original array.
        • Complete partitioning (the default type): the array is split into its individual elements. This corresponds to resolving the memory into registers.
      • Array reshaping (ARRAY_RESHAPE directive) - combines array elements into wider containers, allowing more data to be accessed in a single clock cycle. So, for example, an array of 8-bit elements packed into a 512x72 BRAM lets us access nine 8-bit words at once? See the sketch below.
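      • A minimal sketch of these array directives in pragma form (buffer names, sizes, and factors are assumptions for illustration):
        // Hypothetical kernel used only to show the array pragmas.
        void array_demo(const int in[64], int out[64]) {
            int bufA[64];
        #pragma HLS ARRAY_PARTITION variable=bufA type=cyclic factor=4   // 4 smaller RAMs -> 4 accesses per cycle
            unsigned char bufB[64];
        #pragma HLS ARRAY_RESHAPE   variable=bufB type=block  factor=8   // pack 8 bytes into one wide word
            int bufC[64];
        #pragma HLS BIND_STORAGE    variable=bufC type=ram_2p impl=bram  // force a dual-port BRAM
            for (int i = 0; i < 64; i++) {
                bufA[i] = in[i];
                bufB[i] = (unsigned char)in[i];
                bufC[i] = in[i];
            }
            for (int i = 0; i < 64; i++)
                out[i] = bufA[i] + bufB[i] + bufC[i];
        }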

    • v++ (synthesis) and vitis-run (simulation/cosimulation/implementation/export IP|XO) commands.
      • source settings64.sh
      • vitis, vitis -g, or vitis -w
      • v++ (UG1393)
        • v++ --compile (launches the HLS or AI Engine compiler modes)

        • v++ --link (links PL or AI Engine kernel files to create binary files)

        • v++ --package (packages all binary files and boot files into an SD card image)

        • v++ --config arg (specifies a configuration file).

        • v++ --work_dir (specifies the working dir).

        • v++ -c --mode hls (used to synthesize the HLS component).

        • v++ -c --mode hls --config hls_config.cfg --work_dir dct

        • Config File:

          • part option specifies a target device for the HLS component. Note that if you use the platform option instead of the part option, then you must also specify a freqhz option instead of a clock option to change the default clock frequency of the platform.
          • clock option specifies the clock period in ns or the clock frequency in MHz (ns is the default unit). If no period is specified, a default period of 10 ns is used.
          • flow_target option sets the flow target. Set the value to vitis to synthesize as a Vitis™ kernel (.xo file) or set it to vivado to synthesize as Vivado™ IP.
          • syn.file and tb.file options specify the file path and name of source files and test bench source files, respectively.
          • clock_uncertainty option specifies how much of the clock period is used as a margin by HLS. It is defined in ns or as a percentage of the clock period. The clock uncertainty defaults to 27% of the clock period. A config-file sketch using these options follows below.
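          • A sketch of an hls_config.cfg using these options (the part number, file names, and top-level name are placeholder assumptions; syn.top and the [hls] section follow the UG1399 examples):
            # General options
            part=xcvu9p-flga2104-2-i          # placeholder target device

            [hls]
            flow_target=vivado                # or "vitis" to produce a .xo kernel
            clock=10ns                        # target clock period
            clock_uncertainty=27%             # default margin, shown explicitly
            syn.file=dct.cpp                  # design source file (placeholder name)
            syn.top=dct                       # top-level function (placeholder)
            tb.file=dct_test.cpp              # test bench source (placeholder name)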
      • vitis-run - used to enable C simulation, C/RTL cosimulation, and Vivado™ implementation of an HLS component.
        • vitis-run --mode hls
        • vitis-run --csim (run C simulation on the HLS component).
        • vitis-run --cosim (run C/RTL cosimulation on the HLS component).
        • vitis-run --impl (run Vivado implementation out-of-context (OOC) on the HLS component).
        • vitis-run --tcl (run Vitis HLS with a Tcl script, similar to the legacy batch mode).
        • vitis-run --work_dir (specifies the work dir for --cosim and --impl; the specified dir must contain a compiled HLS component).
        • vitis-run --config arg (see v++)
      • Here are some HLS component development steps:
        • Run C simulation using the vitis-run command: vitis-run --mode hls --csim --config hls_config.cfg --work_dir dct

        • Run C synthesis using the v++ command: v++ -c --mode hls --config hls_config.cfg --work_dir dct

        • Run C/RTL cosimulation using the vitis-run command: vitis-run --mode hls --cosim --config hls_config.cfg --work_dir dct

        • Run implementation using the vitis-run command: vitis-run --mode hls --impl --config hls_config.cfg --work_dir dct

        • Export the HLS component output using the vitis-run command: vitis-run --mode hls --package --config hls_config.cfg --work_dir dct

          • package.output.format=<xo|ip_catalog|syn_dcp|sysgen|rtl>
    • Directives:
      • Using directives is an alternative to placing pragmas in the source code.
      • Add them either through the GUI or from the command line with syn.directive.<name>= entries written in the configuration file.
        • syn.directive.pipeline=dct2d II=4
          • DIRECTIVE: PIPELINE,
          • LOCATION: FUNCTION, LOOP, REGION, VARIABLE.
          • ARG: depends on the DIRECTIVE. Arguments let you customize the synthesis results for the same source code across multiple implementations (similar to Strategies in Vivado).
    # Lab 2 notes (vitis -w $TRAINING_PATH/hls_command_line/lab)

    # Change dir to the "hls_command_line/lab/dct"
    cd dsp-hls-2023.2-rev1-lab_files/training/hls_command_line/lab/dct/

    # Set up the config file:
    vim hls_config.cfg 
    
    vitis-run --mode hls --csim    --config hls_config.cfg --work_dir dct   # Run C simulation
    v++ -c    --mode hls           --config hls_config.cfg --work_dir dct   # Synthesize the C code
    vitis-run --mode hls --cosim   --config hls_config.cfg --work_dir dct   # Run C/RTL cosimulation
    vitis-run --mode hls --package --config hls_config.cfg --work_dir dct   # Package the IP
    vitis-run --mode hls --impl    --config hls_config.cfg --work_dir dct   # Run implementation (place & route)