HLS - alex-aleyan/xilinx GitHub Wiki
- Academy - High-Level Synthesis with the Vitis Unified IDE
- Labs
- Lab 2 GUI: skip Step 4 to Step 5, finish and come back to Step 4.
- Explore another flavor of this tool not covered in this HLS training - Vitis Unified IDE
- Vitis Kernel Flow (HLS is Vivado IP Flow) for AI Engine graph applications in Heterogeneous Compute Systems
- UG1394 Vitis Kernel Flow.
- UG1393 Vitis Unified Software Platform Documentation: Application Acceleration Development.
- Academy - Accelerating Applications with the Vitis Unified Software Environment
- Academy - Embedded Heterogeneous Design
- Academy - Designing with Versal AI Engine: Architecture and Design Flow - 1
- Academy - Designing with Versal AI Engine: Graph Programming with AI Engine Kernels - 2
- Academy - Designing with Versal AI Engine: Kernel Programming and Optimization - 3
- Academy - Designing with Versal AI Engine: DSP Applications
- Constructs such as dynamic memory allocation, file I/O, and recursive functions are not supported for synthesis.
- HLS directives allow design-space exploration in a very short time.
- The generated RTL can be used in Vivado IP Integrator, Vitis Model Composer, and the Vitis development environment.
- Use a self-checking C test bench (return 0 if pass, non-zero if fail).
- C Simulation vs C Simulation with Vitis HLS Code Analyzer.
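A minimal sketch of the two points above, assuming a hypothetical top function `vec_scale` (it is not part of the training labs). The design uses a fixed-size array and a bounded loop instead of dynamic allocation or recursion, and the test bench compares the result against an independently computed golden reference, returning 0 only when everything matches.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical design under test: fixed-size array, bounded loop,
// no dynamic allocation, no recursion, no file I/O.
constexpr int N = 8;

void vec_scale(const int in[N], int out[N], int gain) {
    for (int i = 0; i < N; ++i) {
        out[i] = in[i] * gain;
    }
}

// Self-checking test bench body. In a real HLS test bench this would be
// main(); it is named testbench_main() here only so other code can call it.
int testbench_main() {
    int in[N], out[N], golden[N];
    for (int i = 0; i < N; ++i) {
        in[i] = i - 3;
        golden[i] = (i - 3) * 5;   // golden reference, computed independently
    }

    vec_scale(in, out, 5);

    int errors = 0;
    for (int i = 0; i < N; ++i) {
        if (out[i] != golden[i]) {
            std::printf("Mismatch at %d: got %d, expected %d\n", i, out[i], golden[i]);
            ++errors;
        }
    }
    return errors == 0 ? 0 : 1;    // 0 = pass, non-zero = fail
}
```

The return value matters because the tool flow uses the test bench exit code to decide whether C simulation and C/RTL cosimulation passed.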
- HLS Viewers terminology:
-
Scheduling and Binding:
- Scheduling - assigns each operation to a clock cycle, based on the clock period, the operation delays, and directives such as PIPELINE.
- Binding (mapping) - determines which HW resource implements each scheduled operation.
- Control Logic Extraction - builds the FSM that sequences the scheduled operations (loops become FSM-driven RTL).
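To make the scheduling and binding terms concrete, here is a toy as-soon-as-possible (ASAP) scheduler. It is only an illustration and not the algorithm Vitis HLS uses; the `Op` struct, the one-cycle-per-operation assumption, and all function names are hypothetical. Each operation is placed in the first cycle after all of its inputs are available, and binding is approximated by counting how many operations of one kind land in the same cycle, since that many hardware units of that kind would be needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Op {
    std::string kind;        // e.g. "mul", "add"
    std::vector<int> deps;   // indices of operations this one depends on
};

// ASAP schedule: ops must be listed in dependency order (deps precede users).
// Every operation is assumed to take exactly one clock cycle.
std::vector<int> asap_schedule(const std::vector<Op>& ops) {
    std::vector<int> cycle(ops.size(), 0);
    for (std::size_t i = 0; i < ops.size(); ++i) {
        for (int d : ops[i].deps) {
            cycle[i] = std::max(cycle[i], cycle[d] + 1);
        }
    }
    return cycle;
}

// Binding estimate: units of `kind` needed = max ops of that kind in one cycle.
int units_needed(const std::vector<Op>& ops, const std::vector<int>& cycle,
                 const std::string& kind) {
    std::map<int, int> per_cycle;
    int needed = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        if (ops[i].kind == kind) {
            needed = std::max(needed, ++per_cycle[cycle[i]]);
        }
    }
    return needed;
}

// y = (a * b) + (c * d): two multiplies in cycle 0, one add in cycle 1.
std::vector<Op> example_ops() {
    return { {"mul", {}}, {"mul", {}}, {"add", {0, 1}} };
}
```

For the example expression, both multiplies are scheduled in cycle 0 (so two multipliers are bound) and the add follows in cycle 1.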
- Viewers:
- Analyzing Synthesis results:
- Schedule Viewer: shows the RTL operations and control steps and the clock cycles each takes to execute.
- Function Call Graph Viewer: throughput (latency, initiation interval) and bottlenecks. Finds potential stalls and deadlocks.
- Dataflow Viewer: shows the impact of channel depth. Useful for debugging deadlocks and stalls. The DATAFLOW pragma/directive must be applied for this viewer to be populated with results.
- Analyzing C/RTL Cosimulation:
- Timeline Trace Viewer: shows the run-time profile of the functions. It is a bird's-eye view of the RTL simulation mapped back onto the C functions, showing how data is scheduled and moves through the design with respect to clock cycles.
- Abstract-level Parallelism (task-channel, dataflow optimization) vs Instruction-level Parallelism.
-
TLP(Task-Level Parallelism): Data-driven Model vs Control-driven Model
-
Data-driven Model (TLP-D)
- Data is NOT interdependent - a bunch of combinational logic that can be pipelined via registers (must not involve any control feedback like the sink's READY signal?)
- Does not require interaction with the outside hierarchy/memory.
- The task behaves like a FIFO: once data is read from the task, the same data cannot be read again (it's not a memory, it's a FIFO).
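- Specified as hls_thread_local hls::stream<int> <vars>;
- C++ class: hls::task
- Feedback Design (cyclical path between tasks - bursts?), Dynamic Multi-rate Models.
- Example:

    #include "hls_stream.h"
    #include "hls_task.h"

    // Task bodies are elided; each task takes the streams it is bound to.
    void f1(hls::stream<int> &s0, hls::stream<int> &s1, hls::stream<int> &s3) {}
    void f2(hls::stream<int> &s1, hls::stream<int> &s2) {}
    void f3(hls::stream<int> &s2, hls::stream<int> &s3, hls::stream<int> &s4) {}

    void dut(int *in, int *out, int n) {
        hls_thread_local hls::stream<int> s0, s1, s2, s3, s4;
        // order of KPN tasks does NOT matter
        hls_thread_local hls::task task1(f1, s0, s1, s3); // data-driven
        hls_thread_local hls::task task2(f2, s1, s2);     // data-driven
        hls_thread_local hls::task task3(f3, s2, s3, s4); // data-driven
    }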
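The FIFO behaviour described above (a value read from a stream is consumed and cannot be read again) can be modeled in plain C++ without the HLS headers. This is a software sketch only: `Fifo`, `doubler`, `incrementer`, and `run_chain` are hypothetical names, `std::queue` stands in for `hls::stream`, and the tasks are stepped by hand instead of running concurrently as `hls::task` instances would.

```cpp
#include <cassert>
#include <queue>

// Stand-in for hls::stream<int>: reading removes the value from the channel.
struct Fifo {
    std::queue<int> q;
    void write(int v) { q.push(v); }
    int read() { int v = q.front(); q.pop(); return v; }
    bool empty() const { return q.empty(); }
};

// A data-driven task fires whenever input data is available; it has no
// start/done handshake and touches nothing but its streams.
void doubler(Fifo& in, Fifo& out) {
    if (!in.empty()) out.write(in.read() * 2);
}

void incrementer(Fifo& in, Fifo& out) {
    if (!in.empty()) out.write(in.read() + 1);
}

// Push n values (0..n-1) through doubler -> incrementer and return the
// last output, or -1 if nothing was produced.
int run_chain(int n) {
    Fifo s0, s1, s2;
    for (int i = 0; i < n; ++i) s0.write(i);

    int last = -1;
    // Keep stepping the tasks until every channel has drained.
    while (!s0.empty() || !s1.empty() || !s2.empty()) {
        doubler(s0, s1);
        incrementer(s1, s2);
        if (!s2.empty()) last = s2.read();   // consumed: cannot be read twice
    }
    return last;
}
```

The order in which `doubler` and `incrementer` are stepped does not change the result, which mirrors the note that the order of KPN task declarations does not matter.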
-
Control-driven Model (TLP-C; a.k.a. Dataflow Optimization)
- Data is interdependent:
- different data arrives for different TCP connections and must not be mixed between the TCP clients.
- data arrives in parallel bursts that are related and must not be mixed. Think of TCP control messages in parallel with their data; data for one control message shall not be confused with the data for another control message.
- Think of TDM (Time Division Multiplexing). You canNOT switch packet places.
- Turns a series of sequential functions into a pipelined architecture.
- Example:

    void diamond(data_t vecIn[N], data_t vecOut[N]) {
        hls::stream<int, N> c0, c1, c2, c3, c4, c5;
    #pragma HLS DATAFLOW
        Load(vecIn, c0);
        Compute_A(c0, c1, c2);
        Compute_B(c1, c3);
        Compute_C(c2, c4);
        Compute_D(c3, c4, c5);
        Store(c5, vecOut);
    }
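A plain C++ model of the diamond topology above, runnable without the HLS headers. The notes only show the call sequence, so every function body here is a hypothetical placeholder, `std::queue` stands in for `hls::stream`, and the top function is renamed `diamond_model` to mark it as a software sketch.

```cpp
#include <cassert>
#include <queue>

constexpr int N = 4;
using data_t = int;
using Stream = std::queue<data_t>;   // stand-in for hls::stream<data_t>

static data_t pop(Stream& s) { data_t v = s.front(); s.pop(); return v; }

void Load(const data_t in[N], Stream& c0) {
    for (int i = 0; i < N; ++i) c0.push(in[i]);
}
void Compute_A(Stream& c0, Stream& c1, Stream& c2) {   // fork
    for (int i = 0; i < N; ++i) { data_t v = pop(c0); c1.push(v); c2.push(v); }
}
void Compute_B(Stream& c1, Stream& c3) {
    for (int i = 0; i < N; ++i) c3.push(pop(c1) + 25);
}
void Compute_C(Stream& c2, Stream& c4) {
    for (int i = 0; i < N; ++i) c4.push(pop(c2) * 2);
}
void Compute_D(Stream& c3, Stream& c4, Stream& c5) {   // join
    for (int i = 0; i < N; ++i) c5.push(pop(c3) + pop(c4));
}
void Store(Stream& c5, data_t out[N]) {
    for (int i = 0; i < N; ++i) out[i] = pop(c5);
}

// Same call sequence as the diamond example, executed one function at a time.
void diamond_model(const data_t vecIn[N], data_t vecOut[N]) {
    Stream c0, c1, c2, c3, c4, c5;
    Load(vecIn, c0);
    Compute_A(c0, c1, c2);
    Compute_B(c1, c3);
    Compute_C(c2, c4);
    Compute_D(c3, c4, c5);
    Store(c5, vecOut);
}
```

In this model the functions run strictly one after another and the queues grow as needed. With the DATAFLOW pragma the same functions overlap in hardware and the channels have a finite depth, which is why channel sizing shows up in the Dataflow Viewer.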
- Mixing Data-driven and Control-driven Models:

    // Task bodies are elided; each task takes the streams it is bound to.
    void f1(hls::stream<int> &s0, hls::stream<int> &s1, hls::stream<int> &s3) {}
    void f2(hls::stream<int> &s1, hls::stream<int> &s2) {}
    void f3(hls::stream<int> &s2, hls::stream<int> &s3, hls::stream<int> &s4) {}

    void dut(int *in, int *out, int n) {
    #pragma HLS dataflow
        // s3 is feedback, would not easily work in csim
        hls_thread_local hls::stream<int> s0, s1, s2, s3, s4;

        read_in(in, n, s0);
        // order of KPN tasks does NOT matter
        hls_thread_local hls::task task1(f1, s0, s1, s3); // data-driven
        hls_thread_local hls::task task2(f2, s1, s2);     // data-driven
        hls_thread_local hls::task task3(f3, s2, s3, s4); // data-driven
        write_out(s4, out, n);
    }
- HLS Directives - Pragmas:
- Default (# of cycles = # of iterations x clock cycles per iteration; TRAIN) - loops are rolled.
- UNROLL (as little as one iteration's latency when the iterations are independent; STACK) - unrolls or partially unrolls FOR loops.
- PIPELINE (# of cycles is roughly # of iterations + clock cycles per iteration; STEP) - allows parallel execution of the operations in a single loop. Just like a SW pipeline or the car wash example: the loop does not wait for one iteration to complete before loading the next data.
- DATAFLOW - parallel execution of multiple loops/functions, connected through FIFOs and ping-pong buffers (BRAM).
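The cycle-count rules of thumb above can be written out as arithmetic. This is an idealized model that ignores loop entry/exit overhead, dependencies, and resource limits; the function and parameter names are illustrative. The pipelined case uses the common form (iterations - 1) x II + iteration latency, which is close to "iterations + cycles per iteration" when II = 1.

```cpp
#include <cassert>

// Rolled loop (default): iterations execute one after another.
constexpr long rolled_cycles(long iterations, long iter_latency) {
    return iterations * iter_latency;
}

// Pipelined loop: a new iteration starts every `ii` cycles, and the last
// one still needs a full iteration latency to finish.
constexpr long pipelined_cycles(long iterations, long iter_latency, long ii) {
    return (iterations - 1) * ii + iter_latency;
}

// Fully unrolled loop with independent iterations: all copies run in
// parallel, so the whole loop costs one iteration's latency.
constexpr long unrolled_cycles(long iter_latency) {
    return iter_latency;
}
```

For a loop of 100 iterations with a 3-cycle body, this gives 300 cycles rolled, 102 cycles pipelined at II = 1, and 3 cycles fully unrolled, at the cost of 100 copies of the body.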
- Performance Metrics:
- Initiation Interval (II) - # of clock cycles between new input samples. Most critical performance metric for throughput.
- Loop latency - # of clock cycles required to execute all the iterations of the loop.
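As a small worked example of why the initiation interval matters: the rate at which a design accepts new samples is the clock frequency divided by II. The function name and the 100 MHz figure are assumptions chosen for illustration.

```cpp
#include <cassert>

// Samples accepted per second for a given clock frequency and initiation interval.
constexpr double throughput_samples_per_s(double clock_hz, int ii) {
    return clock_hz / ii;
}
```

At an assumed 100 MHz clock, II = 1 accepts 100 million samples per second, while II = 4 accepts 25 million, regardless of how many cycles a single sample takes to travel through the design.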
- Arrays
-
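Arrays are implemented as memory. For an array on the top-level interface, the memory is assumed to be off-chip and the module interface is synthesized to access it; arrays inside the design typically map to on-chip BRAM.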
-
BIND_STORAGE specifies whether the Single Port BRAM (default) or Dual Port BRAM is used
-
Array partitioning - split array to match the max word size (72 bits) supported by the BRAM.
- block partitioning: the original array is split into equally sized blocks of consecutive elements of the original array.
- cyclic partitioning: the original array is split into equally sized blocks by interleaving the elements of the original array.
- complete partitioning: the default operation is to split the array into its individual elements. This corresponds to resolving the memory into registers.
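The difference between block and cyclic partitioning comes down to how an element index maps to a (bank, offset) pair. The sketch below is an illustration of that mapping only; `Location` and both function names are hypothetical, and the block version assumes the array size is divisible by the partition factor.

```cpp
#include <cassert>

struct Location { int bank; int offset; };

// Block partitioning: consecutive elements stay together in one bank.
// Assumes `size` is divisible by `factor`.
Location block_partition(int index, int size, int factor) {
    int block_len = size / factor;
    return { index / block_len, index % block_len };
}

// Cyclic partitioning: consecutive elements are dealt out round-robin.
Location cyclic_partition(int index, int factor) {
    return { index % factor, index / factor };
}
```

With 8 elements and a factor of 2, elements 4 and 5 share bank 1 under block partitioning but land in banks 0 and 1 under cyclic partitioning, so a loop touching neighbouring elements can read both in the same cycle only in the cyclic case.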
-
Array reshaping (ARRAY_RESHAPE directive) - Array reshaping combines the array elements into wider containers, and it allows more data to be accessed in a single clock cycle. So like an array of 8 bit elements placed into a 72x512 BRAM so we can access 9 8-bit words at once?
-
- v++ (synthesis) and vitis-run (simulation/cosimulation/implementation/export IP|XO) commands.
- source settings64.sh
- vitis, vitis -g, or vitis -w
- v++ (UG1393)
-
v++ --compile (launches the HLS or AI Engine compiler modes)
-
v++ --link (links PL or AI Engine kernel files to create bin files)
-
v++ --package (packages all bin files and boot files into SD card image)
-
v++ --config arg (specifies configuration file).
-
v++ --work_dir (specifies working dir).
-
v++ -c --mode hls (used to synthesize the HLS component).
-
v++ -c --mode hls --config hls_config.cfg --work_dir dct
-
Config File:
- part option specifies a target device for the HLS component. Note that if you use the platform option instead of the part option, then you must also specify a freqhz option instead of a clock option to change the default clock frequency of the platform.
- clock option specifies the clock period in ns or MHz (ns is the default). If no period is specified, a default period of 10 ns is used.
- flow_target option sets the flow target. Set the value to vitis to synthesize as a Vitis™ kernel (.xo file) or set it to vivado to synthesize as Vivado™ IP.
- syn.file and tb.file options specify the file path and name of source files and test bench source files, respectively.
- clock_uncertainty option specifies how much of the clock period is used as a margin by HLS. It is defined in ns or as a percentage of the clock period. The clock uncertainty defaults to 27% of the clock period.
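The clock and clock_uncertainty options interact through simple arithmetic: the uncertainty is subtracted from the clock period, and the remainder is what is left for logic. The helper names below are hypothetical and the uncertainty is passed as a fraction.

```cpp
#include <cassert>
#include <cmath>

// Convert a clock given in MHz to a period in ns.
double period_ns_from_mhz(double mhz) {
    return 1000.0 / mhz;
}

// Period left for logic after subtracting the uncertainty margin,
// with the margin given as a fraction of the clock period.
double effective_period_ns(double clock_period_ns, double uncertainty_fraction) {
    return clock_period_ns * (1.0 - uncertainty_fraction);
}
```

With the defaults quoted above (10 ns period, 27% uncertainty), about 7.3 ns of each cycle remains for the logic itself.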
-
- vitis-run - used to enable C simulation, C/RTL cosimulation, and Vivado™ implementation of an HLS component.
- vitis-run --mode hls
- vitis-run --csim (run C simulation on HLS component).
- vitis-run --cosim (run C/RTL cosimulation on HLS component).
- vitis-run --impl (run Vivado Implementation out-of-context OOC on HLS component).
- vitis-run --tcl (runs Vitis HLS from a Tcl script, like batch mode)
- vitis-run --work_dir (specifies the work dir for --cosim and --impl; the specified dir must contain the compiled HLS component).
- vitis-run --config arg (see v++)
- Here are some HLS component development steps:
-
Run C simulation using the vitis-run command: vitis-run --mode hls --csim --config hls_config.cfg --work_dir dct
-
Run C synthesis using the v++ command: v++ -c --mode hls --config hls_config.cfg --work_dir dct
-
Run C/RTL cosimulation using the vitis-run command: vitis-run --mode hls --cosim --config hls_config.cfg --work_dir dct
-
Run implementation using the vitis-run command: vitis-run --mode hls --impl --config hls_config.cfg --work_dir dct
-
Export the HLS component output using the vitis-run command: vitis-run --mode hls --package --config hls_config.cfg --work_dir dct
- package.output.format=<xo|ip_catalog|syn_dcp|sysgen|rtl>
-
- Directives:
- Using directives is an alternative to using pragmas in the source code.
- Add them either through the GUI or on the command line with syn.directive.<DIRECTIVE>=<LOCATION> <ARG> entries written in the configuration file.
- syn.directive.pipeline=dct2d II=4
- DIRECTIVE: PIPELINE,
- LOCATION: FUNCTION, LOOP, REGION, VARIABLE.
- ARG: depends on the DIRECTIVE. Directives allow you to customize the synthesis results for the same source code across multiple implementations (similar to Strategies in Vivado).
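For comparison, the pipeline directive on dct2d with II=4 corresponds to placing a pragma inside the function body. The sketch below shows only the pragma placement: the body, the argument list, and `DCT_SIZE` are hypothetical stand-ins, since the lab's real dct2d is not reproduced in these notes.

```cpp
#include <cassert>

constexpr int DCT_SIZE = 8;   // assumed size, for illustration only

// Hypothetical stand-in for the lab's dct2d. The pragma below plays the
// same role as the config-file directive with LOCATION dct2d and ARG II=4.
void dct2d(const short in[DCT_SIZE], short out[DCT_SIZE]) {
#pragma HLS PIPELINE II=4
    for (int i = 0; i < DCT_SIZE; ++i) {
        out[i] = static_cast<short>(in[i] * 2);   // placeholder computation
    }
}
```

A standard C++ compiler ignores the unknown pragma (possibly with a warning), so the same source still builds as ordinary software. Keeping the setting in the config file instead leaves the source untouched, which is convenient when comparing several implementations of the same code.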
- Labs
# Lab 2 notes (vitis -w $TRAINING_PATH/hls_command_line/lab)
# Change dir to the "hls_command_line/lab/dct"
cd dsp-hls-2023.2-rev1-lab_files/training/hls_command_line/lab/dct/
# Set up the config file:
vim hls_config.cfg
vitis-run --mode hls --csim --config hls_config.cfg --work_dir dct # Run C simulation
v++ -c --mode hls --config hls_config.cfg --work_dir dct # Synthesize the C code
vitis-run --mode hls --cosim --config hls_config.cfg --work_dir dct # Run C/RTL cosimulation
vitis-run --mode hls --package --config hls_config.cfg --work_dir dct # Package the IP
vitis-run --mode hls --impl --config hls_config.cfg --work_dir dct # Run implementation (place and route)