Reproducibility
This page explains how the experiments in the paper 'Scotch: Generating FPGA-Accelerators for Sketching at Line Rate' (publication pending) can be reproduced and how all artifacts are used.
First, make sure that you satisfy all requirements for this project listed on the project main page. The baselines additionally require GCC and CUDA.
Second, make sure that you have generated the parser and, if you plan to run our baselines, that you have generated the input data.
In the following, we assume a Quartus Prime / Vivado installation pointed to by environment variables (with the required licenses, obviously). For example:
Xilinx Vivado:
```
PRODUCT=Vivado
TOOLCHAIN=/data/Xilinx/Vivado/2020.1/bin
```
Quartus Prime Pro:
```
PRODUCT=Quartus
TOOLCHAIN=/data/intelFPGA_pro/19.3/quartus/bin/quartus
```
Automated Tuning
These experiments have quite a few parameters:
- Toolchain + FPGA
- Degree of Data Parallelism
- Operating Frequency
- Algorithm (+ number of rows / columns)
Fortunately, once everything is set up, it's just a single call to `autotune.py`, and you can do something else for a while.
Selecting an I/O template
First, you have to prepare an appropriate I/O controller. Pick the FPGA, the algorithm, and the degree of data parallelism. Given that, the following table will help you pick the appropriate I/O controller from the I/O controller folder. A controller name consists of three parts: a prefix, the degree of data parallelism, and an optional suffix.
FPGA | Prefix
---|---
A10 | A10-10AX115S2F45I1SG-dummy-superior-flex-
S10 | S10-1SG280HU2F50E2VG-dummy-superior-flex-
XUS | xcvu29p-fsga2577-2L-e-flex-
XUS+ | xcvu7p-flva2104-3-e-flex-
Next, you append the degree of data parallelism DP (1, 4, 8, or 16). If your target algorithm is HLL, you also append the suffix "-32-6".

Say you want to run AGMS on an Arria 10 with a data parallelism degree of 4; then you want the I/O controller `A10-10AX115S2F45I1SG-dummy-superior-flex-4`:
```
CONTROLLER=../IO-Controllers/A10-10AX115S2F45I1SG-dummy-superior-flex-4
```
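In general, the controller path is assembled from the three parts described above; a minimal shell sketch (variable names are illustrative):
```
# Illustrative: assemble the controller path from prefix, DP, and suffix.
PREFIX=A10-10AX115S2F45I1SG-dummy-superior-flex-  # from the table above
DP=4                                              # 1, 4, 8, or 16
SUFFIX=                                           # "-32-6" for HLL, empty otherwise
CONTROLLER=../IO-Controllers/${PREFIX}${DP}${SUFFIX}
```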
The I/O templates have a target operating frequency of 400 MHz. If you want to run at a different operating frequency, change the `FMAX` variable in the `compile.sh` script in the I/O controller folder.
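For example, a hedged one-liner, assuming `FMAX` is set as a plain `FMAX=<value>` assignment in `compile.sh`:
```
# Assumption: compile.sh contains a line of the form FMAX=<value>.
# Retarget the template to 300 MHz:
sed -i 's/^FMAX=.*/FMAX=300/' "$CONTROLLER/compile.sh"
```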
Select a Sketch Descriptor
Our ScotchDSL implementations are located in the /Sketches folder. Depending on whether you want to do Map-Apply (column sketches only) or Select-Update (all kinds of sketches), check the corresponding subfolder. If you are using Map-Apply, be sure that the data parallelism selected for the I/O controller matches the algorithm's DP suffix.
The argument to the autotune algorithm is the sketch descriptor in the corresponding folder. So, if you want to do AGMS with a data parallelism degree of 4 using Map-Apply, your descriptor is:
```
DESCRIPTOR=../Sketches/Map-Apply/AGMS-DP4/descriptor.json
```
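The autotune calls below also expect the degree of data parallelism in `DP`; set it to match the controller and the descriptor, e.g. for our running example:
```
export DP=4
```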
Run the Autotune Algorithm
Now we have to fix the remaining parameters of the autotune algorithm.
First, we have to select the code generation mode:
Sketch Type | Mode
---|---
Column Sketch (Select-Update) | export MODE=column
Column Sketch (Map-Apply) | export MODE=column-ma
Matrix / Row Sketch (Select-Update) | export MODE=matrix
If you are running in matrix mode, you will have to select the depth of the memory segments according to the BRAM blocks on the FPGA used. Here are the ones we used for our experiments:
FPGA | Sketch | Depth
---|---|---
A10, S10 | HLL | export DEPTH=2048
A10, S10 | others | export DEPTH=512
XUS+ | all | export DEPTH=4096
XUS | CM | export DEPTH=1024
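For example, for CM on XUS in Select-Update mode:
```
export MODE=matrix
export DEPTH=1024
```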
For column sketches, you then run:
```
python3 autotune.py plain -generator $MODE -descriptor_path $DESCRIPTOR -dp $DP -template_path $CONTROLLER -toolchain $PRODUCT -toolchain_prefix $TOOLCHAIN -cfactor 4 -dfactor 4 -initial_guess 16
```
Then, the number of rows is optimized.
For row and matrix sketches with a number of rows `NROWS` (fixed to one for row sketches), you have to call:
```
python3 autotune.py plain -generator $MODE -descriptor_path $DESCRIPTOR -memseg_depth $DEPTH -dp $DP -template_path $CONTROLLER -toolchain $PRODUCT -toolchain_prefix $TOOLCHAIN -fixrows NROWS -cfactor 4 -dfactor 4 -initial_guess 16
```
Then, the number of columns is optimized. You may instead fix the number of columns using `-fixcols`, in which case the number of rows is optimized.
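For example, for a row sketch such as HLL, the rows are fixed to one and the number of columns is optimized:
```
python3 autotune.py plain -generator $MODE -descriptor_path $DESCRIPTOR \
  -memseg_depth $DEPTH -dp $DP -template_path $CONTROLLER \
  -toolchain $PRODUCT -toolchain_prefix $TOOLCHAIN \
  -fixrows 1 -cfactor 4 -dfactor 4 -initial_guess 16
```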
Note that each individual experiment takes several days, so you definitely want to start the command in a tmux session.
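For example (session name illustrative):
```
# Launch the tuning run in a detached tmux session; reattach later
# with `tmux attach -t autotune`.
tmux new-session -d -s autotune \
  "python3 autotune.py plain -generator $MODE -descriptor_path $DESCRIPTOR -dp $DP -template_path $CONTROLLER -toolchain $PRODUCT -toolchain_prefix $TOOLCHAIN -cfactor 4 -dfactor 4 -initial_guess 16"
```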
Evaluation
Let autotune run for all parameter combinations and collect the summary sizes for the generated accelerators. The throughput is computed as the degree of data parallelism times the operating frequency times 32, as we always use 32-bit input values and processing rates are static. Compare the results to the baselines.
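For example, at a data parallelism degree of 4 and an operating frequency of 400 MHz, the throughput is 4 × 400 MHz × 32 bit = 51.2 Gbit/s.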
Baselines
All baseline implementations are given in the /Baselines folder. For each algorithm, we provide one CPU variant and at least one GPU variant.
CPU (SIMD+OpenMP)
The CPU baselines are straightforward to use and to evaluate against. Each core operates on a sequential chunk of the input data and constructs its own sketch, similar to the replication performed in Scotch. As CPUs have sufficiently large per-core caches, this implementation strategy works well for all summary sizes covered in our evaluation. We use OpenMP for multi-threading and AVX512 instructions with GCC vector types, so be sure that your CPU supports AVX512.
For MinHash and AGMS, execute:
```
./sketch $num_rows $num_threads
```
For all the others:
```
./sketch $num_rows $num_cols $num_threads
```
Set the number of threads to the number of hardware threads available on your CPU. For us, it was 48 (2x12+HT). For row sketches (FC, HLL), you want to set $num_rows to 1. The executables generate a CSV line that reports on the parameters and the observed throughput. In the paper, we report the throughput over 10 iterations compared to our FPGA accelerators.
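For example, a CM run (parameter values illustrative):
```
# 5 rows, 1024 columns, 48 threads; prints a CSV line with the
# parameters and the observed throughput.
./sketch 5 1024 48
```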
GPU (CUDA)
GPU implementation strategies are a little more diverse, as hardware caches are much smaller and can make a difference.
For AGMS and HLL, each thread builds one counter of the sketch over a chunk of the input data. These sketches are always bound by compute throughput, so there was not a lot to gain by tuning. For these algorithms, you execute:
```
./sketch $num_rows
```
For CM and FAGMS, we provide two implementation strategies.
Row-by-Row: For each row, all threads cooperatively make one coalesced pass over the input data and update the row using atomics. This implementation strategy works well for sketches with a large number of columns as collisions are less likely and cache locality is increased.
Row-Parallel: Each thread is assigned to one particular row and a stride of the input data. It updates the row cooperatively with the other threads using atomics. This implementation strategy works well for smaller sketches, as all rows fit into the cache simultaneously and the chance of collisions is decreased. For both algorithms, you execute:
```
./sketch $num_rows $num_cols
```
In our evaluation, we report the throughput for the strategy that provided the highest throughput for the given setup.
For HLL and FC, the Row-Parallel strategy is not available as there is only one row. Instead, our implementation builds a configurable number of replicas (similar to the CPU implementation). If the number of replicas is one, this is equivalent to Row-by-Row; otherwise, it is similar to Row-Parallel. You execute:
```
./sketch $replicas $num_cols
```
As the number of columns is usually pretty large in our experiments, setting the number of replicas to one is almost always the best strategy. We performed the experiments for a number of replicas up to four and reported the best throughput.
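For example, sweeping the replica counts used in the paper (column count illustrative):
```
# Try 1-4 replicas for a fixed number of columns; keep the best throughput.
for r in 1 2 3 4; do
  ./sketch $r 4096
done
```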
All implementations print a CSV string containing the parameters and the observed throughput. As we measure performance with sketch and data residing in device memory, we also report on the transfer bandwidth in case the throughput exceeds it. We measured the transfer bandwidth by averaging over the numbers reported by ten calls to the bandwidthTest shipped with CUDA.
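To reproduce the bandwidth reference, a minimal sketch (bandwidthTest ships with the CUDA samples; average the printed numbers manually):
```
# Ten runs of the CUDA bandwidthTest sample.
for i in $(seq 1 10); do ./bandwidthTest; done
```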
Code Generator
In these experiments, we scale the size of the sketch without automated tuning and report the maximum operating frequency and resource consumption. We provide scripts you may find helpful in the Scaling_Experiments folder.
Sketch Type | Script
---|---
Column Sketch (Select-Update) | run_column_su.sh
Column Sketch (Map-Apply) | run_column_ma.sh
Matrix / Row Sketch (Select-Update) | run_matrix_su.sh
The sketch, parameters, and I/O Controller can be set by modifying the variables at the top of the script. They are pretty much identical to the ones passed to Autotune. You have to run the script from within the Scaling_Experiments folder.
The scripts keep increasing the sketch size until compilation runs out of FPGA resources for the first time. Compiled projects end up in the ./work folder. For resource consumption, see the corresponding resource report. To compute the maximum operating frequency, check the worst negative setup slack in the timing report and adjust the target operating frequency set inside the script accordingly.
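A common way to derive the achievable maximum frequency from the worst negative setup slack WNS is f_max = 1 / (T_target - WNS). For example, a 400 MHz target (T_target = 2.5 ns) with WNS = -0.5 ns gives f_max = 1 / 3.0 ns ≈ 333 MHz.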
Energy Consumption
We have to differentiate between FPGAs and our baselines here.
FPGAs
For FPGAs, we use the Intel Early Power Estimator to get an estimate of the power consumption. First, start autotune for the given sketch on S10 with d=8 and f=390.625. Grab the last successful project folder from ./work and open it in Quartus Prime. Click Project, then Generate Early Power Estimator File. Go grab a hot beverage; when you are back, the file will be ready. Load the file into the S10 Early Power Estimator spreadsheet. We use the total power consumption reported on the title page of the spreadsheet. Of course, you are also free to play around with A10 as well.
CPU Baseline
Here, we measure the energy consumption of the CPU baseline with the sketch size found for the FPGA accelerator by autotune. We use Powerstat with Intel's RAPL interface. Measuring is easier if the implementation runs in an endless loop, so comment in the `while(1)` loop provided in a comment in every source file.

Once the implementation is running, call `powerstat -R` simultaneously and leave it running for a minute. We report the average printed when powerstat terminates.
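For example (a sketch; RAPL access typically requires root):
```
# Sample once per second for 60 samples (~1 minute); powerstat prints
# the average power in its summary on exit.
sudo powerstat -R 1 60
```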
GPU Baseline
Here, we measure the energy consumption of the GPU baseline with the sketch size found for the FPGA accelerator by autotune. We measure power draw using nvidia-smi. Again, measuring is easier if the implementation runs in an endless loop, so comment in the `while(1)` loop provided in a comment in every source file.

Once the implementation is running, call `nvidia-smi --format=csv --query-gpu=power.draw -l 1` and leave it running for a minute. We report the average of the reported values.
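For example, a small sketch that samples for one minute and averages the readings (assumes coreutils `timeout`):
```
# Power draw once per second for 60 s, averaged with awk.
timeout 60 nvidia-smi --format=csv,noheader --query-gpu=power.draw -l 1 \
  | awk '{ sum += $1; n++ } END { printf "%.1f W\n", sum / n }'
```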
Handwritten
Generate an accelerator using autotune as described in the Automated Tuning section. The accelerator has to be CM with five rows on XUS with a data parallelism degree of one, and the target operating frequency has to be set to 497 MHz.
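Putting the pieces from the Automated Tuning section together (the descriptor path is illustrative; set `FMAX=497` in the controller's `compile.sh` as described above):
```
export MODE=matrix
export DP=1
export DEPTH=1024   # XUS + CM, from the depth table above
CONTROLLER=../IO-Controllers/xcvu29p-fsga2577-2L-e-flex-1
DESCRIPTOR=../Sketches/Select-Update/CM/descriptor.json  # illustrative path
python3 autotune.py plain -generator $MODE -descriptor_path $DESCRIPTOR \
  -memseg_depth $DEPTH -dp $DP -template_path $CONTROLLER \
  -toolchain $PRODUCT -toolchain_prefix $TOOLCHAIN \
  -fixrows 5 -cfactor 4 -dfactor 4 -initial_guess 16
```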