DNNDK v3.1 package workflow
This section explains how to create and configure a custom application so it can be executed on the ZedBoard using the DNNDK v3.1 package. All the necessary steps are indicated for both the Caffe and TensorFlow frameworks, following the DNNDK v3.1 User Guide. The workflow for both frameworks is essentially the same, except for the network compression step.
- Download a pretrained model
- Download calibration and evaluation images
- Prepare test dataset for target board
- Network Compression
- Network Compilation
- Programming with DNNDK APIs
- DPU Hybrid Compilation
To download a pre-trained model you can use the model zoo repository. This repository contains state-of-the-art convolutional neural networks for some of the most popular frameworks, including Caffe and TensorFlow. The downloaded archives contain the files with the models themselves and a README file with information on the rest of the folders the archive includes. The most important part is the last section, where the image pre-processing the networks were trained with is described.
The execution of all the examples needs a calibration dataset of 100 to 1000 images, which can be downloaded from the ImageNet dataset here. On this page you can download a 147 GB file with training images, which you don't need for the DNNDK package, or a 6.74 GB file with validation images. This smaller set is the one to download, which can be done here. The .tar archive you can download here contains up to 50000 images. The problem with these images is that they don't come with a .txt file listing all the image names without labels. We are going to create this list with a Python script, whose content is the following:
# -*- coding: utf-8 -*-

def main():
    # Open the file for writing, creating it if it doesn't exist
    f = open("imagenet_calib.txt", "w+")
    # Write the names of the images from 1 to 5000 (8-digit, zero-padded indices)
    i = 1
    while i < 10:
        f.write("ILSVRC2012_val_0000000{}.JPEG\n".format(i))
        i = i + 1
    while i < 100:
        f.write("ILSVRC2012_val_000000{}.JPEG\n".format(i))
        i = i + 1
    while i < 1000:
        f.write("ILSVRC2012_val_00000{}.JPEG\n".format(i))
        i = i + 1
    while i < 5001:
        f.write("ILSVRC2012_val_0000{}.JPEG\n".format(i))
        i = i + 1
    # Close the file when finished
    f.close()

if __name__ == "__main__":
    main()
You can copy this text to a <name_of_the_file>.py file and create the imagenet_calib.txt file by running the command python <name_of_the_file>.py in the terminal.
To download the validation and training data lists of the ImageNet dataset, including the labels, you can use the script here.
A dataset is needed in order to execute inference on the target board, the ZedBoard. For our applications, we create a subset of images obtained from the ImageNet2012 dataset, together with a .txt file containing the names of the images and their labels.
In order to use the subset with the ZedBoard, create a new folder named test_images and copy 500 images of the ImageNet validation dataset into it, together with a list of their names and labels. This validation data list can be downloaded as shown in the previous section. The final step is to add another .txt file which contains all the classes of the ImageNet dataset; the words.txt file in the <dnndk-v3.1-package_download_directory>/common/image500_640_480/ directory contains the names of all these classes. A minimal sketch of this step is shown below.
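As an illustration, here is a minimal Python sketch of the preparation of the subset. The paths, the name of the list file and its "<image name> <label>" line format are assumptions and have to be adapted to your own directories and to the validation list you downloaded:

# build_test_images.py: illustrative sketch, adapt the paths to your own setup
import os
import shutil

val_images_dir = "./imagenet_images/"   # assumed location of the ImageNet validation images
val_list_file = "./val.txt"             # assumed validation list, one "<image name> <label>" per line
words_file = "./words.txt"              # class names, copied from the DNNDK package
output_dir = "./test_images/"

os.makedirs(output_dir, exist_ok=True)

# Keep the first 500 entries of the validation list and copy the corresponding images
with open(val_list_file) as f:
    entries = [line.strip() for line in f][:500]

with open(os.path.join(output_dir, "image_list.txt"), "w") as f:
    for entry in entries:
        name = entry.split()[0]
        shutil.copy(os.path.join(val_images_dir, name), output_dir)
        f.write(entry + "\n")

# Add the file with the names of all the ImageNet classes
shutil.copy(words_file, output_dir)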
To copy the subset into ZedBoard, introduce the following command:
sudo scp -r ./test_images [email protected]:~/xilinx-dnndk-v3.1/ZedBoard/
Network compression consists of downsizing a DNN model in order to reduce the memory usage of the target device when executing inference. The main compression techniques are pruning and quantization, and to execute these techniques it is necessary to use the DECENT Conda environment that was created in the Setting up the host sub-section of the Software Installation page. In any case, before executing these techniques, the model has to be prepared.
The DECENT_Q environment needs a series of files to be able to properly execute pruning and quantization. These files are the frozen_graph.pb, the calibration dataset and the input_fn.
NOTE: The pruning tool is not covered in this guide, as it requires a license we did not have access to.
In the remainder of this subsection, the compression workflow for Caffe and TensorFlow models is explained separately.
The compression workflow for a Caffe model requires fewer steps than for TensorFlow. When using this framework, there are two steps: adapting the file that performs the calibration images' pre-processing, and quantizing the model to 8-bit fixed point so that the DPU can execute the model's kernel.
The .prototxt file of the Caffe model contains the image pre-processing needed to calibrate the model during quantization. The only change that has to be made to this file is indicating the path to the calibration images (the root_folder field) and the .txt file containing the list of the images (the source field). The code shown next appears twice in the .prototxt files.
...
image_data_param {
source: "/media/arroas/HDD/MinhasCousas/EEI/Mestrado/2_Curso/TFM/Inference_Images/calibration_data/imagenet_calib.txt"
root_folder: "/media/arroas/HDD/MinhasCousas/EEI/Mestrado/2_Curso/TFM/Inference_Images/calibration_data/imagenet_images/"
batch_size: 10
shuffle: true
}
...
Calibration is commonly performed with 100 to 1000 images. These images aren't pre-processed all at once, though. Usually, at the quantization step you can select the number of calibration iterations, which indicates how many iterations of the calibration sequence are done. Since the previous field indicates a batch_size of 10 images, the total number of calibration images used will be the product of the number of calibration iterations and this batch size; for example, 100 iterations with a batch size of 10 use 1000 images.
The model downloaded from the AI Model Zoo repository contains the Caffe floating-point network model, float.prototxt, and a float.caffemodel file which contains the weights of the neural network.
Quantization is executed with a script, decent_q.sh, with the following content:
#!/usr/bin/env bash
# working directory
work_dir=$(pwd)
# path of float model
model_dir=${work_dir}/float
# output directory
output_dir=${work_dir}/decent_output

if [ -f /usr/local/bin/decent ]; then
    DECENT="decent"
elif [ -f /usr/local/bin/decent-cpu ]; then
    DECENT="decent-cpu"
else
    echo "Error: Please run DNNDK host_x86/install.sh first to install decent"
    exit 1
fi

[ -d "$output_dir" ] || mkdir "$output_dir"

$DECENT quantize \
    -model ${model_dir}/trainval.prototxt \
    -weights ${model_dir}/trainval.caffemodel \
    -output_dir ${output_dir} \
    -method 1
- model: indicates the floating-point prototxt file.
- weights: indicates the .caffemodel file containing the weights of the model.

These two are the only mandatory fields to be included when calling the decent tool for a Caffe model. There are several other optional parameters; the most important ones are:

- output_dir: specifies the directory where the quantization tool saves its output model.
- weights_bit: bit width for weights and biases. Default is 8.
- data_bit: bit width for quantized activations. Default is also 8.
- method: can be set to 0 or 1, indicating the quantization method. 0 stands for the non-overflow method, which makes sure no values are saturated during quantization, but it might give worse results in the case of outliers. 1 stands for the min-diffs method, which allows saturation to get a lower quantization difference and a higher tolerance to outliers; it usually ends up with narrower ranges than the non-overflow method.
- calib_iter: indicates how many times the calibration function is executed. Each iteration pre-processes as many images as the batch size indicates, so the total number of calibration images is the product of calib_iter and the batch size.
- gpu: the id of the GPU to use when running the decent environment on that device. If you aren't using a GPU, you can set this field to 0.
Once quantization is successful, two files are generated in the output_dir:

- deploy.prototxt: the quantized model, to later use with the compilation tool.
- deploy.caffemodel: the quantized weights of the model.
A TensorFlow model has a different workflow than a Caffe model. The whole process of creating a custom application with a pre-trained TensorFlow model is now explained.
A frozen graph file contains a pre-trained DNN model with all its variables converted to constant values, merging the floating-point model (.pb) and the exact values of all the parameters of the network (.ckpt) into one single file. This file is created from the .pb file given by the pre-trained model and a set of checkpoint files, .ckpt. To handle this conversion, TensorFlow provides a freeze_graph.py script, which is installed with DECENT_Q. To use this tool, you can execute the following commands or copy them into a .sh file in order to execute them all together.
NOTE: If you are using a model from the Xilinx AI Model Zoo repository, the .pb files that are downloaded have already been frozen. You should therefore skip this step.
$ freeze_graph \
--input_graph /tmp/inception_v1_inf_graph.pb \
--input_checkpoint /tmp/checkpoints/model.ckpt-1000 \
--input_binary true \
--output_graph /tmp/frozen_graph.pb \
--output_node_names InceptionV1/Predictions/Reshape_1
NOTE: To see all the options of the freeze tool, execute the freeze_graph --help command.
The input_graph and input_checkpoint fields have to be filled in, respectively, with a .pb and a .ckpt model, which are the result of training a neural network. In this guide we always use already-trained models, so these two files are always given as the starting point of an application.
The input_binary field is not explained in the DNNDK User Guide, but it is always set to true in the User Guide's examples.
One of the fields that has to be filled in at this step is --output_node_names. Later on, we will also be using the field --input_node_names. The input and output nodes are comma-separated lists of node names that indicate the start and end points of quantization; the subgraph between them will be quantized if it is quantizable. It is recommended to place the input nodes at the last part of the pre-processing stage, and the output nodes at the beginning of the post-processing stage, as these two parts might contain operators that aren't quantizable and can cause errors. The definition of both of these parameters can be found in the DNNDK User Guide, pages 57-58.
In order to check the possible input and output node names of the model, which are necessary to fill in the freeze_graph.sh script, you can estimate them by using the following command with your pre-trained model.
$ decent_q inspect --input_frozen_graph=/tmp/inception_v1_inf_graph.pb
The output of this command gives you the names you can use to fill in the input_node_names and output_node_names fields, which you will need in several of the steps when creating an application for a target board.
The calibration dataset can be obtained as explained at the end of section Network Deployment of DNNDK host examples at the Inference of DNNs with the DNNDK package page.
The input_fn of the quantization tool should take an int object as input, indicating the calibration step number, and should return a dict(placeholder_name, numpy.array) object on each call, which is fed into the model's placeholder nodes when running inference. The shape of the numpy.array must be consistent with the placeholders. The pseudo-code example looks like this:
# my_input_fn.py
def calib_input(iter):
    """A function that provides input data for the calibration.
    Args:
        iter: an int object, indicating the calibration step number
    Returns:
        dict(placeholder_name, numpy.array): a dict object, which will be fed
        into the model
    """
    image = load_image(iter)
    preprocessed_image = do_preprocess(image)
    return {"placeholder_name": preprocessed_image}
Calibration is commonly performed with 100 to 1000 images. These images aren't pre-processed all at once, though. Usually, at the quantization step you can select the number of calibration iterations, which indicates how many times the calib_input function is run when quantizing the model. Often, the functions used inside the calibration function to read and pre-process an image are placed in a loop, in order to load more than one image per iteration. The loop runs as many times as indicated by a user-defined variable, calib_batch_size, which is defined in the input_fn.py script. A sketch of such an input function with a batch loop is shown below.
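As an illustration, the following is a minimal sketch of such an input_fn. The placeholder name, input, and the 224x224 input size are taken from the Inception v1 quantization example shown later in this section; the paths, the resizing and the scaling are assumptions that have to be adapted to the pre-processing the model was actually trained with:

# my_input_fn.py: illustrative sketch, adapt paths and pre-processing to your model
import os
import cv2
import numpy as np

calib_image_dir = "./calibration_data/imagenet_images/"    # assumed image directory
calib_image_list = "./calibration_data/imagenet_calib.txt" # assumed list of image names
calib_batch_size = 10

def calib_input(iter):
    # Return a batch of pre-processed images for calibration step `iter`
    with open(calib_image_list) as f:
        lines = f.readlines()
    images = []
    for i in range(calib_batch_size):
        name = lines[iter * calib_batch_size + i].strip()
        image = cv2.imread(os.path.join(calib_image_dir, name))
        image = cv2.resize(image, (224, 224))            # Inception v1 input size
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)   # model-dependent colour order
        image = image / 255.0                            # model-dependent scaling
        images.append(image)
    # Shape (calib_batch_size, 224, 224, 3), fed to the "input" placeholder
    return {"input": np.array(images, dtype=np.float32)}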
The DNN model should now be ready to be quantized. In order to quantize it, we are going to use the decent_q.sh script.
decent_q quantize \
--input_frozen_graph frozen_inception_v1.pb \
--input_nodes input \
--input_shapes ?,224,224,3 \
--output_nodes InceptionV1/Logits/Predictions/Reshape_1 \
--input_fn inception_v1_input_fn.calib_input \
--method 1 \
--gpu 0 \
--calib_iter 100
- The input_frozen_graph field has to be the output of the freeze operation.
- The input_nodes and output_nodes fields have to be filled in with the output of the decent_q inspect command shown previously when explaining the freeze function.
- The input_shapes field specifies the shape of the input nodes, which must be a four-dimensional shape for each node. The first dimension is the batch size, which can be set to unknown, ?. By selecting this option, the batch size can be specified in the input_fn.py script, as previously explained. The two numbers in the middle, 224,224, indicate the pixel size of the input images, and the last number indicates the number of channels of the input node. In the examples of section Network Deployment of DNNDK host examples, images were formatted to RGB, so there are only 3 channels.
- input_fn indicates the script that contains the pre-processing routine of the application, as this step cannot be executed in the DPU IP structure. The pre-processing operations can be added to a Python script, .py, but when indicating the script to the quantize tool, the suffix in the quantize call should be .calib_input (the name of the calibration function). Do not change the .py extension of the script itself, though.
These are the main fields that have to be filled in to perform the quantization operation. There are other optional fields, which are carefully explained in the DNNDK User Guide, pages 57-59. Some of the most important ones are:

- weight_bit: bit width for weights and biases. Default is 8.
- activation_bit: bit width for quantized activations. Default is also 8.
- method: can be set to 0 or 1, indicating the quantization method. 0 stands for the non-overflow method, which makes sure no values are saturated during quantization, but it might give worse results in the case of outliers. 1 stands for the min-diffs method, which allows saturation to get a lower quantization difference and a higher tolerance to outliers; it usually ends up with narrower ranges than the non-overflow method.
- calib_iter: indicates how many times the calibration function is executed. If the batch size of the quantization tool is set to unknown, with its value specified in the input_fn script instead, each iteration pre-processes as many images as the batch size indicates; the total number of calibration images is therefore the product of calib_iter and calib_batch_size.
- output_dir: specifies the directory where the quantization tool saves its output model.
- gpu: the id of the GPU to use when running the decent environment on that device. If you aren't using a GPU, you can set this field to 0.
Once quantization is successful, two files are generated in the output_dir:

- deploy_model.pb: the quantized model, to later use with the compilation tool.
- quantized_eval_model.pb: enables evaluation of the quantized model.
Once quantization is done, an evaluation of the frozen and quantized models can be performed in order to compare the loss in accuracy. The same evaluation could be done in the case of pruning the model. To evaluate the model, we are going to use the Python script provided by the DNNDK v3.1 package in the directory <dnndk-v3.1-package_download_directory>/host_x86/models/tensorflow/resnet_v1_50. The name of the file that performs the evaluation is resnet_v1_50_eval.py. This file can be used with all the models you want to evaluate, as all the parameters the file requires are entered when executing it. Several examples of calling this script can be found on the Inference of AI Model DNNs with the DNNDK package page of the wiki, in the TensorFlow DNN models section.
This script also uses a function from the input_fn.py file, as it needs the images to be pre-processed.
The dump functionality enables comparison of the results between the CPU/GPU and the DPU. Decent_q supports dumping using the quantize_eval_model.pb model previously created during quantization.
The dump tool should be executed as follows:
$ decent_q dump \
--input_frozen_graph quantize_results/quantize_eval_model.pb \
--input_fn dump_input_fn \
--max_dump_batches 1 \
--dump_float 0 \
--output_dir quantize_results
- In the input_fn field we should indicate a script similar to the one used for quantization, but in this case using a batch size of 1, in order to be consistent with the deployment on the DPU. The results of this tool are written to the output_dir; this directory will contain a dump result for each batch of input data.
- dump_float indicates whether or not to dump un-quantized nodes. Zero stands for not dumping this type of node.
For each quantized node, results will be saved in “...int8.bin” and “...int8.txt” format.
The architecture of the Deep Neural Network Compiler (DNNC) consists of a parser, an optimizer and a code-generator. The front-end parser is responsible for parsing the Caffe/TensorFlow model and generates an intermediate representation (IR) of the input model. The optimizer handles optimizations based on the IR, and the code generator maps the optimized IR to DPU instructions.
The compilation process has two very important steps: generating the DPU configuration file with the DLet tool, and compiling the model with DNNC. DLet is used to parse and extract the various DPU configuration parameters from the DPU Hardware Handoff file, HWH, generated by your Vivado project.
The HWH file is located in the following directory, considering that the name of the project created previously in this guide is ZedBoard_DPU_2019_2.
cd /<vivado_project_location>/ZedBoard_DPU_2019_2.srcs/sources_1/bd/design_1/hw_handoff/
In this wiki we showed how to create a PetaLinux project configured with a Vivado project. The Vivado project we created has its own .hwh file. A copy of this file can be found in this link in case you need it.
NOTE: The design_1 folder has the name of the block design the DPU was included into. If you have several block designs, make sure you select the correct one.
The file has the same name as the block design the DPU was included into, design_1.hwh in this case. With this file, DLet is able to generate the configuration .dcf file needed by the compiler to correctly create the DPU kernel. To generate this file, enter the directory you want your .dcf file to be written to, and execute the following command in the terminal.
dlet -f /<vivado_project_location>/ZedBoard_DPU_2019_2.srcs/sources_1/bd/design_1/hw_handoff/design_1.hwh
The .dcf file is created with a name that contains the date the .hwh file was created.
NOTE: The DPU IP block used in the Vivado project has to come from the DPU TRD v3.0 or higher in order to be compatible with the DNNDK v3.1 package.
An example is now shown here:
$ dlet -f /media/arroas/HDD/MinhasCousas/EEI/Mestrado/2_Curso/TFM/vitis-dnn/ZedBoard_DNNs/ZedBoard_DPU_2019_2/ZedBoard_DPU_2019_2.srcs/sources_1/bd/design_1/hw_handoff/design_1.hwh
The output should look like this:
[DLet]Generate DPU DCF file dpu-05302000-302000-202005302000-2000-00.dcf successfully.
*** stack smashing detected ***: <unknown> terminated
Aborted (core dumped)
You can rename the .dcf file to an easier name, such as custom_zedboard. A copy of this .dcf file is included here for the Vivado project created in the page FPSoC hardware description project.
NOTE: If the .hwh file is corrupted and the DNNC tool doesn't execute properly due to the .dcf file, you can regenerate it by going to your Vivado project and generating the output products again. This is done in section Generate the bitstream.
The compilation tool DNNC uses the quantization output models obtained from either the Caffe or the TensorFlow framework. The only difference when calling the compiler with one or the other is how these outputs are indicated.
Caffe models require indication of two fields:

- prototxt: path to the quantized deploy.prototxt file.
- caffemodel: path to the quantized deploy.caffemodel file.

TensorFlow models require indication of only one field:

- frozen_pb: path to the quantized deploy_model.pb file.
When compiling a model, there are several parameters that have to be indicated:
- parser can be filled in with two options, caffe or tensorflow. Depending on the model's framework, you have to choose one or the other. If using a Caffe model, you have to indicate two more fields, prototxt and caffemodel, with the location of the prototxt and caffemodel files. If using TensorFlow, you have to indicate the frozen_pb field with the location of your deploy_model.pb file.
- dcf indicates the path to the configuration file that was created with the DLet tool.
- mode establishes the compilation mode of the DPU kernel, which can be debug or normal. The debug option allows running the layers of the model one by one under the scheduling of N2Cube; with the DExplorer application, users can then perform debugging or performance profiling layer by layer. Normal mode packages all the layers of the model into one single DPU execution. With this mode, the DPU kernel delivers better performance, and it is recommended for the release phase of an application.
- cpu_arch indicates the architecture of the target device. The possibilities are arm32 or arm64.
- output_dir establishes the output directory of the compiled model.
There are more parameters that can be set, and they are all specified in the DNNDK User Guide, pages 65-67.
An example of how to create a script to perform compilation with Inceptionv1 and the TensorFlow framework is now shown.
#!/usr/bin/env bash
net="inception_v1"
CPU_ARCH="arm32"
DNNC_MODE="debug"
dnndk_board="ZedBoard"
dnndk_dcf="../dcf/custom_zedboard.dcf"

echo "Compiling Network ${net}"

# Work space directory
work_dir=$(pwd)
# Path of caffe quantization model
model_dir=${work_dir}/quantize_results
# Output directory
output_dir="dnnc_output"
tf_model=${model_dir}/deploy_model.pb
DNNC=dnnc

# Get DNNDK config info
if [ ! -f /etc/dnndk.conf ]; then
    echo "Error: Cannot find /etc/dnndk.conf"
    exit 1
else
    tmp=$(grep "DNNDK_VERSION=" /etc/dnndk.conf)
    dnndk_version=${tmp#DNNDK_VERSION=}
    dnndk_version=${dnndk_version#v}
    echo "DNNDK : $dnndk_version"
    echo "Board Name : $dnndk_board"
    echo "DCF file : $dnndk_dcf"
fi

if [ ! -d "$model_dir" ]; then
    echo "Cannot find directory $model_dir"
    exit 1
fi

[ -d "$output_dir" ] || mkdir "$output_dir"

echo "CPU Arch : $CPU_ARCH"
echo "DNNC Mode : $DNNC_MODE"
echo "$(dnnc --version)"

$DNNC --parser=tensorflow \
    --frozen_pb=${tf_model} \
    --output_dir=${output_dir} \
    --dcf=${dnndk_dcf} \
    --mode=${DNNC_MODE} \
    --cpu_arch=${CPU_ARCH} \
    --net_name=${net}
In this section, the programming APIs for the DPU are explained in detail.
The DPU kernel is created with the DNNC tool, after compiling a frozen graph with a given DPU configuration. This operation transforms the neural network model into an equivalent DPU assembly file, which is then assembled into an ELF object. From the perspective of the runtime application, this file becomes the execution unit for N2Cube after invoking the API dpuLoadKernel(). N2Cube loads the DPU kernel, including the DPU instructions and network parameters, into the DPU dedicated memory space, allocating hardware resources. After that, each DPU kernel can be instantiated into several DPU tasks by calling dpuCreateTask() to enable multithreaded programming.
Each DPU task is a running entity of the DPU kernel. It has its own memory space, so that multithreaded applications can process several tasks in parallel.
A DPU node is a basic element of the network. It is associated with an input, an output and some parameters. Each node has a unique name, and the APIs are able to access its information. There are three types of nodes: boundary input nodes, boundary output nodes and internal nodes.

- A boundary input node is a node that doesn't have a precursor in the kernel topology. It is usually the first node of the kernel, and there can be more than one.
- A boundary output node is a node that doesn't have a successor.
- The rest of the nodes are labeled as internal nodes.
After compilation, the DNNC tool gives information about the input and output nodes of each kernel. An example is now displayed.
Compiling Network inception_v1
DNNDK : 3.1
Board Name : ZedBoard
DCF file : ../dcf/custom_zedboard.dcf
CPU Arch : arm32
DNNC Mode : debug
dnnc version v3.00
DPU Target : v1.4.0
Build Label: Aug 9 2019 05:23:25
Copyright @2019 Xilinx Inc. All Rights Reserved.
[DNNC][Warning] layer [InceptionV1_Logits_SpatialSqueeze] (type: Squeeze) is not supported in DPU, deploy it in CPU instead.
[DNNC][Warning] layer [InceptionV1_Logits_Predictions_Softmax] (type: Softmax) is not supported in DPU, deploy it in CPU instead.
DNNC Kernel topology "inception_v1_kernel_graph.jpg" for network "inception_v1"
DNNC kernel list info for network "inception_v1"
Kernel ID : Name
0 : inception_v1_0
1 : inception_v1_1
Kernel Name : inception_v1_0
--------------------------------------------------------------------------------
Kernel Type : DPUKernel
Code Size : 0.26MB
Param Size : 6.31MB
Workload MACs : 2996.75MOPS
IO Memory Space : 0.76MB
Mean Value : 0, 0, 0,
Node Count : 76
Tensor Count : 110
Input Node(s)(H*W*C)
InceptionV1_InceptionV1_Conv2d_1a_7x7_Conv2D(0) : 224*224*3
Output Node(s)(H*W*C)
InceptionV1_Logits_Conv2d_0c_1x1_Conv2D(0) : 1*1*1001
Kernel Name : inception_v1_1
--------------------------------------------------------------------------------
Kernel Type : CPUKernel
Input Node(s)(H*W*C)
InceptionV1_Logits_SpatialSqueeze : 1*1*1001
Output Node(s)(H*W*C)
InceptionV1_Logits_Predictions_Softmax : 1*1*1001
The kernel inception_v1_0, which is a DPU kernel, has InceptionV1_InceptionV1_Conv2d_1a_7x7_Conv2D as its input node and InceptionV1_Logits_Conv2d_0c_1x1_Conv2D as its output node.
When using the dpuGetInputTensor() API, the nodeName parameter is required to specify the boundary input node. DNNDK generates an error if a node which is not a boundary input node is indicated in the nodeName field. A similar error happens with the dpuGetOutputTensor() API.
A DPU tensor is a set of multidimensional data used to store information while running an application. For the DPU, the memory storage layout of input and output tensors is HWC (Height*Width*Channel), while a standard image usually has a CHW (Channel*Height*Width) format. This is important to take into account when feeding data to, or retrieving data from, the DPU.
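As a quick illustration of the difference between the two layouts (assuming the data is held in a numpy array on the CPU side), converting from CHW to HWC is a simple transpose:

import numpy as np

chw = np.zeros((3, 224, 224), dtype=np.float32)   # Channel*Height*Width, e.g. a Caffe-style image
hwc = chw.transpose(1, 2, 0)                      # Height*Width*Channel, the layout the DPU uses
print(hwc.shape)                                  # prints (224, 224, 3)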
Applications can be created using the C/C++ APIs, for which it is necessary to create the pre- and post-processing routines as well as the main application. In this release, though, there is also the possibility of using Python APIs, which enables reusing the pre-processing routine used during compression and compilation.

When programming for the DPU, it is very common to exchange data between the CPU and the DPU. A clear example happens when data is pre-processed on the CPU and fed into the DPU to execute the DPU-compatible layers of the neural network. This communication also happens if any layers of the neural network aren't compatible with the DPU, in which case they have to be executed on the CPU. To handle this type of operation, DNNDK provides a set of APIs to facilitate the exchange of information.
DNNDK APIs to set input tensor for the computation layer or node:
- dpuSetInputTensor()
- dpuSetInputTensorInCHWInt8()
- dpuSetInputTensorInCHWFP32()
- dpuSetInputTensorInHWCInt8()
- dpuSetInputTensorInHWCFP32()
DNNDK APIs to get output tensor from the computation layer or node:
- dpuGetOutputTensor()
- dpuGetOutputTensorInCHWInt8()
- dpuGetOutputTensorInCHWFP32()
- dpuGetOutputTensorInHWCInt8()
- dpuGetOutputTensorInHWCFP32()
DNNDK provides the following APIs to get the start address, size, quantization factor, and shape info for DPU input and output tensor:
- dpuGetTensorAddress()
- dpuGetTensorSize()
- dpuGetTensorScale()
- dpuGetTensorHeight()
- dpuGetTensorWidth()
- dpuGetTensorChannel()
The TensorFlow framework enables using very flexible pre-processing routines, with input images in BGR or RGB format. Therefore, the pre-defined APIs in the library libdputils.so cannot be used directly when deploying TensorFlow models, and users have to implement the pre-processing code themselves.

Although both languages are supported, C++ gives better performance, so it is recommended to port the final applications to C++.
The structure of an application making use of the APIs provided by the DNNDK v3.1 package is now described. Most of the steps always use the same API function, but in some of them, such as the input and output of the network data, several APIs can be used.
1. Open the DPU: this step is performed with the dpuOpen() API. The function attaches and opens the DPU device before the utilization of DPU resources.
2. Load the DPU kernel: uses the dpuLoadKernel() API. It loads a DPU kernel for the specified neural network from the hybrid CPU+DPU binary executable into the DPU memory space, including the kernel's DPU instructions, weights and biases. The function has one argument, which is the name of the DPU kernel output by the DNNC compiler.
3. Create a DPU task: performed with dpuCreateTask(). It instantiates a DPU task from a DPU kernel and allocates the corresponding DPU memory buffer. It takes a pointer to the kernel as a parameter. It is possible to select the mode of the task: normal mode, which is the default, profiling mode, which outputs performance information layer by layer, or dump mode, which dumps raw data for the DPU task. The last two modes are only available if the DNNC tool compiled the model in debug mode.
4. Network input data: there are several APIs that can be used to input the data, depending on its format. When working with images, dpuSetInputImage2() enables loading an image into the DPU without specifying its mean value. Its arguments are a pointer to the DPU task, the name of the input node and the image itself as a Mat object. The Mat datatype can be used with the OpenCV library. If your input is already a tensor, you can use the APIs specified in section DPU Tensor to set the input.
5. DPU task execution: performed with dpuRunTask(). This API only needs a pointer to the task that is to be executed, and performs inference of the network.
6. Network output data: there are several APIs in section DPU Tensor to retrieve the output data in several formats.
7. Destroy the task: performed with dpuDestroyTask(). Destroys the DPU task and releases its resources. It takes the pointer to the task as an argument. Returns 0 on success or a negative value on failure.
8. Destroy the kernel: performed with dpuDestroyKernel(). Destroys the DPU kernel and releases its resources. Takes the pointer to the kernel as an argument. Returns 0 on success.
9. Close the DPU: performed with dpuClose(). Detaches and closes the DPU device file.
If inference is only going to be performed once, the image pre-processing and the post-processing of the results can be done at any point of the program before the input of the network's data and after the output of the network's results, respectively. When inference is going to be performed on several images, it is recommended to include steps 4, 5 and 6 in a loop, performing the image pre-processing inside the loop before the data input and the post-processing inside the loop after the data output. A minimal sketch putting these steps together is shown below.
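The following is an illustrative Python sketch of this structure for the inception_v1_0 kernel compiled above, written against the Python bindings mentioned earlier (the n2cube and dputils modules shipped with the DNNDK v3.1 package). The module layout, the exact signatures, the image list and the omitted post-processing are assumptions that should be checked against the DNNDK v3.1 User Guide and the Python examples of the package; the equivalent C++ calls have the same names:

# Illustrative sketch only: module layout, signatures and node names should be
# checked against the DNNDK v3.1 User Guide and the package's Python examples.
import cv2
from dnndk import n2cube, dputils   # assumed module layout of the DNNDK v3.1 Python API

KERNEL_NAME = "inception_v1_0"                                  # from the DNNC output above
INPUT_NODE = "InceptionV1_InceptionV1_Conv2d_1a_7x7_Conv2D"
OUTPUT_NODE = "InceptionV1_Logits_Conv2d_0c_1x1_Conv2D"

def main():
    n2cube.dpuOpen()                                   # 1. open the DPU
    kernel = n2cube.dpuLoadKernel(KERNEL_NAME)         # 2. load the DPU kernel
    task = n2cube.dpuCreateTask(kernel, 0)             # 3. create a task (0 = normal mode)

    for line in open("test_images/image_list.txt"):    # hypothetical "<image name> <label>" list
        img = cv2.imread("test_images/" + line.split()[0])
        dputils.dpuSetInputImage2(task, INPUT_NODE, img)                   # 4. feed the input image
        n2cube.dpuRunTask(task)                                            # 5. run inference
        size = n2cube.dpuGetOutputTensorSize(task, OUTPUT_NODE)
        out = n2cube.dpuGetOutputTensorInHWCFP32(task, OUTPUT_NODE, size)  # 6. get the output
        # ... post-processing (e.g. softmax and top-1 selection) would go here ...

    n2cube.dpuDestroyTask(task)                        # 7. destroy the task
    n2cube.dpuDestroyKernel(kernel)                    # 8. destroy the kernel
    n2cube.dpuClose()                                  # 9. close the DPU

if __name__ == "__main__":
    main()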
There are several special cases, which are now presented:
- One DPU kernel and one CPU kernel: in this case the workflow is exactly the same as the one seen previously, but after step 6, the layers that run on the CPU have to be executed.
- More than one DPU and CPU kernel: if the application only performs inference on one image, steps 2 through 8 have to be executed for one kernel and then repeated for the next kernel. If inference is performed on several images, steps 2 through 8 have to be included inside the loop as many times as there are DPU kernels, with the image pre-processing performed before the data input of the first kernel and the post-processing performed after the data output of the last kernel.
In addition to the described APIs, there are several more that can be used throughout an application to retrieve, for example, the tensor address, size, height, width or number of channels. All the details of the APIs can be found in the DNNDK v3.1 User Guide.
Applications developed for the DPU are heterogeneous programs that have code running on the target CPU and code running on the DPU. The CPU code can be written in C/C++ and processed by a compiler such as GCC. The neural network, on the other hand, is compiled by DNNC for the DPU. In the final stage of the application, these codes have to be linked together by a linker such as GCC to produce a single hybrid binary executable.
In some cases the DPU ELF files cannot be linked with the CPU code. One case is when the CPU code is created with the Python APIs. In these cases, after the Caffe or TensorFlow models are compiled to DPU ELF files, users have to use the ARM GCC toolchain to transform them into DPU shared libraries.

For an x64 host system, an ARM cross-toolchain such as aarch64-linux-gnu-gcc for 64-bit ARM or arm-linux-gnu-gcc for 32-bit ARM can be used. On the DNNDK evaluation boards, the native gcc toolchain can be used. The command sample for ResNet50 looks as follows:
aarch64-linux-gnu-gcc -fPIC -shared \
dpu_resnet50_*.elf -o libdpumodelresnet50.so
By using the *, all the DPU ELF files are covered and wrapped into libdpumodelresnet50.so. This is useful when the DNNC compiler outputs more than one DPU kernel. Moreover, for each neural network model, its DPU ELF files should be linked into one unique shared library. If there is more than one neural network model in one DNNDK application, users must create as many shared libraries as there are models. These libraries should be placed in the same folder as the DNNDK application.