Network input output layout

Currently the input/output of the network to the AI FPGA module needs to be transposed, unless the transpose_weight flag is set when running the network convertor (referred to as the convertor below). The reason is that the internal buffer on the FPGA is limited in size, so when the dimensions of an input image or of the input buffers of hidden layers are too large, the FPGA partitions the run into several tiles. The limitation is on the width axis, so if the width of the input image is too large, the FPGA needs several runs to finish processing the layer.

Since in most cases an input image is wider than it is tall, it is better to transpose the image so that the height becomes the larger dimension; this minimizes the number of run partitions.

If the transpose_weight flag is set when running the convertor, the weight matrices are transposed instead, so the input/output of the network can retain their original layout.

Also, the FPGA performs calculations on 16-bit floating point numbers, so the input image needs to be converted to 16-bit floating point as well. Since the FP32 to FP16 conversion is estimated to be the more costly step, transposing the image should not add much of a performance hit on top of it.

Network I/O Layout

The layout of a converted network input is DWHC format in FP16 type. If the network is converted with transpose_weight = 1, then it is DHWC format in FP16 type.

The layout of a converted network output is DWHC format in float (FP32) type.
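As an illustration, here is a minimal sketch of how an input buffer in this layout could be prepared from an HxWxC FP32 image (prepare_input is a hypothetical helper, not part of the SDK; it also ignores the 8-channel chunking described in the next section, so it only applies when C <= 8):

    import numpy as np

    def prepare_input(image_hwc, transpose_weight=False):
        # The FPGA computes in FP16, so convert from FP32 first.
        img = image_hwc.astype(np.float16)
        if not transpose_weight:
            # DWHC layout: swap H and W so that W becomes the outer axis.
            img = np.ascontiguousarray(img.transpose(1, 0, 2))
        # Add a depth axis of 1 for a 2D image and flatten to a raw buffer.
        return img[np.newaxis, ...].ravel()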

Convolution Block I/O layout

The convolution block uses the following layout for I/O buffers:

  • If the number of channels of the image is <= 8
    • The image data is in DWHC format (or DHWC format if the transpose_weight flag is set), with each pixel value stored as a 16-bit floating point number. D refers to Depth (for 3D input), W to Width, H to Height, and C to Channel.
      • For example, the pixel data of (d, w, h, c) will be located at offset = (d * (W * H * C) + w * (H * C) + h * C + c) * 2 bytes from the beginning of the image, if transpose_weight is not set.
      • It will be located at offset = (d * (W * H * C) + h * (W * C) + w * C + c) * 2 bytes from the beginning of the image instead, if transpose_weight is set.
  • If the number of channels of the image is > 8
    • The image is divided into several chunks, each containing at most 8 channels of data. Each chunk of the image data uses the DWHC format (or DHWC format if the transpose_weight flag is set), as described above.
      • For example, if the image has 20 channels, it is divided into 3 chunks: the first chunk contains the image data of channels 0-7, the second of channels 8-15, and the last of channels 16-19.
      • The offset of the pixel data at (x, y, n) can be calculated as follows (a quick usage check follows this list):
        def pixel_offset(x, y, n, W, H, C, transpose_weight=False):
            chunk_id = n // 8                         # which 8-channel chunk
            n_ = n % 8                                # channel index within the chunk
            C_ = 8 if chunk_id < C // 8 else C % 8    # the last chunk may be partial
            if not transpose_weight:
                return (chunk_id * (W * H * 8) + x * (H * C_) + y * C_ + n_) * 2
            else:
                return (chunk_id * (W * H * 8) + y * (W * C_) + x * C_ + n_) * 2
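As a quick sanity check of pixel_offset against the 20-channel example above (a minimal sketch; the W and H values are arbitrary):

    W, H, C = 4, 6, 20                       # 20 channels -> chunks of 8, 8, and 4
    print(pixel_offset(0, 0, 0,  W, H, C))   # 0: start of the first chunk
    print(pixel_offset(0, 0, 8,  W, H, C))   # W * H * 8 * 2 = 384: start of the second chunk
    print(pixel_offset(0, 0, 16, W, H, C))   # 2 * W * H * 8 * 2 = 768: start of the last chunk

Note that with chunk_id = 0 and C_ = C, the same formula reduces to the <= 8 channel case described earlier (for 2D inputs, i.e. d = 0).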

Note that although we describe the input/output buffer as image data, it is not limited to images. Any data that can be represented as a 1D, 2D, or 3D array with multiple channels can be used.

Fully connected block I/O layout

The fully connected block treats its input/output buffer as a 1D array, so no special handling is needed.

When the output buffer of a convolution block is taken directly as the input buffer, the convertor rearranges the weights when the dimension W > 1 or H > 1, so that the weight order matches the layout of the convolution output buffer rather than the original framework's flattening order.
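As an illustration, here is a minimal sketch of such a rearrangement (rearrange_fc_weight is a hypothetical helper, not the convertor's actual implementation; it assumes a CHW-flattened source weight and the non-transposed WHC output layout, and ignores the 8-channel chunking, so it only applies when C <= 8):

    import numpy as np

    def rearrange_fc_weight(weight, W, H, C):
        # weight has shape (out_features, W * H * C).
        # Column j of the result corresponds to flat position j in the
        # FPGA's WHC order; perm[j] is that element's index in the
        # framework's original CHW flattening.
        perm = [c * (H * W) + h * W + w
                for w in range(W)
                for h in range(H)
                for c in range(C)]
        return weight[:, perm]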