Driver interface API - DigitalMediaProfessionals/dv-sdk GitHub Wiki

The complete list of functions and structures can be found in dmp_dv.h and dmp_dv_cmdraw_v0.h.

Context

To work with the device, a context must be created first. dmp_dv_context is the data type for working with the context.

`dmp_dv_context_create`

```c
dmp_dv_context dmp_dv_context_create();
```

Creates a context for working with the AI processor. It is thread-safe.

Returns non-NULL on success, NULL on error.

`dmp_dv_context_release`

```c
void dmp_dv_context_release(dmp_dv_context ctx);
```

Releases the context for working with the AI processor (decreases the reference counter). Should be called when ctx is no longer needed. It is thread-safe.

ctx Context for working with the AI processor; if NULL, it is ignored.

`dmp_dv_context_retain`

```c
void dmp_dv_context_retain(dmp_dv_context ctx);
```

Retains the context for working with the AI processor (increases the reference counter). It is thread-safe.

ctx Context for working with the AI processor; if NULL, it is ignored.

`dmp_dv_context_get_info_string`

```c
const char *dmp_dv_context_get_info_string(dmp_dv_context ctx);
```

Returns information about the context as a human-readable string. It is thread-safe.

`dmp_dv_context_get_info`

```c
// Structure with information about the context.
struct dmp_dv_info {
  uint32_t size;            // size of this structure
  uint32_t version;         // version of this structure
};

// Structure with information about the context (version 0).
struct dmp_dv_info_v0 {
  struct dmp_dv_info header;   // general structure information
  int32_t ub_size;             // unified buffer size
  int32_t max_kernel_size;     // maximum supported convolutional kernel size
  int32_t conv_freq;           // convolutional block frequency in MHz
  int32_t fc_freq;             // fully connected block frequency in MHz
  int32_t max_fc_vector_size;  // fully connected block maximum input vector size in elements
  int32_t rsvd;                // padding to 64 bits
};
```

```c
int dmp_dv_context_get_info(dmp_dv_context ctx, struct dmp_dv_info *info);
```

Fills the structure with information about the context. On return, the version field is set to the maximum supported version that is less than or equal to the requested one; the fields of the corresponding structure are filled only if size is large enough. It is thread-safe.

ctx Context for working with the AI processor; if NULL, it is ignored.
info Structure to be filled; its size and version fields must be set.

Returns 0 on success, non-zero otherwise.
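The size/version handshake can be illustrated with a small request helper. This is a minimal sketch: the structures are local copies of the definitions above so that the snippet compiles on its own (real code should include dmp_dv.h instead), and info_v0_request is a hypothetical helper, not part of the SDK.

```c
#include <stdint.h>
#include <string.h>

// Local copies of the structures listed above (real code: include dmp_dv.h).
struct dmp_dv_info {
  uint32_t size;     // size of this structure
  uint32_t version;  // version of this structure
};

struct dmp_dv_info_v0 {
  struct dmp_dv_info header;
  int32_t ub_size;
  int32_t max_kernel_size;
  int32_t conv_freq;
  int32_t fc_freq;
  int32_t max_fc_vector_size;
  int32_t rsvd;
};

// Hypothetical helper: prepare a request for version 0 information.
// The size and version fields must be set before calling dmp_dv_context_get_info().
static void info_v0_request(struct dmp_dv_info_v0 *info) {
  memset(info, 0, sizeof(*info));              // start from a clean structure
  info->header.size = (uint32_t)sizeof(*info); // tell the driver how much room there is
  info->header.version = 0;                    // highest version the caller understands
}
```

After info_v0_request(&info), the call would be dmp_dv_context_get_info(ctx, &info.header); on a 0 return, info.header.version tells which structure version was actually filled.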

Memory management

This section contains functions for allocating and working with memory accessible to the device. dmp_dv_mem is the data type for memory allocation.

`dmp_dv_mem_alloc`

```c
dmp_dv_mem dmp_dv_mem_alloc(dmp_dv_context ctx, size_t size);
```

Allocates a physically contiguous chunk of memory accessible to the device. The memory is allocated using ION with CMA and is not yet mapped to the user or kernel address space. It is thread-safe.

ctx Context for working with the AI processor; if NULL, an error is returned.
size Memory size in bytes.

Returns a handle to the allocated memory or NULL on error.

`dmp_dv_mem_release`

```c
void dmp_dv_mem_release(dmp_dv_mem mem);
```

Releases the allocated memory (decreases the reference counter). Should be called when mem is no longer needed. dmp_dv_mem_unmap() is called automatically before the memory is returned to the system. It is thread-safe.

mem Handle to the allocated memory; if NULL, it is ignored.

`dmp_dv_mem_retain`

```c
void dmp_dv_mem_retain(dmp_dv_mem mem);
```

Retains the allocated memory (increases the reference counter). It is thread-safe.

mem Handle to the allocated memory; if NULL, it is ignored.

`dmp_dv_mem_map`

```c
uint8_t *dmp_dv_mem_map(dmp_dv_mem mem);
```

Maps previously allocated memory to the user address space. The returned memory can be read and written; the executable flag is not set. If the memory is already mapped, the same pointer is returned. It is thread-safe only when called on different memory handles.

mem Handle to the allocated memory; if NULL, an error is returned.

Returns a pointer to the memory region in the user address space or NULL on error.

`dmp_dv_mem_unmap`

```c
void dmp_dv_mem_unmap(dmp_dv_mem mem);
```

Unmaps previously allocated and mapped memory from the user address space. The function can be called repeatedly. dmp_dv_mem_sync_end() is called automatically before unmapping. It is thread-safe only when called on different memory handles.

mem Handle to the allocated memory; if NULL, an error is reported.

`dmp_dv_mem_sync_start`

```c
int dmp_dv_mem_sync_start(dmp_dv_mem mem, int rd, int wr);
```

Starts Device <-> CPU synchronization of the memory buffer. When called multiple times with the same or fewer rd | wr flags, the function does nothing. It is thread-safe only when called on different memory handles.

mem Handle to the allocated memory; if NULL, an error is returned.
rd If non-zero, the Device -> CPU synchronization occurs before this function returns.
wr If non-zero, the CPU -> Device synchronization occurs on dmp_dv_mem_sync_end().

Returns 0 on success, non-zero otherwise.

`dmp_dv_mem_sync_end`

```c
int dmp_dv_mem_sync_end(dmp_dv_mem mem);
```

Finishes the last started Device <-> CPU synchronization. If called a second time before the next call to dmp_dv_mem_sync_start(), the function does nothing. It is thread-safe only when called on different memory handles.

mem Handle to the allocated memory; if NULL, an error is returned.

Returns 0 on success, non-zero otherwise.

`dmp_dv_mem_get_size`

```c
size_t dmp_dv_mem_get_size(dmp_dv_mem mem);
```

Returns the allocated size in bytes for the provided memory handle. It is thread-safe.

mem Handle to the allocated memory; if NULL, the function returns 0 and sets the error message.

Returns the size in bytes (can be greater than requested in dmp_dv_mem_alloc()) or 0 if mem is NULL.

Command lists

A command list contains several commands for execution, e.g. a chain of convolution operations. It packs the commands into a hardware-specific, ready-to-execute format, allowing low-latency execution of multiple operations at once. dmp_dv_cmdlist is the data type for a command list.

`dmp_dv_cmdlist_create`

```c
dmp_dv_cmdlist dmp_dv_cmdlist_create(dmp_dv_context ctx);
```

Creates a command list. It is thread-safe.

ctx Context for working with the AI processor; if NULL, an error is returned.

Returns a handle to the command list or NULL on error.

`dmp_dv_cmdlist_release`

```c
void dmp_dv_cmdlist_release(dmp_dv_cmdlist cmdlist);
```

Releases the command list (decreases the reference counter). Should be called when cmdlist is no longer needed. It is thread-safe.

cmdlist Handle to the command list; if NULL, it is ignored.

`dmp_dv_cmdlist_retain`

```c
void dmp_dv_cmdlist_retain(dmp_dv_cmdlist cmdlist);
```

Retains the command list (increases the reference counter). It is thread-safe.

cmdlist Handle to the command list; if NULL, it is ignored.

`dmp_dv_cmdlist_commit`

```c
int dmp_dv_cmdlist_commit(dmp_dv_cmdlist cmdlist);
```

Commits the command list, preparing device-specific structures for further execution. It is thread-safe only when called on different command lists.

cmdlist Handle to the command list; if NULL, an error is returned.

Returns 0 on success, non-zero otherwise.

`dmp_dv_cmdlist_exec`

```c
int64_t dmp_dv_cmdlist_exec(dmp_dv_cmdlist cmdlist);
```

Schedules the command list for execution. Each context is associated with a single execution queue. It is thread-safe.

cmdlist Handle to the command list; if NULL, an error is returned.

Returns an exec_id >= 0 identifying this execution on success, < 0 on error.

`dmp_dv_cmdlist_wait`

```c
int dmp_dv_cmdlist_wait(dmp_dv_cmdlist cmdlist, int64_t exec_id);
```

Waits for the specified scheduled execution to complete. It is thread-safe.

cmdlist Handle to the command list; if NULL, an error is returned.
exec_id ID of the scheduled execution to wait for, as returned by dmp_dv_cmdlist_exec().

Returns 0 on success, non-zero otherwise.

`dmp_dv_cmdlist_add_raw`

```c
// Memory buffer specification.
struct dmp_dv_buf {
  union {
    dmp_dv_mem mem;  // memory handle
    uint64_t rsvd;   // padding to 64-bit size
  };
  uint64_t offs;     // offset from the start of the buffer, must be 16-bit aligned
};

// Convolutional device type id.
#define DMP_DV_DEV_CONV 1

// Fully connected device type id.
#define DMP_DV_DEV_FC 2

// Upper bound of different device type ids.
#define DMP_DV_DEV_COUNT 3

// Raw command for execution.
struct dmp_dv_cmdraw {
  uint32_t size;        // size of this structure
  uint8_t device_type;  // device type
  uint8_t version;      // version of this structure
  uint8_t rsvd[2];      // padding to 64-bit size
};
```

```c
// Convolution layer runs.
struct dmp_dv_cmdraw_conv_v0_run {
  struct dmp_dv_buf weight_buf;  // Buffer with packed weights
  uint32_t conv_pad;        // Bits [7:0] = left padding, bits [15:8] = right padding, bits [23:16] = top padding, bits [31:24] = bottom padding
  uint32_t pool_pad;        // Bits [7:0] = left padding, bits [15:8] = right padding, bits [23:16] = top padding, bits [31:24] = bottom padding
  uint16_t m;               // Number of Output Channels
  uint16_t conv_enable;     // 1 = Enabled, 0 = Disabled, 3 = Depthwise
  uint16_t p;               // Filter Size (bits [7:0] = width, bits [15:8] = height)
  uint16_t pz;              // Filter Depth (1 in case of 2D convolution)
  uint16_t conv_stride;     // Bits [7:0] = X stride, bits [15:8] = Y stride
  uint16_t conv_dilation;   // Bits [7:0] = X dilation, bits [15:8] = Y dilation
  uint16_t weight_fmt;      // Weights format (0 = random access blocks, 1 = compact stream, 3 = 8-bit quantized stream)
  uint16_t pool_enable;     // 0 = disabled, 1 = max pooling, 2 = average pooling, 4 = upsampling
  uint16_t pool_avg_param;  // Usually set to 1/pool_size^2 in FP16 when using average pooling (average pooling assumes square size)
  uint16_t pool_size;       // Bits [7:0] = width, bits [15:8] = height
  uint16_t pool_stride;     // Bits [7:0] = X stride, bits [15:8] = Y stride
  uint16_t actfunc;         // Activation Function: 0 = None, 1 = Tanh, 2 = Leaky ReLU, 3 = Sigmoid, 4 = PReLU, 5 = ELU, 6 = ReLU6
  uint16_t actfunc_param;   // Leaky ReLU parameter in FP16
  uint16_t rectifi_en;      // Rectification, i.e. max(0, x) (NOTE: Can be applied after non-ReLU activation function)
  uint16_t lrn;             // Bit [0]: 1 = LRN enable, 0 = LRN disable; bit [1]: 1 = incl. power func, 0 = excl.; bits [11:8]: x^2 scale factor log2
  uint16_t rsvd;            // padding to 64-bit size
};

// Raw command for convolutional block version 0.
struct dmp_dv_cmdraw_conv_v0 {
  struct dmp_dv_cmdraw header;        // General structure information

  struct dmp_dv_buf input_buf;        // Input buffer
  struct dmp_dv_buf output_buf;       // Output buffer
  struct dmp_dv_buf eltwise_buf;      // Buffer for elementwise add (0 = UBUF Input Buffer)
  uint32_t topo;                      // [31:0] Output Destination of each run, 0 = UBUF, 1 = EXTMEM
  uint16_t w;                         // Input Width
  uint16_t h;                         // Input Height
  uint16_t z;                         // Input Depth
  uint16_t c;                         // Input Channels
  uint16_t input_circular_offset;     // Input Depth circular offset
  uint16_t output_mode;               // 0 = concat, 1 = elementwise add

  struct dmp_dv_cmdraw_conv_v0_run run[32];  // description of each run
};

// Raw command for fully connected block version 0.
struct dmp_dv_cmdraw_fc_v0 {
  struct dmp_dv_cmdraw header;    // General structure information

  struct dmp_dv_buf weight_buf;   // Buffer with packed weights
  struct dmp_dv_buf input_buf;    // Input buffer
  struct dmp_dv_buf output_buf;   // Output buffer

  uint16_t input_size;     // Size of the input in elements
  uint16_t output_size;    // Size of the output in elements

  uint16_t weight_fmt;     // Weights format: 0 = half-float unquantized, 1 = 8-bit quantized

  uint16_t actfunc;        // Activation Function: 0 = None, 1 = ReLU, 2 = Tanh, 3 = Leaky ReLU, 4 = Sigmoid, 5 = PReLU (PReLU must be used with POST-OP=1)
  uint16_t actfunc_param;  // Leaky ReLU parameter (in FP16 format), 0 = non-leaky
  uint16_t rsvd[3];        // padding to 64-bit size
};

// Raw command for convolutional block version 1.
struct dmp_dv_cmdraw_conv_v1 {
  struct dmp_dv_cmdraw header;  // General structure information

  struct dmp_dv_buf u8tofp16_table;  // u8tofp16 conversion table
  uint16_t to_bgr;                   // flag to convert input to BGR format
  uint16_t rsvd[3];                  // padding

  struct dmp_dv_cmdraw_conv_v0 conv_cmd;  // Includes version 0 command
};
```
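Several fields of struct dmp_dv_cmdraw_conv_v0_run pack two or four small values into one integer, as described in the field comments above. A minimal sketch of helpers that build these values; pack_pad, pack_xy and pack_lrn are hypothetical names and not part of the SDK:

```c
#include <stdint.h>

// conv_pad / pool_pad: bits [7:0] left, [15:8] right, [23:16] top, [31:24] bottom.
static uint32_t pack_pad(uint8_t left, uint8_t right, uint8_t top, uint8_t bottom) {
  return (uint32_t)left | ((uint32_t)right << 8) |
         ((uint32_t)top << 16) | ((uint32_t)bottom << 24);
}

// p, conv_stride, conv_dilation, pool_size, pool_stride:
// bits [7:0] = X (width), bits [15:8] = Y (height).
static uint16_t pack_xy(uint8_t x, uint8_t y) {
  return (uint16_t)((unsigned)x | ((unsigned)y << 8));
}

// lrn: bit [0] = enable, bit [1] = include power function,
// bits [11:8] = log2 of the x^2 scale factor (only the low 4 bits are kept).
static uint16_t pack_lrn(int enable, int incl_power, unsigned scale_log2) {
  return (uint16_t)((enable ? 1u : 0u) | (incl_power ? 2u : 0u) |
                    ((scale_log2 & 0xFu) << 8));
}
```

For example, a 3x3 convolution with stride 1 and one pixel of padding on each side would use p = pack_xy(3, 3), conv_stride = pack_xy(1, 1) and conv_pad = pack_pad(1, 1, 1, 1).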

```c
int dmp_dv_cmdlist_add_raw(dmp_dv_cmdlist cmdlist, struct dmp_dv_cmdraw *cmd);
```

Adds a raw command to the command list. It is thread-safe only when called on different command lists.

cmdlist Handle to the command list; if NULL, an error is returned.
cmd Raw command for execution. See dmp_dv_cmdraw_v0.h for the description of command version 0 and dmp_dv_cmdraw_v1.h for command version 1.

Returns 0 on success, non-zero otherwise. Known error codes:

EINVAL Invalid argument, such as an incorrect structure size.
ENOTSUP The raw command version is not supported.
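Every raw command starts with a header whose size, device_type and version fields let the driver validate the command. A minimal sketch of filling a fully connected version 0 command before passing it as dmp_dv_cmdlist_add_raw(cmdlist, &cmd.header); the structures are local copies of the definitions above so the snippet compiles on its own, dmp_dv_mem is assumed to be an opaque pointer type, and init_fc_v0_cmd is a hypothetical helper. Real code should include dmp_dv.h and dmp_dv_cmdraw_v0.h instead.

```c
#include <stdint.h>
#include <string.h>

// Local copies of the SDK definitions (assumption: dmp_dv_mem is an opaque pointer).
typedef struct dmp_dv_mem_impl *dmp_dv_mem;

struct dmp_dv_buf {
  union {
    dmp_dv_mem mem;
    uint64_t rsvd;
  };
  uint64_t offs;
};

struct dmp_dv_cmdraw {
  uint32_t size;
  uint8_t device_type;
  uint8_t version;
  uint8_t rsvd[2];
};

#define DMP_DV_DEV_FC 2

struct dmp_dv_cmdraw_fc_v0 {
  struct dmp_dv_cmdraw header;
  struct dmp_dv_buf weight_buf;
  struct dmp_dv_buf input_buf;
  struct dmp_dv_buf output_buf;
  uint16_t input_size;
  uint16_t output_size;
  uint16_t weight_fmt;
  uint16_t actfunc;
  uint16_t actfunc_param;
  uint16_t rsvd[3];
};

// Hypothetical helper: fill the header and sizes of an FC command.
// Buffers are left zeroed; the caller must set weight_buf, input_buf and output_buf.
static void init_fc_v0_cmd(struct dmp_dv_cmdraw_fc_v0 *cmd,
                           uint16_t input_size, uint16_t output_size) {
  memset(cmd, 0, sizeof(*cmd));                // zero reserved fields and buffers
  cmd->header.size = (uint32_t)sizeof(*cmd);   // a wrong size is reported as EINVAL
  cmd->header.device_type = DMP_DV_DEV_FC;
  cmd->header.version = 0;                     // an unsupported version gives ENOTSUP
  cmd->input_size = input_size;
  cmd->output_size = output_size;
  cmd->weight_fmt = 0;                         // half-float unquantized weights
  cmd->actfunc = 1;                            // ReLU in the FC activation encoding
}
```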

Weights packing

This section contains functions for packing layer weights (Convolutional/Fully Connected) to the hardware-specific layout.

`dmp_dv_pack_conv_weights`

```c
int dmp_dv_pack_conv_weights(
    int n_channels, int kx, int ky, int n_kernels,
    const uint16_t quant_map[256],
    const void *weights, const uint16_t *bias,
    uint8_t *packed_weights, size_t *packed_weights_size);
```

Packs convolution layer weights and biases into the output array. It is thread-safe.

n_channels Number of input channels.
kx Kernel width.
ky Kernel height.
n_kernels Number of output channels.
quant_map Quantization table for weights (but not bias), 256 elements, can be NULL.
weights If quant_map is NULL, array of half precision floating point weights in NCHW format (N=output_channel_size), else array of 1-byte indices.
bias Array of half precision floating point biases of size n_kernels.
packed_weights Output buffer for the packed weights (can be NULL if *packed_weights_size is 0).
packed_weights_size On input, the size of the packed_weights buffer in bytes (can be 0 to query the required size); on output, the required buffer size.

Returns 0 on success, non-zero otherwise.
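When quant_map is NULL, the weights and biases passed to the packing functions are half precision values stored as uint16_t bit patterns; the same applies to fields such as pool_avg_param and actfunc_param. If the source data is in float, it has to be converted first. A simplified sketch of such a conversion (float_to_fp16 is a hypothetical helper): it rounds to nearest even, flushes values below the smallest normal FP16 number to zero and maps overflow and NaN to infinity, so a production converter should also handle subnormals and NaN.

```c
#include <stdint.h>
#include <string.h>

// Convert a float to IEEE-754 binary16 (FP16) bits, rounding to nearest even.
// Simplified: results below the smallest normal FP16 value are flushed to a
// signed zero, and overflow, infinity and NaN all map to infinity.
static uint16_t float_to_fp16(float f) {
  uint32_t x;
  memcpy(&x, &f, sizeof x);                              // reinterpret the float bits
  uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
  int32_t e = (int32_t)((x >> 23) & 0xFFu) - 127 + 15;   // rebias the exponent
  uint32_t mant = x & 0x7FFFFFu;
  if (e <= 0) return sign;                               // too small: flush to zero
  if (e >= 31) return (uint16_t)(sign | 0x7C00u);        // too large: infinity
  uint32_t half = ((uint32_t)e << 10) | (mant >> 13);
  uint32_t rem = mant & 0x1FFFu;                         // the 13 discarded bits
  if (rem > 0x1000u || (rem == 0x1000u && (half & 1u)))
    half++;                                              // may carry into the exponent
  return (uint16_t)(sign | half);
}
```

For 2x2 average pooling, pool_avg_param = 1/2^2 = 0.25, and float_to_fp16(0.25f) gives 0x3400.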

`dmp_dv_pack_dil_weights`

```c
int dmp_dv_pack_dil_weights(
    int n_channels, int kx, int ky, int n_kernels,
    const uint16_t quant_map[256],
    const void *weights, const uint16_t *bias,
    uint8_t *packed_weights, size_t *packed_weights_size);
```

Packs dilated convolution layer weights and biases into the output array. It is thread-safe.

n_channels Number of input channels.
kx Kernel width.
ky Kernel height.
n_kernels Number of output channels.
quant_map Quantization table for weights (but not bias), 256 elements, can be NULL.
weights If quant_map is NULL, array of half precision floating point weights in NCHW format (N=output_channel_size), else array of 1-byte indices.
bias Array of half precision floating point biases of size n_kernels.
packed_weights Output buffer for the packed weights (can be NULL if *packed_weights_size is 0).
packed_weights_size On input, the size of the packed_weights buffer in bytes (can be 0 to query the required size); on output, the required buffer size.

Returns 0 on success, non-zero otherwise.

`dmp_dv_pack_fc_weights`

```c
int dmp_dv_pack_fc_weights(
    int c_input, int h_input, int w_input,
    int c_output, int h_output, int w_output,
    const uint16_t quant_map[256],
    const void *weights, const uint16_t *bias,
    uint8_t *packed_weights, size_t *packed_weights_size);
```

Packs fully connected layer weights and biases into the output array, possibly rearranging them to match the input and output shapes. The function takes weights in NCHW format, rearranges them to match input in the AI processor format WHC8 (n_channels / 8, width, height, 8 channels), and reorders the outputs so that the result is also produced in WHC8. It is thread-safe.

c_input Number of input channels.
h_input Input height (set to 1 for 1D input).
w_input Input width (set to 1 for 1D input).
c_output Number of output channels.
h_output Output height (set to 1 for 1D output).
w_output Output width (set to 1 for 1D output).
quant_map Quantization table for weights (but not bias), 256 elements, can be NULL.
weights If quant_map is NULL, array of half precision floating point weights in NCHW format (N=output_channel_size), else array of 1-byte indices.
bias Array of half precision floating point biases of size output_size=c_output*h_output*w_output.
packed_weights Output buffer for the packed weights (can be NULL if *packed_weights_size is 0).
packed_weights_size On input, the size of the packed_weights buffer in bytes (can be 0 to query the required size); on output, the required buffer size.

Returns 0 on success, non-zero otherwise.
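The WHC8 layout mentioned above can be made concrete with an index computation. A minimal sketch, assuming the dimensions in (n_channels / 8, width, height, 8 channels) are listed from outermost to innermost and that the channel count is a multiple of 8; how a partial group of channels is padded is not covered, and whc8_offset and chw_to_whc8 are hypothetical helpers, not SDK functions.

```c
#include <stddef.h>
#include <stdint.h>

// Offset of element (channel c, row y, column x) in a tensor laid out as
// (channels / 8, width, height, 8 channels), outermost dimension first.
static size_t whc8_offset(int c, int y, int x, int width, int height) {
  size_t group = (size_t)(c / 8);  // which block of 8 channels
  return ((group * (size_t)width + (size_t)x) * (size_t)height + (size_t)y) * 8u
         + (size_t)(c % 8);
}

// Rearrange a single CHW image (NCHW with N = 1) of FP16 values into WHC8.
// Only channel counts that are a positive multiple of 8 are handled; returns -1 otherwise.
static int chw_to_whc8(const uint16_t *src, uint16_t *dst,
                       int channels, int height, int width) {
  if (channels <= 0 || channels % 8 != 0) return -1;
  for (int c = 0; c < channels; ++c)
    for (int y = 0; y < height; ++y)
      for (int x = 0; x < width; ++x)
        dst[whc8_offset(c, y, x, width, height)] =
            src[((size_t)c * (size_t)height + (size_t)y) * (size_t)width + (size_t)x];
  return 0;
}
```

Under this reading, the 8 channels of one pixel are adjacent in memory, and consecutive rows of the same column follow each other before the next column starts.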

General information functions

This section contains general information functions.

`dmp_dv_get_version_string`

```c
const char *dmp_dv_get_version_string();
```

Returns the version string of the driver interface. It is thread-safe.

The returned string starts with HW_MAJOR.HW_MINOR.YYYYMMDD, for example "7.0.20181214", where:

HW_MAJOR - supported hardware revision major,
HW_MINOR - supported hardware revision minor,
YYYYMMDD - release date.
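Code that needs to check the supported hardware revision can parse this prefix. A minimal sketch using sscanf; parse_dv_version is a hypothetical helper, and the test string is the example from above rather than real driver output.

```c
#include <stdio.h>

// Parse the leading "HW_MAJOR.HW_MINOR.YYYYMMDD" part of a version string.
// Returns 0 on success, -1 if the string is NULL or does not start with that pattern.
static int parse_dv_version(const char *s, int *major, int *minor, long *date) {
  if (s == NULL) return -1;
  if (sscanf(s, "%d.%d.%ld", major, minor, date) != 3) return -1;
  return 0;
}
```

For instance, parse_dv_version(dmp_dv_get_version_string(), &major, &minor, &date) would yield 7, 0 and 20181214 for the example string.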

`dmp_dv_get_last_error_message`

```c
const char *dmp_dv_get_last_error_message();
```

Returns the last error message. It might return garbage if several functions fail simultaneously from multiple threads during this call.

`dmp_dv_fpga_device_exists`

```c
int dmp_dv_fpga_device_exists(dmp_dv_context ctx, int dev_type_id);
```

Returns whether the dedicated FPGA block exists.

ctx Context for working with the AI processor; if NULL, -1 is returned.
dev_type_id Device type id; must be one of the DMP_DV_DEV_* constants.

Returns -1 on invalid argument, 1 if the block exists, 0 otherwise.