Snapdragon NPE Survey
Architecture
- Convert the model to a DLC (Deep Learning Container) file.
  - Supports Caffe, Caffe2, and TensorFlow.
- Use SNPE (Snapdragon Neural Processing Engine) to run the model.
  - Supports the Snapdragon CPU, Adreno GPU, and Hexagon™ DSP (fixed point).
C++ API
- Get the available runtime: CPU, GPU, or DSP.

```cpp
static zdl::DlSystem::Runtime_t runtime;
if (zdl::SNPE::SNPEFactory::isRuntimeAvailable(zdl::DlSystem::Runtime_t::GPU)) {
    runtime = zdl::DlSystem::Runtime_t::GPU;
} else {
    runtime = zdl::DlSystem::Runtime_t::CPU;
}
```
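The snippet above only falls back from GPU to CPU; a DSP check can be added in the same way. A minimal sketch, assuming this SNPE release exposes the `Runtime_t::DSP` enum value:

```cpp
// Sketch: prefer DSP, then GPU, then CPU.
// Assumes Runtime_t::DSP exists in this SNPE release; newer releases
// use more specific DSP enum values.
static zdl::DlSystem::Runtime_t runtime = zdl::DlSystem::Runtime_t::CPU;
if (zdl::SNPE::SNPEFactory::isRuntimeAvailable(zdl::DlSystem::Runtime_t::DSP)) {
    runtime = zdl::DlSystem::Runtime_t::DSP;
} else if (zdl::SNPE::SNPEFactory::isRuntimeAvailable(zdl::DlSystem::Runtime_t::GPU)) {
    runtime = zdl::DlSystem::Runtime_t::GPU;
}
```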
- Load the container (i.e., load the model).

```cpp
std::unique_ptr<zdl::DlContainer::IDlContainer> container;
container = zdl::DlContainer::IDlContainer::open(dlc_file);  // dlc_file is the path to the .dlc file
```
- Set network builder options and build the network from the container.

```cpp
std::unique_ptr<zdl::SNPE::SNPE> snpe;
zdl::SNPE::SNPEBuilder snpeBuilder(container.get());
snpe = snpeBuilder.setOutputLayers({})                        // {} keeps the network's default output layers
                  .setRuntimeProcessor(runtime)               // runtime selected above
                  .setUdlBundle(udlBundle)                    // user-defined layers, if any
                  .setUseUserSuppliedBuffers(useUserSuppliedBuffers)
                  .build();
```
- Load network inputs.
  Network inputs and outputs can be either user-backed buffers or ITensors (built-in SNPE buffers), but not both. The advantage of user-backed buffers is that they eliminate the extra copy from user buffers that is needed to create ITensors. The two options are listed below (a minimal ITensor example is sketched after this list):
- zdl::DlSystem::UserBufferMap& userBufferMap
- zdl::DlSystem::ITensor
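To illustrate the ITensor path, here is a minimal sketch of creating an input tensor with the tensor factory listed in the summary below. It assumes the network was built as `snpe` above and that `imageData` is a preprocessed `std::vector<float>`; both names are illustrative.

```cpp
// Minimal ITensor sketch (assumes snpe from the builder above and a
// preprocessed std::vector<float> imageData).
// createTensor allocates a tensor matching the network's expected input shape.
const auto inputDims = snpe->getInputDimensions();
std::unique_ptr<zdl::DlSystem::ITensor> inputTensor =
    zdl::SNPE::SNPEFactory::getTensorFactory().createTensor(*inputDims);
// Copy the preprocessed data into the tensor (ITensor is iterable).
std::copy(imageData.begin(), imageData.end(), inputTensor->begin());
```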
- Execute the network and process the output.

```cpp
zdl::DlSystem::UserBufferMap inputMap;   // must reference user-backed buffers registered beforehand
zdl::DlSystem::UserBufferMap outputMap;
snpe->execute(inputMap, outputMap);

// Or use the TensorMap-based overload:
// bool execute(const zdl::DlSystem::TensorMap &input,
//              zdl::DlSystem::TensorMap &output) noexcept;
```
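To make the output handling concrete, here is a hedged sketch of the TensorMap path that reads back the first output tensor. The input name "data" and the iteration over results are assumptions for illustration, not part of the original snippet.

```cpp
// Sketch of the TensorMap overload (assumes inputTensor from the ITensor
// sketch above; "data" must match the model's actual input layer name).
zdl::DlSystem::TensorMap inputTensorMap;
inputTensorMap.add("data", inputTensor.get());

zdl::DlSystem::TensorMap outputTensorMap;
if (snpe->execute(inputTensorMap, outputTensorMap)) {
    zdl::DlSystem::StringList names = outputTensorMap.getTensorNames();
    zdl::DlSystem::ITensor* out = outputTensorMap.getTensor(names.at(0));
    for (auto it = out->cbegin(); it != out->cend(); ++it) {
        // *it is one float of the output (e.g. a class probability)
    }
}
```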
- Summary:
- zdl::DlContainer::IDlContainer
- zdl::SNPE::SNPEBuilder
- zdl::SNPE::SNPE
- zdl::DlSystem::ITensor
- zdl::SNPE::SNPEFactory::getTensorFactory().createTensor
- zdl::DlSystem::TensorMap
- zdl::DlSystem::UserBufferMap
Benchmark
Not tested by us; the numbers below are taken from the official documentation.
- Performance
  - Running on the GPU:
    - Typically a 6x-10x speedup compared with the CPU.
    - However, there is a 4-6 ms overhead per network execution on the GPU, so small networks (less than 10 ms on the GPU) run faster on the CPU.
- Size

| Toolchain | libSNPE.so | libSNPE_G.so |
|---|---|---|
| arm-android-gcc4.9 | 3.4 MB | 1.7 MB |
| aarch64-android-gcc4.9 | 8.2 MB | 7.2 MB |
How to integrate with Caffe2?
- One input is the DLC model buffer, passed as an operator argument:

```python
with open('submodel.dlc', 'rb') as f:
    dlc = f.read()

op = core.CreateOperator(
    'SNPE',
    ['data_in'],
    ['data_out'],
    arg=[
        utils.MakeArgument("model_buffer", dlc),
        utils.MakeArgument("input_name", "data")
    ]
)
```
- Wrap the C++ APIs of SNPE, then use these interfaces in SNPEOp; see snpe_ffi.cc. A sketch of such a wrapper follows.
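To illustrate the wrapping idea, here is a hypothetical C-style surface in the spirit of snpe_ffi.cc. The function names and signatures are assumptions for illustration, not the actual Caffe2 interface.

```cpp
// Hypothetical wrapper sketch (names and signatures are illustrative, not the
// real snpe_ffi.cc): expose a tiny C interface over zdl::SNPE::SNPE so the
// Caffe2 SNPEOp can call it without depending on the SNPE C++ headers.
#include <cstddef>

extern "C" {

// Build an SNPE network from an in-memory DLC buffer; returns an opaque handle.
void* snpe_create(const unsigned char* dlc_buffer, size_t dlc_size,
                  const char* input_name);

// Run one forward pass: copy `input` in, execute, copy the result into `output`.
void snpe_run(void* handle, const float* input, size_t input_size, float* output);

// Release the network and its runtime resources.
void snpe_destroy(void* handle);

}  // extern "C"
```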