Hexagon NPU FastRPC Backend Overview

Overview

The Hexagon NPU FastRPC backend provides hardware acceleration for GGML operations on Qualcomm's Hexagon NPU. Built entirely from scratch on Qualcomm's FastRPC framework, it is a low-level alternative to QNN SDK-based implementations. By bypassing high-level abstractions and programming directly with HVX intrinsics, it offloads compute-intensive operations from the CPU to the NPU, targeting maximum performance and power efficiency for machine learning workloads on Snapdragon platforms.

Key Features

Custom FastRPC Implementation

  • Built from Scratch: Complete custom implementation using Qualcomm's FastRPC framework for direct hardware control
  • Minimal Abstraction Layer: Lightweight, efficient abstraction objects on the host side for seamless GGML graph/tensor/buffer offloading
  • Zero-Copy Communication: FastRPC and ION buffers enable efficient data sharing between CPU and NPU domains (see the sketch after this list)
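
To make the zero-copy path concrete, here is a minimal, self-contained sketch of allocating a shared ION-backed buffer through the Hexagon SDK's rpcmem API. The buffer size and usage shown are illustrative, not the backend's actual allocation strategy:

```cpp
// Minimal sketch of zero-copy buffer allocation via the FastRPC rpcmem API.
// rpcmem_alloc/rpcmem_free come from the Hexagon SDK's rpcmem.h; the heap id
// and flags below are the conventional defaults.
#include <cstdio>

extern "C" {
#include "rpcmem.h"  // Hexagon SDK: rpcmem_init, rpcmem_alloc, rpcmem_free
}

int main() {
    rpcmem_init();

    // One ION-backed allocation visible to both the CPU and the NPU domain.
    const int size = 16 * 1024 * 1024;  // e.g. room for a weight tensor (illustrative)
    void * buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, size);
    if (!buf) {
        fprintf(stderr, "rpcmem_alloc failed\n");
        return 1;
    }

    // ... host writes tensor data into buf, then hands it to the NPU through
    // the FastRPC interface; the device reads it in place, no copy needed ...

    rpcmem_free(buf);
    rpcmem_deinit();
    return 0;
}
```

Because FastRPC maps the same pages into the DSP's address space, the NPU can read and write the tensor in place; only the buffer's identity crosses the RPC boundary.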

Direct Hardware Programming

  • Raw HVX Intrinsics: Critical operations like mul_mat and add implemented using direct Hexagon Vector Extensions (see the sketch after this list)
  • Custom Thread Pool: 4-thread parallel execution matching the NPU's hardware thread count
  • VTCM Management: Thread-specific allocation of high-speed VTCM scratch memory to reduce memory bandwidth pressure
  • L2 Cache Optimization: Prefetching and cache-aware memory access patterns for maximum bandwidth utilization
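
As a flavor of what direct HVX programming looks like, the sketch below implements a simple float32 element-wise add with HVX intrinsics. It assumes an HVX v68+ target (for the qf32 float intrinsics), 128-byte vectors, 128-byte-aligned pointers, and a length that is a multiple of 32; it is an illustration, not the backend's actual kernel:

```cpp
// Illustrative HVX element-wise float32 add. Each HVX_Vector is 128 bytes,
// i.e. 32 floats. Requires an HVX target with qfloat support (v68+).
#include <hexagon_types.h>
#include <hvx_hexagon_protos.h>

// Assumes n is a multiple of 32 and both inputs/output are 128-byte aligned.
void hvx_add_f32(const float * src0, const float * src1, float * dst, int n) {
    const HVX_Vector * a = reinterpret_cast<const HVX_Vector *>(src0);
    const HVX_Vector * b = reinterpret_cast<const HVX_Vector *>(src1);
    HVX_Vector *       d = reinterpret_cast<HVX_Vector *>(dst);

    for (int i = 0; i < n / 32; ++i) {
        // Add two IEEE float vectors in the qf32 intermediate format,
        // then convert the result back to standard single precision.
        HVX_Vector sum = Q6_Vqf32_vadd_VsfVsf(a[i], b[i]);
        d[i] = Q6_Vsf_equals_Vqf32(sum);
    }
}
```

The real kernels additionally handle unaligned edges, broadcasting, and prefetching, which is where most of the complexity lives.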

Advanced Quantization Support

  • Hardware-Accelerated Formats: Q4_0, Q8_0, Q4_K quantized data types with custom HVX dequantization functions (the Q4_0 block layout is sketched below)
  • Mixed Precision Operations: Support for quantized (Q4_0, Q8_0, Q4_K) and FP32 mixed operations with efficient conversion tables
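
For reference, Q4_0 is ggml's standard 4-bit block format: 32 weights share one fp16 scale, and each weight decodes as (q - 8) * d. The scalar reference below shows the layout and decode step that the backend's HVX dequantization kernels vectorize:

```cpp
// GGML's Q4_0 block layout and a scalar reference decoder. The backend's
// quants.cpp implements HVX-vectorized versions of this logic.
#include <cstdint>
#include <cstring>

#define QK4_0 32

struct block_q4_0 {
    uint16_t d;              // scale, stored as IEEE fp16 (ggml_half)
    uint8_t  qs[QK4_0 / 2];  // 32 x 4-bit quants, packed two per byte
};

// Minimal fp16 -> fp32 conversion (handles normals, subnormals, inf/NaN).
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t mant = h & 0x3FF;
    if (exp == 0) {
        return (sign ? -1.0f : 1.0f) * (float) mant * 5.9604645e-8f;  // 2^-24
    }
    const uint32_t bits = (exp == 31) ? (sign | 0x7F800000u | (mant << 13))
                                      : (sign | ((exp + 112) << 23) | (mant << 13));
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Decode one block: low nibbles fill y[0..15], high nibbles y[16..31].
void dequantize_block_q4_0(const block_q4_0 * x, float * y) {
    const float d = fp16_to_fp32(x->d);
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (x->qs[j] & 0x0F) - 8;
        const int hi = (x->qs[j] >> 4) - 8;
        y[j]             = (float) lo * d;
        y[j + QK4_0 / 2] = (float) hi * d;
    }
}
```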

Supported Operations

FastRPC Custom Kernels

  • Matrix Multiplication (mul_mat): Custom HVX implementation with 4-thread parallelization for transformer workloads (a row-partitioning sketch follows this list)
  • Element-wise Operations: Add, multiply operations with broadcasting support using direct HVX intrinsics
  • RMS Normalization: Hand-optimized kernels for layer normalization operations common in modern LLMs
  • Graph-level Execution: Entire computation graphs executed on NPU to minimize CPU-NPU memory transfers and maximize efficiency
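
To illustrate the 4-thread parallelization, the sketch below shows one straightforward way to split mul_mat output rows into contiguous chunks, one per hardware thread. The job structure and helper are hypothetical; the actual work division lives in the backend's QURT-based thread pool:

```cpp
// Sketch of splitting mul_mat output rows across the NPU's 4 hardware
// threads. Hypothetical types; the real backend dispatches through
// thread_pool.hpp.
constexpr int kThreads = 4;  // Hexagon NPU hardware thread count

struct mul_mat_job {
    int row_begin;  // first output row this thread computes
    int row_end;    // one past the last output row
};

// Evenly partition `rows` output rows into up to kThreads contiguous
// chunks; the first (rows % kThreads) chunks get one extra row.
inline int partition_rows(int rows, mul_mat_job (&jobs)[kThreads]) {
    const int per    = rows / kThreads;
    const int extra  = rows % kThreads;
    int       cursor = 0;
    int       used   = 0;
    for (int t = 0; t < kThreads && cursor < rows; ++t) {
        const int take = per + (t < extra ? 1 : 0);
        jobs[t] = { cursor, cursor + take };
        cursor += take;
        ++used;
    }
    return used;  // number of jobs actually issued
}
```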

Platform Support

  • Android: Primary target platform with Snapdragon SoCs featuring Hexagon NPU (8 Gen 1+, 8cx Gen 3+)
  • Linux: Development and testing support for Hexagon-enabled platforms
  • Windows on ARM: Cross-platform compatibility for Snapdragon-based Windows devices

Key Advantages

  • Minimal Overhead: The FastRPC implementation provides direct hardware access through a very thin abstraction layer
  • Maximum Hardware Utilization: Custom implementation leverages all 4 hardware threads and HVX units on the Hexagon NPU
  • Memory Bandwidth Optimization: Graph-level execution and zero-copy transfers reduce bottlenecks between CPU and NPU
  • Power Efficiency: Direct NPU execution typically provides 2-5x better performance-per-watt compared to CPU execution
  • Scalable Architecture: Designed to efficiently handle both small and large model workloads

Technical Implementation

Based on FastRPC Framework

Host Side (ggml/src/ggml-qnn/npu/host/)

  • Host Device Management: host_device.cpp - NPU device interface and lifecycle management
  • Host Graph Coordination: graph.cpp - graph creation, caching, and execution coordination
  • Buffer Management: buffer.cpp - RPC memory buffers, ION allocation, and zero-copy data transfer
  • Type Conversion Utilities: Efficient host-device data format conversion and RPC interface helpers

Device Side (ggml/src/ggml-qnn/npu/device/)

  • Device Runtime: device.cpp - core NPU-side runtime executing on Hexagon hardware
  • Graph Execution Engine: graph.cpp - NPU-side graph computation with 4-thread parallelization
  • HVX Operation Kernels: op_impl.cpp - hand-optimized HVX intrinsic implementations
  • Quantization Kernels: quants.cpp - HVX-optimized dequantization for Q4_0, Q8_0, Q4_K
  • Thread Management: thread_pool.hpp - custom 4-thread pool using QURT primitives
  • Memory Management: vtcm_mem.hpp - VTCM allocation with RAII semantics (see the sketch after this list)
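
The RAII idea behind vtcm_mem.hpp can be sketched with the Hexagon SDK's VTCM manager API (HAP_request_VTCM/HAP_release_VTCM). This is a simplified illustration, not the backend's actual class:

```cpp
// Simplified RAII wrapper over VTCM, in the spirit of vtcm_mem.hpp.
// HAP_request_VTCM/HAP_release_VTCM are the Hexagon SDK's VTCM manager
// API; error handling is reduced to a null check.
extern "C" {
#include "HAP_vtcm_mgr.h"  // Hexagon SDK VTCM manager
}

class vtcm_block {
  public:
    explicit vtcm_block(unsigned int size)
        : _ptr(HAP_request_VTCM(size, /*single_page_flag=*/0)) {}

    ~vtcm_block() {
        if (_ptr) {
            HAP_release_VTCM(_ptr);  // returned to the VTCM pool on scope exit
        }
    }

    // VTCM is a scarce per-NPU resource; forbid accidental copies.
    vtcm_block(const vtcm_block &)             = delete;
    vtcm_block & operator=(const vtcm_block &) = delete;

    void * get() const { return _ptr; }
    bool   valid() const { return _ptr != nullptr; }

  private:
    void * _ptr;
};
```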

FastRPC Interface (ggml/src/ggml-qnn/npu/idl/)

  • Interface Definition: hexagon_npu.idl - defines the RPC contract between host CPU and Hexagon NPU (a session-setup sketch follows)
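
At build time, qaic compiles the .idl file into a host-side stub and a device-side skeleton. Assuming the interface is declared as hexagon_npu and derives from remote_handle64, qaic would generate open/close entry points like those used below; the URI macro and domain suffix are illustrative:

```cpp
// Hypothetical host-side session setup through the qaic-generated stub.
// For a remote_handle64-based interface named hexagon_npu, qaic emits
// hexagon_npu_open()/hexagon_npu_close() and a <name>_URI macro; the exact
// names depend on the interface declaration in hexagon_npu.idl.
#include <cstdio>

extern "C" {
#include "hexagon_npu.h"  // generated by qaic from hexagon_npu.idl
}

int open_npu_session() {
    remote_handle64 handle = 0;

    // Append a domain suffix to target the compute DSP (cDSP), where the
    // NPU-side skeleton library is loaded.
    int err = hexagon_npu_open(hexagon_npu_URI "&_dom=cdsp", &handle);
    if (err != 0) {
        fprintf(stderr, "failed to open NPU session: 0x%x\n", err);
        return err;
    }

    // ... issue RPC calls through `handle` ...

    return hexagon_npu_close(handle);
}
```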

Benchmark Results

We conducted performance testing to evaluate the Hexagon NPU FastRPC backend against CPU-only execution. The benchmarks focus on large matrix multiplications, a critical compute bottleneck in transformer-based models.

Testing Methodology

We extended the test-backend-ops tool to include large matrix multiplication scenarios that represent typical LLM inference patterns:

```diff
diff --git tests/test-backend-ops.cpp tests/test-backend-ops.cpp
index 9ec24d9f23c5bc93b1b1e98e890e1186632358f7..584150154eee761f2d300504c525d38265fe3eb0 100644
--- tests/test-backend-ops.cpp
+++ tests/test-backend-ops.cpp
@@ -4239,6 +4239,8 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16,  1, 1024, {3, 2}, {1, 1}));
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16,  8, 1024, {3, 2}, {1, 1}));
             test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16, 16, 1024, {3, 2}, {1, 1}));
+            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 8192, 1, 8192, {1, 1}, {1, 1}));
+            test_cases.emplace_back(new test_mul_mat(type_a, type_b, 16384, 1, 16384, {1, 1}, {1, 1}));
         }
     }
     for (ggml_type type_a : other_types) {
```

Performance Results

The following table compares execution time in microseconds (lower is better) across precision formats for a 16384×16384 matrix multiplied by a 16384×1 vector on a Snapdragon 8 Gen 2 device:

| device | commit | type | src0 dim | src1 dim | cpu_time (μs) | host_total (μs) | host_param_update (μs) | device_total (μs) | device_dequant (μs) | device_compute (μs) |
|--------|--------|------|----------|----------|---------------|-----------------|------------------------|-------------------|---------------------|---------------------|
| 8gen2 | 8409dd1e9 | F32 | 16384x16384 | 16384x1 | 38935 | 285529 | 296 | 89518 | 0 | 89518 |
| 8gen2 | 8409dd1e9 | Q8_0 | 16384x16384 | 16384x1 | 8930 | 327385 | 774 | 255894 | 245178 | 10716 |
| 8gen2 | 8409dd1e9 | Q4_0 | 16384x16384 | 16384x1 | 12503 | 143390 | 735 | 96932 | 86927 | 10005 |

Key Observations

The benchmark results reveal several important insights about the Hexagon NPU FastRPC implementation:

  • Dequantization Bottleneck:
    • For quantized formats, 90-96% of NPU time is spent on dequantization (Q8_0: 245178/255894 ≈ 96%; Q4_0: 86927/96932 ≈ 90%), making this our primary optimization target
  • Compute Efficiency:
    • Once data resides in VTCM, the NPU completes the matrix multiplication itself in roughly 10,000 μs
    • This represents a ~4× improvement over CPU FP32 performance when comparing pure computation time
    • However, overall NPU performance is currently limited by memory transfers and dequantization overhead
  • Memory Access Patterns:
    • Pure F32 computation on the NPU (~90,000 μs) is slower than on the CPU (~39,000 μs) due to memory access patterns
    • The large gap between F32 and post-dequantization Q4_0/Q8_0 execution suggests optimization potential through better VTCM utilization
  • Communication Overhead:
    • The host_param_update time (roughly 300-800 μs in these runs) represents FastRPC communication overhead
    • While negligible for large operations, this overhead becomes proportionally significant for smaller tensor operations
    • Batching operations into larger computation graphs would help amortize these costs

Future Developments

  • Extended Operation Coverage: Additional custom HVX implementations for more GGML operations (attention, softmax, etc.)
  • Dynamic Thread Scheduling: Runtime load balancing and work-stealing across the 4 hardware threads
  • Performance Profiling Suite: Custom profiling tools for FastRPC execution paths and bottleneck analysis
  • Advanced Graph Fusion: Sophisticated operation fusion techniques to minimize memory transfers and maximize NPU utilization