Notes on ONNX RPi conversion for Hailo-8 chip fun


# Deploying Lightweight NLP Models on Hailo-8 (Raspberry Pi 5 AI Kit)

## Candidate Pretrained NLP Models for Hailo-8

To fit the constraints of the Hailo-8 accelerator, we should choose compact transformer models that have far fewer parameters than BERT-base (110M) and can be quantized to 8-bit. Suitable candidates, all of which can be exported to ONNX for Hailo compilation, include DistilBERT (~66M parameters), MiniLM/XtremeDistil variants such as microsoft/xtremedistil-l6-h384-uncased (~22M), TinyBERT (~15M for the 4-layer variant), and ALBERT-base (~12M).

Other options: MobileBERT (~25M) or ELECTRA-Small (~14M) are also viable if an ONNX export is available. All of these models support ONNX export (via Hugging Face Transformers or the ONNX Model Zoo), which is the first step toward running them on Hailo-8.

## Converting Models to Hailo Format (HEF)

To run these models on Hailo-8, convert the pretrained model to a Hailo Executable Format (.hef) using the Hailo Dataflow Compiler. The high-level steps are:

  1. Obtain an ONNX model: Download a pretrained ONNX file if one is available (the Hugging Face Hub often hosts ONNX versions of models), or export the model yourself using transformers.onnx, optimum, or torch.onnx.export, as in the sketch below. Ensure the ONNX file contains the model’s full graph (for a fine-tuned model, from input IDs to output logits).
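
     For example, a minimal export sketch using torch.onnx.export (the checkpoint name and max_length=128 are illustrative assumptions; adjust to your model). The Hailo compiler needs static shapes, so export with a fixed sequence length and no dynamic axes:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
    model = AutoModelForSequenceClassification.from_pretrained(name).eval()
    tokenizer = AutoTokenizer.from_pretrained(name)
    # Fixed-length dummy input: Hailo compilation requires static shapes
    dummy = tokenizer("example input", return_tensors="pt",
                      padding="max_length", max_length=128)
    torch.onnx.export(
        model,
        (dummy["input_ids"], dummy["attention_mask"]),
        "model.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["logits"],
        opset_version=14,
    )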

  2. (Optional) Simplify the ONNX: It’s recommended to run [onnx-simplifier](https://github.com/daquexian/onnx-simplifier) on the model before compilation ([Dataflow compiler best practice - General - Hailo Community](https://community.hailo.ai/t/dataflow-compiler-best-practice/62#:~:text=Parsing%20)). This removes redundant ops and resolves dynamic shapes, which helps the Hailo compiler. For example:

    pip install onnx-simplifier
    python -m onnxsim model.onnx model_simplified.onnx
    
  3. Set up Hailo Dataflow Compiler (DFC): Model compilation must be done on a PC, not on the Pi (the process is compute-intensive) ([Onnx -> hef conversion help - General - Hailo Community](https://community.hailo.ai/t/onnx-hef-conversion-help/6453#:~:text=Please%20download%20the%20Hailo%20AI,platform%20like%20the%20Raspberry%20Pi)). Install the Hailo AI SDK or use the Docker image from Hailo Developer Zone (which includes the DFC). Activate the Hailo DFC environment (e.g. via the provided virtualenv or Docker). You can launch hailo tutorial inside the environment to open Jupyter notebooks with step-by-step guides ([Custom ONNX models on H8L Raspberry - General - Hailo Community](https://community.hailo.ai/t/custom-onnx-models-on-h8l-raspberry/1368#:~:text=,ONNX%20to%20HEF%20for%20inference)) if needed.

  4. Prepare a calibration dataset: Hailo-8 uses post-training quantization (INT-8/INT-4). Gather a small representative sample of input data (e.g. a list of typical sentences) to use for calibration. For an NLP model, this could be a set of tokenized inputs (numpy arrays of shape [batch, seq_len] for input IDs, etc.). Real data is best for calibration ([Dataflow compiler best practice - General - Hailo Community](https://community.hailo.ai/t/dataflow-compiler-best-practice/62#:~:text=Optimization%5Cquantization%20,Compilation)); if unavailable, you can use --use-random-calib-set for a quick test, but expect lower accuracy. Save your calibration inputs as a NumPy .npy file if using the CLI.
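
     For instance, a sketch of building calib_data.npy from a few real sentences (the exact array layout the optimizer expects, especially for multi-input networks, can vary by DFC version; check the documentation):

    import numpy as np
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
    sentences = [
        "This product works great.",
        "Terrible experience, would not recommend.",
        "The delivery was fast and the packaging was intact.",
    ]  # use a few hundred representative samples in practice
    enc = tokenizer(sentences, return_tensors="np",
                    padding="max_length", max_length=128, truncation=True)
    # Save the token IDs as the calibration set (float32 to match the parsed inputs)
    np.save("calib_data.npy", enc["input_ids"].astype(np.float32))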

  5. Parse the ONNX to Hailo format: Use the Hailo parser to convert ONNX into a Hailo Archive (HAR) file. For example, in the DFC environment run:

    hailo parser onnx model_simplified.onnx -o model.har
    

    This produces model.har (an intermediate Hailo model representation).

  6. Optimize (Quantize) the HAR: Quantize and optimize the model for Hailo-8. For example:

    hailo optimize model.har --hw-arch hailo8 \
        --calib-set-path calib_data.npy \
        --output-har-path model_q.har
    

    This uses the calibration set to quantize weights/activations (default 8-bit). Replace hailo8 with hailo8l if you have the 8L variant (the Raspberry Pi HAT uses Hailo-8L). If you don’t have a calibration file, you can add --use-random-calib-set to use random data for calibration ([Convert .onxx to .hef using the CLI - General - Hailo Community](https://community.hailo.ai/t/convert-onxx-to-hef-using-the-cli/9106#:~:text=%24,set)) (not ideal for final accuracy, but acceptable to test the flow).

  7. Compile to HEF: Finally, compile the quantized HAR for the target hardware:

    hailo compiler model_q.har --hw-arch hailo8 -o model.hef
    

     This produces model.hef, which is the binary that can be loaded onto the Hailo-8. (Use --hw-arch hailo8l if appropriate ([Hailo AI Platform Guide - Hailo-8 AI Software Suite Commands](https://developer.ridgerun.com/wiki/index.php/Hailo/Hailo-8/AI_Software_and_Tools/Hailo_Commands#:~:text=%2A%20,path%20to%20save%20the%20auto)).) If the model is very large or the sequence length is high, the compiler may fail (e.g. “No valid partition found”); in that case, reduce the sequence length or model size and retry.
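
Alternatively, the same parse → optimize → compile flow can be scripted in Python inside the DFC environment using the hailo_sdk_client API. A rough sketch (method signatures vary between DFC releases; the notebooks opened by hailo tutorial show the exact usage for your version):

    from hailo_sdk_client import ClientRunner
    import numpy as np

    runner = ClientRunner(hw_arch="hailo8")  # or "hailo8l" for the 8L variant
    # Parse: translate the ONNX graph into Hailo's internal representation
    runner.translate_onnx_model("model_simplified.onnx", "distilbert")
    # Optimize: post-training quantization with the calibration set
    runner.optimize(np.load("calib_data.npy"))
    # Compile: produce the HEF binary and write it to disk
    hef_binary = runner.compile()
    with open("model.hef", "wb") as f:
        f.write(hef_binary)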

Note: Hailo also provides a Model Zoo and SDK that can automate some steps. For instance, hailomz compile ... can parse/quantize/compile in one go, but it typically expects a known model or a custom YAML/ALLS script for your network. For generic models, using the raw hailo CLI as above (parse → optimize → compile) is straightforward ([Hailo-8 model conversion - CLI - General - Hailo Community](https://community.hailo.ai/t/hailo-8-model-conversion-cli/12004#:~:text=Welcome%20to%20the%20Hailo%20Community%21)) ([Hailo-8 model conversion - CLI - General - Hailo Community](https://community.hailo.ai/t/hailo-8-model-conversion-cli/12004#:~:text=,hef%20using%20the%20CLI%20tools)). Ensure all model layers are supported by Hailo-8 (standard Transformer ops like MatMul, Add, LayerNorm, etc. are supported, but some ONNX export quirks like certain activation or reshape ops may need adjustment).

## Running Inference on Hailo-8 with HailoRT

With the compiled model.hef, you can perform inference on the Raspberry Pi 5 + Hailo-8 using the HailoRT runtime API. The hailort-4.21.0 package (Python wheel for aarch64) provides the hailo_platform Python API. Below is a minimal Python example showing how to load a HEF and run inference on text input:

    import numpy as np
    import hailo_platform as hp

    # 1. Open the HEF and configure the Hailo-8 device
    hef = hp.HEF("model.hef")  # compiled model file
    with hp.VDevice() as device:
        # Configure the device with this HEF (PCIe interface for the Hailo-8 on the Pi)
        cfg = hp.ConfigureParams.create_from_hef(hef, interface=hp.HailoStreamInterface.PCIe)
        network_group = device.configure(hef, cfg)[0]
        network_params = network_group.create_params()

        # 2. Prepare input data (example: sentiment analysis with DistilBERT)
        # Assume the model has two input vstreams (input_ids, attention_mask) and one output (logits)
        input_infos = hef.get_input_vstream_infos()
        output_infos = hef.get_output_vstream_infos()
        # Example text
        text = "Hailo-8 is incredibly efficient!"
        # Tokenize the text to get input IDs and an attention mask (tokenizer omitted for brevity).
        # For demonstration, we use dummy arrays of the appropriate shape:
        max_len = 128
        input_ids = np.zeros((1, max_len), dtype=np.int32)      # padded token IDs
        attention_mask = np.ones((1, max_len), dtype=np.int32)  # 1s for real tokens, 0s for padding
        # (In practice, use a Hugging Face tokenizer:
        #   tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
        #   enc = tokenizer(text, return_tensors='np', padding='max_length', max_length=128)
        #   input_ids = enc["input_ids"]; attention_mask = enc["attention_mask"] )

        # 3. Run inference on Hailo-8
        with network_group.activate(network_params):
            # Use float32 vstreams; HailoRT quantizes/dequantizes internally using the calibration scales
            in_params = hp.InputVStreamParams.make_from_network_group(
                network_group, quantized=False, format_type=hp.FormatType.FLOAT32)
            out_params = hp.OutputVStreamParams.make_from_network_group(
                network_group, quantized=False, format_type=hp.FormatType.FLOAT32)
            with hp.InferVStreams(network_group, in_params, out_params) as infer_pipeline:
                # Build the input dictionary keyed by the vstream names from the HEF:
                inputs = {
                    input_infos[0].name: input_ids.astype(np.float32),
                    input_infos[1].name: attention_mask.astype(np.float32)
                }
                results = infer_pipeline.infer(inputs)
                output = results[output_infos[0].name]
                print("Model output:", output)

In this example, we prepare dummy input_ids and attention_mask for a DistilBERT-like model expecting two inputs. In practice, you’d use a tokenizer to produce these from real text. The HailoRT API then:

  1. Loads the HEF and configures a virtual device (VDevice) which manages the Hailo-8.
  2. Activates a network group (the compiled model).
  3. Creates input/output vstreams with quantized=False and FormatType.FLOAT32 so that you can feed normal NumPy float32/int32 arrays; HailoRT will handle quantization internally based on the calibration scales ([HailoRT minimal working example for Python and Hailo8 - Guides - Hailo Community](https://community.hailo.ai/t/hailort-minimal-working-example-for-python-and-hailo8/7685#:~:text=input_vstreams_params%20%3D%20hpf,FLOAT32)) ([HailoRT minimal working example for Python and Hailo8 - Guides - Hailo Community](https://community.hailo.ai/t/hailort-minimal-working-example-for-python-and-hailo8/7685#:~:text=for%20_%20in%20range,Inference%20output%3A%20%7Boutput_data)).
  4. Runs infer with a dictionary of input name to data. This returns a dictionary of output name to output tensor.

For a real test, replace the dummy data with actual tokenized input. For example, using Hugging Face on the Pi to tokenize:

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
    enc = tokenizer(text, return_tensors='np', padding='max_length', max_length=128, truncation=True)
    inputs = {
        input_infos[0].name: enc["input_ids"].astype(np.float32),
        input_infos[1].name: enc["attention_mask"].astype(np.float32)
    }
    results = infer_pipeline.infer(inputs)
    logits = results[output_infos[0].name]
    pred_label = np.argmax(logits)  # e.g. 1 = positive sentiment

This would output the model’s logits for the example sentence, and you can interpret the result (e.g., index 1 = “positive”).
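
If you compiled a sentence-embedding model such as MiniLM rather than a classifier, the output vstream holds per-token embeddings instead of class logits. Reusing enc and results from the snippets above, a short sketch of the standard mean-pooling step (assuming an output of shape [1, seq_len, hidden]):

    # Mean-pool token embeddings into one sentence vector, ignoring padded positions
    token_embs = results[output_infos[0].name]                  # assumed shape [1, seq_len, hidden]
    mask = enc["attention_mask"][..., None].astype(np.float32)  # [1, seq_len, 1]
    sentence_emb = (token_embs * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)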

C++/Java: The HailoRT library also provides a C/C++ API (the Python hailo_platform module is a wrapper around it). In C++ the steps are the same: open the HEF, configure the device, create input/output vstreams, and run inference through the pipeline. Hailo’s SDK ships C++ sample code for loading a HEF and running inference, and the logic mirrors the Python example above. There is no official Java API, but you can integrate via gRPC or JNI, for example by running a Python/C++ inference service and calling it from Java. Since the plan is to move to gRPC eventually, you can wrap the above inference code in a server that accepts text and returns the inference results; a skeleton sketch follows.
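
As a starting point, here is a hypothetical Python gRPC skeleton. It assumes a proto defining an Nlp service with a Classify RPC, compiled to nlp_pb2/nlp_pb2_grpc via grpcio-tools, plus a run_hailo_inference() helper wrapping the HailoRT code above; none of these names come from Hailo’s SDK:

    from concurrent import futures
    import numpy as np
    import grpc
    import nlp_pb2, nlp_pb2_grpc  # hypothetical modules generated by grpcio-tools

    class NlpService(nlp_pb2_grpc.NlpServicer):
        def Classify(self, request, context):
            # run_hailo_inference() is a hypothetical helper wrapping the HailoRT code above
            logits = run_hailo_inference(request.text)
            return nlp_pb2.ClassifyReply(label=int(np.argmax(logits)))

    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    nlp_pb2_grpc.add_NlpServicer_to_server(NlpService(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()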

## Additional Setup Notes and Verification

By following these steps, you will be able to deploy sentence embeddings, NER taggers, sentiment classifiers, or other text models on the Hailo-8. Start with a simple model like DistilBERT with a small sequence length to validate the flow, then iterate as needed. With the model running standalone on the Hailo-8, you’ll be well-prepared to integrate it behind a gRPC service in the next stage. Good luck with your Hailo-8 NLP deployment!

Sources: The Hailo Community forums and documentation were referenced for best practices on model conversion and HailoRT usage ([Dataflow compiler best practice - General - Hailo Community](https://community.hailo.ai/t/dataflow-compiler-best-practice/62#:~:text=Parsing%20)) ([Convert .onxx to .hef using the CLI - General - Hailo Community](https://community.hailo.ai/t/convert-onxx-to-hef-using-the-cli/9106#:~:text=I%20was%20already%20able%20to,I%20used%20were%20the%20following)) ([HailoRT minimal working example for Python and Hailo8 - Guides - Hailo Community](https://community.hailo.ai/t/hailort-minimal-working-example-for-python-and-hailo8/7685#:~:text=,np%20import%20hailo_platform%20as%20hpf)); model information comes from Hugging Face and the research papers for DistilBERT, MiniLM, TinyBERT, and ALBERT ([[1910.01108] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://ar5iv.labs.arxiv.org/html/1910.01108#:~:text=ELMo%20180%20895%20BERT,668%20DistilBERT%2066%20410)) ([microsoft/xtremedistil-l6-h384-uncased · Hugging Face](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased#:~:text=This%20l6,base)) ([Google Open-Sources ALBERT Natural Language Model - InfoQ](https://www.infoq.com/news/2020/01/google-albert-ai-nlp/#:~:text=uses%20two%20optimizations%20to%20reduce,BERT%20model%20with%20334M%20parameters)).