# Notes on ONNX → RPi conversion for Hailo-8 chip fun
## Deploying Lightweight NLP Models on Hailo-8 (Raspberry Pi 5 AI Kit)
### Candidate Pretrained NLP Models for Hailo-8
To fit the constraints of the Hailo-8 accelerator, we should choose compact transformer models that have far fewer parameters than BERT-base (110M) and can be quantized to 8-bit. The following are suitable candidates (all can be exported to ONNX for Hailo compilation):
- DistilBERT – A 6-layer distillation of BERT-base with ~66 million parameters (40% of BERT-base) and ~60% faster inference while retaining ~97% of BERT’s accuracy ([Sanh et al., 2019](https://ar5iv.labs.arxiv.org/html/1910.01108)). DistilBERT can be fine-tuned for classification, NER, etc., and ONNX versions are available (e.g. Hugging Face provides ONNX exports for DistilBERT).
- MiniLM (e.g. MiniLM-L6-H384) – A 6-layer, 384-dimensional mini model (~22M parameters) from Microsoft that achieves about a 5.3× speedup over BERT-base ([microsoft/xtremedistil-l6-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased)). For example, all-MiniLM-L6-v2 (22M params) is popular for sentence embeddings ([Milvus AI Quick Reference](https://milvus.io/ai-quick-reference/how-do-sentence-transformers-relate-to-large-language-models-like-gpt-and-are-sentence-transformer-models-typically-smaller-or-more-specialized)). MiniLM models can also be fine-tuned for classification or NER.
- TinyBERT – An extremely compact BERT built via knowledge distillation. The smallest variant (4-layer) has only ~14 million parameters, making it ideal for edge deployment with some accuracy trade-off. TinyBERT can be used for tasks like sentiment or NER (with fine-tuning) in resource-constrained settings.
- ALBERT-Base – A parameter-sharing “lite BERT” with 12 layers but only ~12M parameters (an 89% reduction vs BERT) ([InfoQ](https://www.infoq.com/news/2020/01/google-albert-ai-nlp/)). ALBERT uses factorized embeddings and cross-layer parameter sharing to drastically cut memory while still performing well on classification tasks.
Other options: MobileBERT (~25M) or ELECTRA-Small (~14M) are also viable if an ONNX export is available. All these models are compatible with ONNX export (via Hugging Face Transformers or ONNX Model Zoo), which is the first step toward running them on Hailo-8.
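Since ONNX export is the entry point to everything that follows, here is a minimal export sketch using `torch.onnx.export`; the checkpoint name, 128-token sequence length, and output path are illustrative, not prescribed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.config.return_dict = False  # export a plain tuple of outputs
model.eval()

# Fixed input shapes help the Hailo toolchain; 128 tokens is an example budget.
enc = tokenizer("example input", return_tensors="pt",
                padding="max_length", max_length=128, truncation=True)

torch.onnx.export(
    model,
    (enc["input_ids"], enc["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    opset_version=13,
)
```

Not passing `dynamic_axes` keeps every dimension static, which is what the Hailo parser prefers.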
### Converting Models to Hailo Format (HEF)
To run these models on Hailo-8, convert the pretrained model to a Hailo Executable Format (`.hef`) file using the Hailo Dataflow Compiler. The high-level steps are:
1. Obtain an ONNX model: Download a pretrained ONNX if available (the Hugging Face Hub often hosts ONNX versions of models), or export the model yourself with `transformers.onnx`, `optimum`, or `torch.onnx.export` (as sketched above for DistilBERT). Ensure the ONNX includes the model’s full graph (for a fine-tuned model, from input IDs to output logits).
2. (Optional) Simplify the ONNX: It’s recommended to run [onnx-simplifier](https://github.com/daquexian/onnx-simplifier) on the model before compilation ([Dataflow compiler best practice](https://community.hailo.ai/t/dataflow-compiler-best-practice/62)). This removes redundant ops and resolves dynamic shapes, which helps the Hailo compiler. For example:

   ```bash
   pip install onnx-simplifier
   python -m onnxsim model.onnx model_simplified.onnx
   ```
3. Set up the Hailo Dataflow Compiler (DFC): Model compilation must be done on a PC, not on the Pi, since the process is compute-intensive ([Onnx -> hef conversion help](https://community.hailo.ai/t/onnx-hef-conversion-help/6453)). Install the Hailo AI SDK or use the Docker image from the Hailo Developer Zone (which includes the DFC), then activate the DFC environment (via the provided virtualenv or Docker). If needed, launch `hailo tutorial` inside the environment to open Jupyter notebooks with step-by-step guides ([Custom ONNX models on H8L Raspberry](https://community.hailo.ai/t/custom-onnx-models-on-h8l-raspberry/1368)).
4. Prepare a calibration dataset: Hailo-8 uses post-training quantization (INT8/INT4). Gather a small representative sample of input data (e.g. a list of typical sentences) to use for calibration. For an NLP model this could be a set of tokenized inputs (NumPy arrays of shape `[batch, seq_len]` for input IDs, etc.). Real data is best for calibration ([Dataflow compiler best practice](https://community.hailo.ai/t/dataflow-compiler-best-practice/62)); if none is available you can use `--use-random-calib-set` for a quick test, but expect lower accuracy. Save your calibration inputs as a NumPy `.npy` file if using the CLI (see the calibration sketch after these steps).
5. Parse the ONNX to Hailo format: Use the Hailo parser to convert the ONNX into a Hailo Archive (HAR) file. For example, in the DFC environment run:

   ```bash
   hailo parser onnx model_simplified.onnx -o model.har
   ```

   This produces `model.har`, an intermediate Hailo model representation.
6. Optimize (quantize) the HAR: Quantize and optimize the model for Hailo-8. For example:

   ```bash
   hailo optimize model.har --hw-arch hailo8 \
       --calib-set-path calib_data.npy \
       --output-har-path model_q.har
   ```

   This uses the calibration set to quantize weights and activations (8-bit by default). Replace `hailo8` with `hailo8l` if you have the 8L variant (the Raspberry Pi AI Kit HAT uses Hailo-8L). If you don’t have a calibration file, add `--use-random-calib-set` to calibrate on random data ([Convert .onnx to .hef using the CLI](https://community.hailo.ai/t/convert-onxx-to-hef-using-the-cli/9106)); that’s not ideal for final accuracy, but acceptable to test the flow.
7. Compile to HEF: Finally, compile the quantized HAR for the target hardware:

   ```bash
   hailo compiler model_q.har --hw-arch hailo8 -o model.hef
   ```

   This produces `model.hef`, the binary that can be loaded onto the Hailo-8 (again, use `--hw-arch hailo8l` if appropriate ([Hailo AI Platform Guide](https://developer.ridgerun.com/wiki/index.php/Hailo/Hailo-8/AI_Software_and_Tools/Hailo_Commands))). If the model is very large or the sequence length is high, the compiler may fail (e.g. “No valid partition found”); in that case, reduce the sequence length or model size and retry.
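The calibration `.npy` referenced in step 4 can be built from a handful of real sentences. Below is a minimal sketch; the checkpoint name, file names, 128-token length, and single-input layout are assumptions, so match the array shape and dtype to your model’s input vstreams (multi-input models may need one calibration array per input):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# A few dozen sentences drawn from your real traffic work best for calibration;
# three are shown here only to keep the sketch short.
sentences = [
    "This product works great.",
    "Terrible support experience.",
    "The delivery arrived on time.",
]

enc = tokenizer(sentences, return_tensors="np",
                padding="max_length", max_length=128, truncation=True)

# Save token IDs as a [num_samples, seq_len] float32 array for the CLI,
# matching the float input convention used elsewhere in this guide.
np.save("calib_data.npy", enc["input_ids"].astype(np.float32))
```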
Note: Hailo also provides a Model Zoo and SDK that can automate some of these steps. For instance, `hailomz compile ...` can parse, quantize, and compile in one go, but it typically expects a known model or a custom YAML/ALLS script for your network. For generic models, using the raw `hailo` CLI as above (parse → optimize → compile) is straightforward ([Hailo-8 model conversion - CLI](https://community.hailo.ai/t/hailo-8-model-conversion-cli/12004)). Ensure all model layers are supported by Hailo-8: standard Transformer ops like MatMul, Add, and LayerNorm are supported, but some ONNX export quirks (certain activation or reshape ops) may need adjustment. A quick way to see which ops your export actually contains is sketched below.
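This small sketch inventories the op types in an exported model using the `onnx` package (the file name is an example); any unusual op that shows up is a candidate for adjustment before parsing:

```python
from collections import Counter

import onnx

# Count every op type in the exported graph so unsupported or exotic
# ops stand out before you hand the model to the Hailo parser.
model = onnx.load("model_simplified.onnx")
ops = Counter(node.op_type for node in model.graph.node)
for op_type, count in sorted(ops.items()):
    print(f"{op_type}: {count}")
```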
### Running Inference on Hailo-8 with HailoRT
With the compiled `model.hef`, you can perform inference on the Raspberry Pi 5 + Hailo-8 using the HailoRT runtime API. You’ve already installed `hailort-4.21.0` (the Python wheel for aarch64), which provides the `hailo_platform` Python API. Below is a minimal Python example showing how to load a HEF and run inference on text input:
```python
import numpy as np
import hailo_platform as hp

# 1. Open the HEF and configure the Hailo-8 device
hef = hp.HEF("model.hef")  # compiled model file
with hp.VDevice() as device:
    # Configure the device with this HEF (PCIe interface for the Hailo-8 on RPi)
    cfg = hp.ConfigureParams.create_from_hef(hef, interface=hp.HailoStreamInterface.PCIe)
    network_group = device.configure(hef, cfg)[0]
    network_params = network_group.create_params()

    # 2. Prepare input data (example: sentiment analysis with DistilBERT)
    # Assume two input vstreams (input_ids, attention_mask) and one output (logits)
    input_infos = hef.get_input_vstream_infos()
    output_infos = hef.get_output_vstream_infos()

    # Example dummy text
    text = "Hailo-8 is incredibly efficient!"
    # Tokenize the text to get input IDs and attention mask (tokenizer not shown for brevity)
    # For demonstration, we'll use dummy arrays of appropriate shape:
    max_len = 128
    input_ids = np.zeros((1, max_len), dtype=np.int32)       # e.g. padded token IDs
    attention_mask = np.zeros((1, max_len), dtype=np.int32)  # e.g. 1s for tokens, 0s for padding
    # (In practice, use a Hugging Face tokenizer:
    #  tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
    #  enc = tokenizer(text, return_tensors='np', padding='max_length', max_length=128)
    #  input_ids = enc["input_ids"]; attention_mask = enc["attention_mask"])

    # 3. Run inference on Hailo-8
    with network_group.activate(network_params):
        # Set up input/output streams as float32; HailoRT quantizes/dequantizes internally
        in_params = hp.InputVStreamParams.make_from_network_group(
            network_group, quantized=False, format_type=hp.FormatType.FLOAT32)
        out_params = hp.OutputVStreamParams.make_from_network_group(
            network_group, quantized=False, format_type=hp.FormatType.FLOAT32)
        with hp.InferVStreams(network_group, in_params, out_params) as infer_pipeline:
            # Map vstream names from the HEF to input arrays:
            inputs = {
                input_infos[0].name: input_ids.astype(np.float32),
                input_infos[1].name: attention_mask.astype(np.float32),
            }
            results = infer_pipeline.infer(inputs)
            output = results[output_infos[0].name]
            print("Model output:", output)
```
In this example, we prepare dummy `input_ids` and `attention_mask` for a DistilBERT-like model expecting two inputs; in practice, you’d use a tokenizer to produce these from real text. The HailoRT API then:

- Loads the HEF and configures a virtual device (`VDevice`) which manages the Hailo-8.
- Activates a network group (the compiled model).
- Creates input/output vstreams with `quantized=False` and `FormatType.FLOAT32`, so you can feed normal NumPy float32/int32 arrays; HailoRT handles quantization internally based on the calibration scales ([HailoRT minimal working example for Python and Hailo8](https://community.hailo.ai/t/hailort-minimal-working-example-for-python-and-hailo8/7685)).
- Runs `infer` with a dictionary mapping input names to data, returning a dictionary mapping output names to output tensors.
For a real test, replace the dummy data with actual tokenized input. For example, using Hugging Face on the Pi to tokenize:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
enc = tokenizer(text, return_tensors='np', padding='max_length', max_length=128, truncation=True)
inputs = {
    input_infos[0].name: enc["input_ids"].astype(np.float32),
    input_infos[1].name: enc["attention_mask"].astype(np.float32),
}
results = infer_pipeline.infer(inputs)
logits = results[output_infos[0].name]
pred_label = np.argmax(logits)  # e.g. 1 = positive sentiment
```
This would output the model’s logits for the example sentence, and you can interpret the result (e.g., index 1 = “positive”).
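To turn those logits into a readable prediction, here is a small post-processing sketch (the label order matches the SST-2 checkpoint used above; check your own model’s `id2label` mapping):

```python
import numpy as np

# `logits` comes from the HailoRT example above; for this checkpoint the
# label order is 0 = negative, 1 = positive.
scores = logits.reshape(-1)
probs = np.exp(scores - scores.max())  # numerically stable softmax
probs /= probs.sum()
labels = ["negative", "positive"]
print(f"{labels[int(probs.argmax())]} (p={probs.max():.2f})")
```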
C++/Java: The HailoRT library also provides a C/C++ API (the Python `hailo_platform` package is a wrapper around it). In C++ you follow the same steps: open the HEF, configure the device, create input/output vstreams, and run inference over them. Hailo’s SDK ships C++ sample code for loading a HEF and running inference (the logic mirrors the Python example above). There is no official Java API, but you could integrate via gRPC or JNI, for example by running a Python/C++ inference service and calling it from Java. Since your future plan is to use gRPC, you can wrap the above inference code in a server that accepts text and returns the inference results (see the sketch below).
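As a sketch of that wrapper, here is a minimal Python gRPC server. The `TextInference` service, its `inference_pb2*` modules (generated with `grpcio-tools` from a proto you define), and the `run_hailo_inference` helper are all hypothetical stand-ins for code you would supply:

```python
from concurrent import futures

import grpc
import numpy as np

# Hypothetical modules generated by grpcio-tools from a proto such as:
#   service TextInference { rpc Classify (TextRequest) returns (LogitsReply); }
#   message TextRequest { string text = 1; }
#   message LogitsReply { repeated float logits = 1; }
import inference_pb2
import inference_pb2_grpc


def run_hailo_inference(text: str) -> np.ndarray:
    """Placeholder: wrap the HailoRT tokenize-and-infer code shown above."""
    raise NotImplementedError


class TextInferenceServicer(inference_pb2_grpc.TextInferenceServicer):
    def Classify(self, request, context):
        logits = run_hailo_inference(request.text)
        return inference_pb2.LogitsReply(logits=logits.ravel().tolist())


def serve() -> None:
    # A single worker keeps requests serialized through the one Hailo-8 device.
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=1))
    inference_pb2_grpc.add_TextInferenceServicer_to_server(
        TextInferenceServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```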
### Additional Setup Notes and Verification
- HailoRT driver & firmware: Ensure the Hailo-8 is installed and its firmware loaded. On Raspberry Pi, check `dmesg` after boot or run `hailortcli scan` to verify the device is detected. The `hailortcli` tool (installed with HailoRT) also has a `run` subcommand that executes a HEF on the device as a quick test ([Custom ONNX models on H8L Raspberry](https://community.hailo.ai/t/custom-onnx-models-on-h8l-raspberry/1368)); it can use a dummy input or a provided input file.
- Hailo Model Zoo: If you installed the Hailo Model Zoo package, you can use it to run some pre-compiled example models and verify your setup. (Note: as of now the Model Zoo focuses on vision models; there may be no NLP models there.) The Model Zoo’s infrastructure (the `hailomz` command) isn’t required for custom NLP models; it simply drives the same Dataflow Compiler under the hood ([hailo-ai/hailo_model_zoo](https://github.com/hailo-ai/hailo_model_zoo)). In our workflow, we used the CLI directly.
- Quantization considerations: All models on Hailo-8 run with integer quantization (weights in int4/int8, activations in int8/uint16) ([Custom ONNX models on H8L Raspberry](https://community.hailo.ai/t/custom-onnx-models-on-h8l-raspberry/1368)). After conversion, verify accuracy on the Hailo by comparing its outputs with a CPU version of the model (see the comparison sketch after this list). Expect a slight accuracy drop from quantization; if the drop is large, provide a larger or more domain-specific calibration set and recompile (to better capture the distribution of embeddings, etc.).
- Memory limitations: The Hailo-8 (especially the 8L variant) has limited on-chip memory. Very long sequence lengths or large vocabulary embeddings may not fit easily. If you hit compilation issues, reduce the max sequence length (e.g. 128 or 256 tokens) in the ONNX before conversion, or use an even smaller model. The Hailo compiler will partition the network across the device’s resources where possible, but extremely large models may simply exceed the device’s capability ([Convert a language model from onnx to hef](https://community.hailo.ai/t/convert-a-language-model-from-onnx-to-hef/8495)).
- End-to-end test: Once you have the `.hef`, run an end-to-end test on the Pi, e.g. the Python code above with a test sentence. The printed output (an array of logits or an embedding vector) confirms that inference is running on the Hailo-8. Then post-process the output: for a classification task, pick the max logit as the predicted class; for embeddings, compare vectors via cosine similarity. This verifies the entire pipeline: ONNX → HEF → loaded on Hailo-8 → inference → result.
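For the CPU comparison mentioned in the quantization note, one approach is to run the same tokenized input through `onnxruntime` and measure the gap. A sketch, assuming `model.onnx` is the float (pre-quantization) export with the input names from the export sketch, and that `enc` and `logits` come from the HailoRT example above:

```python
import numpy as np
import onnxruntime as ort

# Run the original float ONNX model on CPU as a reference.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
cpu_logits = sess.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
})[0]

# Compare with the quantized Hailo output from the example above.
hailo_logits = np.asarray(logits)
print("max abs diff:", np.abs(cpu_logits - hailo_logits).max())
cos = np.dot(cpu_logits.ravel(), hailo_logits.ravel()) / (
    np.linalg.norm(cpu_logits) * np.linalg.norm(hailo_logits))
print("cosine similarity:", cos)
```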
By following these steps, you will be able to deploy sentence embeddings, NER taggers, sentiment classifiers, or other text models on the Hailo-8. Start with a simple model like DistilBERT with a small sequence length to validate the flow, then iterate as needed. With the model running standalone on the Hailo-8, you’ll be well-prepared to integrate it behind a gRPC service in the next stage. Good luck with your Hailo-8 NLP deployment!
Sources: Hailo Community forums and documentation for model-conversion best practices and HailoRT usage ([Dataflow compiler best practice](https://community.hailo.ai/t/dataflow-compiler-best-practice/62), [Convert .onnx to .hef using the CLI](https://community.hailo.ai/t/convert-onxx-to-hef-using-the-cli/9106), [HailoRT minimal working example](https://community.hailo.ai/t/hailort-minimal-working-example-for-python-and-hailo8/7685)); model details from Hugging Face and the research write-ups for DistilBERT ([arXiv:1910.01108](https://ar5iv.labs.arxiv.org/html/1910.01108)), MiniLM ([microsoft/xtremedistil-l6-h384-uncased](https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased)), and ALBERT ([InfoQ](https://www.infoq.com/news/2020/01/google-albert-ai-nlp/)).