ONNX segmentation model transfer documentation

This document outlines requirements and best practices for converting your semantic segmentation models into ONNX format for deployment on the Perception Box. The primary goal is to enable consistent inference using ONNX Runtime, independent of Python preprocessing pipelines.


Core Requirement: Input and Output Format

For compatibility with the Perception Box inference engine, ONNX models must adhere to the following input and output interface specifications:

  • RGB Input: A tensor of shape (H, W, 3) representing the RGB image, where H and W are camera-specific dimensions. The expected data type is uint8, with values in the range [0, 255].
  • Depth Input: A tensor of shape (H, W) representing the raw depth image in millimeters, of type float32. The format and scale are assumed to match the raw depth output from the specific depth camera used.
  • Output: A tensor of shape (H, W, C) containing per-pixel semantic class logits, where C denotes the number of semantic categories.

ONNX models that conform to this I/O signature can be deployed directly, regardless of the internal architecture or preprocessing logic used prior to export.
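
For reference, here is a minimal sketch of running a conforming model with the onnxruntime Python API; the file name and zero-filled inputs are placeholders, not Perception Box code:

import numpy as np
import onnxruntime as ort

# Placeholder inputs matching the I/O specification above.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)      # RGB image, values in [0, 255]
depth = np.zeros((480, 640), dtype=np.float32)     # raw depth in millimeters

session = ort.InferenceSession("model.onnx")
(segmentation,) = session.run(["segmentation"], {"rgb": rgb, "depth": depth})
labels = segmentation.argmax(axis=-1)              # (H, W) per-pixel class indices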


Suggested Style: torch.nn.Module Wrapper

While any method that meets the I/O spec is acceptable, an easily reproducible approach we have used is to wrap the model in a class derived from torch.nn.Module. This approach has been verified to work for:

  • Hugging Face transformer-based segmentation models
  • ESANet and its fine-tuned variants

This lets you move all preprocessing inside the model and export it as a single self-contained unit.


Wrapper Implementation Pattern

1. Define a torch.nn.Module class:

class YourONNXWrapper(nn.Module):
    def __init__(self):
        super().__init__()
        ...

2. Inside forward(), convert inputs:

  • Assume input shapes:

    • rgb: [H, W, 3]
    • depth: [H, W]
  • Normalize inputs using pure PyTorch ops:

rgb = rgb.float() / 255.0                   # uint8 [0, 255] -> float [0, 1]
rgb = (rgb - mean) / std                    # per-channel RGB normalization

depth = (depth - depth_mean) / depth_std    # depth normalization (values in mm)

rgb = rgb.permute(2, 0, 1).unsqueeze(0)     # [1, 3, H, W]
depth = depth.unsqueeze(0).unsqueeze(0)     # [1, 1, H, W]

3. Run the model and return the softmax output:

logits = self.model(rgb, depth)                           # [1, C, H, W]
probs = torch.softmax(logits, dim=1)                      # per-pixel class probabilities
return probs.squeeze(0).permute(1, 2, 0).contiguous()     # [H, W, C]
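
Putting the three steps together, a minimal end-to-end sketch of such a wrapper is shown below. The normalization constants and the wrapped network are illustrative placeholders; substitute the values and model from your own training setup:

import torch
import torch.nn as nn

class YourONNXWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model  # your trained RGB-D segmentation network
        # Store normalization constants as buffers so they are baked into the ONNX graph.
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]))    # placeholder RGB mean
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]))     # placeholder RGB std
        self.register_buffer("depth_mean", torch.tensor(2800.0))             # placeholder, in mm
        self.register_buffer("depth_std", torch.tensor(1400.0))              # placeholder, in mm

    def forward(self, rgb, depth):
        # rgb: [H, W, 3] uint8, depth: [H, W] float32 (millimeters)
        rgb = rgb.float() / 255.0
        rgb = (rgb - self.mean) / self.std
        depth = (depth - self.depth_mean) / self.depth_std

        rgb = rgb.permute(2, 0, 1).unsqueeze(0)     # [1, 3, H, W]
        depth = depth.unsqueeze(0).unsqueeze(0)     # [1, 1, H, W]

        logits = self.model(rgb, depth)             # [1, C, H, W]
        probs = torch.softmax(logits, dim=1)
        return probs.squeeze(0).permute(1, 2, 0).contiguous()  # [H, W, C]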

ONNX Export Rules

Allowed in ONNX:

  • Pure PyTorch tensor operations: +, *, view, permute, interpolate, softmax, etc.
  • Any torch.nn.functional or torch.nn.Module ops.
  • self.register_buffer(...) for storing constants like mean/std.

Not Allowed:

  • NumPy or OpenCV (.numpy(), cv2, etc.)
  • Python control flow (if, for, try) involving tensor values.
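
As a hypothetical illustration of the control-flow rule, the commented-out branch below depends on a tensor value and would be frozen into the traced graph, while the pure-tensor version exports cleanly:

import torch

def normalize_depth(depth: torch.Tensor) -> torch.Tensor:
    # Not allowed: a Python branch on a tensor value; tracing bakes in
    # whichever branch the dummy input happened to take.
    #
    #     if depth.max() > 0:
    #         depth = depth / depth.max()
    #
    # Allowed: the same intent expressed with pure tensor ops.
    return depth / depth.max().clamp(min=1.0)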

Exporting

Use:

torch.onnx.export(
    model,                              # the wrapper module
    (rgb_tensor, depth_tensor),         # dummy inputs: [H, W, 3] uint8 and [H, W] float32
    "model.onnx",
    input_names=["rgb", "depth"],
    output_names=["segmentation"],
    dynamic_axes={
        "rgb": {0: "height", 1: "width"},
        "depth": {0: "height", 1: "width"},
        "segmentation": {0: "height", 1: "width"}
    },
    opset_version=12
)

However, not all models support dynamic_axes during export. For example, ESANet uses internal control flow and hard-coded tensor operations (like F.interpolate with int(tensor.shape[i] * scale)) that depend on static sizes. These operations result in ONNX symbolic tracing failures or runtime shape mismatches when exported with dynamic input shapes.

If dynamic shape export fails:

  • Fix the input to a known shape (e.g., 480x640)
  • Remove the dynamic_axes field
  • Export with fixed dummy inputs, as in the sketch below
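
A sketch of such a fixed-shape export, assuming a 480x640 camera resolution and the wrapper module from above:

import torch

# Dummy inputs with the resolution baked into the exported graph.
rgb_tensor = torch.zeros(480, 640, 3, dtype=torch.uint8)
depth_tensor = torch.zeros(480, 640, dtype=torch.float32)

torch.onnx.export(
    model,                          # the wrapper module described above
    (rgb_tensor, depth_tensor),
    "model_fixed.onnx",
    input_names=["rgb", "depth"],
    output_names=["segmentation"],
    opset_version=12                # note: no dynamic_axes argument
)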

Examples

See working examples at:

  • onnx_model_transfer/segformer – for Hugging Face transformer models.
  • onnx_model_transfer/esanet – for RGB-D ESANet with raw depth handling.