MLC (Orin & Xavier)
- Xavier(JP5): TODO
- Orin(JP6): `corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04`
```
git clone https://github.com/corupta/jetson-containers-jp5.git && cd jetson-containers-jp5 && git checkout 9b5ebf5
LSB_RELEASE=24.04 CUDA_VERSION=12.8 PYTHON_VERSION=3.12 PYTORCH_VERSION=2.7 NUMPY_VERSION=1 jetson-containers build mlc:0.20.0
```
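A quick sanity check of the freshly built Orin image (just a sketch, assuming the tag above and that the wheels all installed correctly):

```
docker run --rm --gpus all corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
  python3 -c "import tvm, flashinfer, mlc_llm; print('tvm', tvm.__version__); print('flashinfer at', flashinfer.__file__)"
```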
- Xavier(JP5): `corupta/mlc:0.20.0-buggyfi-r35.6.1-cp312-cu124-22.04`
OK, this image has both flashinfer and MLC installed, but it can't use flashinfer fully; I'm honestly not sure why. It might be a memory issue with flashinfer, and it might be fixable at runtime if the generated plans could somehow be forced to use kernels with a lower `CTA_TILE_Q`. I gave up after a while and am sharing the image so that anyone who cares can try it. This image also contains my AWQ patch for Qwen3. If you run `pip uninstall flashinfer` and then compile a model (or just pass `--flashinfer 0`), you will see `The model will fallback to TIR-based KV cache.` and it actually runs, but slower than the flashinfer alternative, and I think the prefill length is heavily limited (no pages as far as I understood, but I might be wrong). If you compile with flashinfer, compilation still succeeds, but you get a CUDA error when it tries to allocate the KV cache with the generated flashinfer kernels.
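One way to tell which attention backend a compiled model library actually ended up with (a hedged check; the `.so` path is the one produced by the compile commands further below): the flashinfer build links generated entry points such as `batch_prefill_with_ragged_kv_cache_run` into the lib (that exact symbol shows up in the crash backtrace at the bottom of this page), so if they are absent you are on the TIR-based KV cache path.

```
nm -D /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu124-sm72.so \
  | grep -i with_ragged_kv_cache_run \
  || echo "no flashinfer prefill entry points found"
```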
```
git clone https://github.com/corupta/jetson-containers-jp5.git && cd jetson-containers-jp5 && git checkout ???
LSB_RELEASE=22.04 CUDA_VERSION=12.4 PYTHON_VERSION=3.12 PYTORCH_VERSION=2.7.1 NUMPY_VERSION=1 jetson-containers build mlc:0.20.0
```
Some more relevant commands :)
```
# start the container in the background
docker run -dit --rm --gpus all -v /mnt/nvme/cache:/root/.cache -p 9000:9000 \
  -e HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN} \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  corupta/mlc:0.20.0-buggyfi-r35.6.1-cp312-cu124-22.04

# attach to it
docker exec -it [container-id] bash

# inside the container: compile the model lib, then serve it
export TVM_HOME=/opt/mlc-llm/3rdparty/tvm
export MLC_MODEL_PATH=/root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/
export MODEL_LIB=${MLC_MODEL_PATH}aarch64-cu124-sm72.so

mlc_llm compile --device cuda --opt O3 $MLC_MODEL_PATH -o $MODEL_LIB \
  --debug-dump /root/.cache/mlc_llm/debug \
  --overrides='prefill_chunk_size=2048' \
  --flashinfer 0

mlc_llm serve --mode interactive --device cuda \
  --host 0.0.0.0 --port 9000 \
  --model-lib $MODEL_LIB $MLC_MODEL_PATH
```
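Once `mlc_llm serve` is up, it speaks the usual OpenAI-style REST API on the published port, so a quick smoke test from the host looks roughly like this (the `model` id below is an assumption; check `/v1/models` for the id the server actually registered):

```
# which model id did the server register?
curl -s http://localhost:9000/v1/models

# minimal chat completion request
curl -s http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```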
- The MLC 0.20.0 images above include my patch that introduces AWQ quantization for the Qwen3 model.
- I was able to build MLC without flashinfer for Xavier; that image is not published yet.
- I patched flashinfer many, many times to get it working on Xavier. I was able to build MLC with flashinfer, but it is somewhat problematic at runtime. It might have to do with the shared memory size per block; I'm digging deeper at the moment (see the quick check after this list).
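A quick way to put numbers on that suspicion, using the TVM runtime already inside the image (a sketch; note that CUDA kernels may only use 48 KB of shared memory per block unless they explicitly opt in to more via `cudaFuncSetAttribute`):

```
# reports the device name, compute capability and the per-block shared memory limit
python3 -c "import tvm; d = tvm.cuda(0); print(d.device_name, d.compute_version, 'max shared mem per block:', d.max_shared_memory_per_block)"
```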
TODO NOT READY YET
```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  mlc:r35.6.1-cp312-cu124-22.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```
Deepseek R1 0528 Distilled Qwen3 (int4 awq) Example on Xavier AGX 32GB
TODO TAKE A VIDEO RECORDING
Docker Run Command
```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```
Deepseek R1 0528 Distilled Qwen3 (int4 awq) Example on Orin AGX 64GB
TODO TAKE A VIDEO RECORDING
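Until there is a recording: the container above is started detached, so the easiest way to watch the model being loaded (and then poke it on port 9000, e.g. with the curl calls shown earlier) is to follow its logs:

```
docker logs -f llm_server
```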
✅ Build MLC with flashinfer for Orin, tag it as 0.20.0

Build succeeded for Orin. flashinfer is problematic in the dustynv images (they report "falling back to TIR-based KV cache" when models are run) but works fine here. Also, the AWQ MLC DeepSeek Qwen3 model works much better than any other quant.

❌ Build MLC with flashinfer for Xavier, tag it as 0.20.0

Yeah, I built it: it compiles successfully when given a model, but some CUDA terror happens when it runs. :) It might be that the current version of the flashinfer kernels tries to use too much shared memory per block; I'm trying to validate that this is the issue at the moment.

```
CUDA Error: unspecified launch failure (719) /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/prefill.cuh: line 2466 at function cudaLaunchKernel((void*)kernel, nblks, nthrs, args, smem_size, stream)
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "tvm/_ffi/_cython/packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
File "/usr/local/lib/python3.12/dist-packages/tvm/_ffi/base.py", line 468, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (8) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(+0x231d5e0) [0xffff8867d5e0]
[bt] (7) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(+0x231d4d4) [0xffff8867d4d4]
[bt] (6) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::AttentionWithFusedQKV(long, tvm::runtime::NDArray, tvm::runtime::Optional<tvm::runtime::NDArray>, tvm::runtime::NDArray, double)+0x77c) [0xffff886bbf2c]
[bt] (5) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::AttentionInternal(long, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, double)+0x130) [0xffff886bb500]
[bt] (4) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::MHASelfAttnInternal(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, double)+0x19c) [0xffff886badfc]
[bt] (3) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::FlashInferRaggedPrefillFunc::MHA(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, tvm::runtime::relax_vm::RoPEMode, double, double, double, tvm::runtime::NDArray, tvm::runtime::NDArray, void*)+0x1d0) [0xffff8865d680]
[bt] (2) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(+0x22bde18) [0xffff8861de18]
[bt] (1) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(TVMThrowLastError+0x4a4) [0xffff885d4eb8]
[bt] (0) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff88620390]
[bt] (8) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::AttentionWithFusedQKV(long, tvm::runtime::NDArray, tvm::runtime::Optional<tvm::runtime::NDArray>, tvm::runtime::NDArray, double)+0x77c) [0xffff886bbf2c]
[bt] (7) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::AttentionInternal(long, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, double)+0x130) [0xffff886bb500]
[bt] (6) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::MHASelfAttnInternal(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, double)+0x19c) [0xffff886badfc]
[bt] (5) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::FlashInferRaggedPrefillFunc::MHA(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, tvm::runtime::relax_vm::RoPEMode, double, double, double, tvm::runtime::NDArray, tvm::runtime::NDArray, void*)+0x1d0) [0xffff8865d680]
[bt] (4) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(+0x22bdd18) [0xffff8861dd18]
[bt] (3) /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu124-sm72.so(batch_prefill_with_ragged_kv_cache_run+0x5b0) [0xffff3c7678b0]
[bt] (2) /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu124-sm72.so(BatchPrefillWithRaggedKVCacheRun(DLTensor*, DLTensor*, tvm::runtime::ShapeTuple, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, long, long, long, long, double, double, double, void*)+0x23fc) [0xffff3c77cc80]
[bt] (1) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff871b4d78]
[bt] (0) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff88620390]
File "/root/.cache/flashinfer/72_87/generated/batch_prefill_tvm_dtype_q_float16_dtype_kv_float16_dtype_o_float16_qk_head_dim_128_v_head_dim_128_enable_inline_rope_False/batch_prefill.cu", line 218
TVMError: Check failed: (status == cudaSuccess) is false: BatchPrefillWithRaggedKVCache failed with error unspecified launch failure
```
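For anyone who wants to dig further: the paths in that backtrace are real files inside the container, so you can look at what the flashinfer JIT actually generated and what the launch site passes as `smem_size`. This is only where I would start looking, based on the `CTA_TILE_Q` / shared-memory suspicion above, not a known fix:

```
# the generated CUDA source for the failing prefill kernel (path taken from the backtrace)
ls /root/.cache/flashinfer/72_87/generated/
grep -n "CTA_TILE_Q" \
  /root/.cache/flashinfer/72_87/generated/batch_prefill_tvm_dtype_q_float16_dtype_kv_float16_dtype_o_float16_qk_head_dim_128_v_head_dim_128_enable_inline_rope_False/batch_prefill.cu

# the launch site named in the error (prefill.cuh line 2466), including the smem_size it requests
sed -n '2440,2480p' /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/prefill.cuh
```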