MLC (Orin & Xavier)
- Xavier(JP5): TODO
- Orin(JP6): `corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04`
```
git clone https://github.com/corupta/jetson-containers-jp5.git && cd jetson-containers-jp5 && git checkout 9b5ebf5
LSB_RELEASE=24.04 CUDA_VERSION=12.8 PYTHON_VERSION=3.12 PYTORCH_VERSION=2.7 NUMPY_VERSION=1 jetson-containers build mlc:0.20.0
```
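A quick sanity check of the freshly built Orin image (just a sketch, assuming the tag above and that the wheels all installed correctly):

```
docker run --rm --gpus all corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
  python3 -c "import tvm, flashinfer, mlc_llm; print('tvm', tvm.__version__); print('flashinfer at', flashinfer.__file__)"
```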
- Xavier(JP5): `corupta/mlc:0.20.0-buggyfi-r35.6.1-cp312-cu124-22.04`
OK, this image has both flashinfer and MLC installed, but it can't use flashinfer fully; I'm honestly not sure why. It might be a memory issue with flashinfer, and it might be fixable at runtime if the generated plans could somehow be forced to use kernels with a lower `CTA_TILE_Q`. I gave up after a while and am sharing the image so that anyone who cares can try it. This image also contains my AWQ patch for Qwen3. If you run `pip uninstall flashinfer` and then compile a model (or just pass `--flashinfer 0`), you will see `The model will fallback to TIR-based KV cache.` and it actually runs, but slower than the flashinfer alternative, and I think the prefill length is heavily limited (no pages as far as I understood, but I might be wrong). If you compile with flashinfer, compilation still succeeds, but you get a CUDA error when it tries to allocate the KV cache with the generated flashinfer kernels.
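One way to tell which attention backend a compiled model library actually ended up with (a hedged check; the `.so` path is the one produced by the compile commands further below): the flashinfer build links generated entry points such as `batch_prefill_with_ragged_kv_cache_run` into the lib (that exact symbol shows up in the crash backtrace at the bottom of this page), so if they are absent you are on the TIR-based KV cache path.

```
nm -D /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu124-sm72.so \
  | grep -i with_ragged_kv_cache_run \
  || echo "no flashinfer prefill entry points found"
```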
```
git clone https://github.com/corupta/jetson-containers-jp5.git && cd jetson-containers-jp5 && git checkout ???
LSB_RELEASE=22.04 CUDA_VERSION=12.4 PYTHON_VERSION=3.12 PYTORCH_VERSION=2.7.1 NUMPY_VERSION=1 jetson-containers build mlc:0.20.0
```
Some more relevant commands :)
```
# start the container in the background
docker run -dit --rm --gpus all -v /mnt/nvme/cache:/root/.cache -p 9000:9000 \
  -e HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN} \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  corupta/mlc:0.20.0-buggyfi-r35.6.1-cp312-cu124-22.04

# attach to it
docker exec -it [container-id] bash

# inside the container: compile the model lib, then serve it
export TVM_HOME=/opt/mlc-llm/3rdparty/tvm
export MLC_MODEL_PATH=/root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/
export MODEL_LIB=${MLC_MODEL_PATH}aarch64-cu124-sm72.so

mlc_llm compile --device cuda --opt O3 $MLC_MODEL_PATH -o $MODEL_LIB \
  --debug-dump /root/.cache/mlc_llm/debug \
  --overrides='prefill_chunk_size=2048' \
  --flashinfer 0

mlc_llm serve --mode interactive --device cuda \
  --host 0.0.0.0 --port 9000 \
  --model-lib $MODEL_LIB $MLC_MODEL_PATH
```
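Once `mlc_llm serve` is up, it speaks the usual OpenAI-style REST API on the published port, so a quick smoke test from the host looks roughly like this (the `model` id below is an assumption; check `/v1/models` for the id the server actually registered):

```
# which model id did the server register?
curl -s http://localhost:9000/v1/models

# minimal chat completion request
curl -s http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```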
- The MLC 0.20.0 images above include my patch that introduces AWQ quantization for the Qwen3 model.
- I was able to build MLC without flashinfer for Xavier; that image is not published yet.
- I patched flashinfer many, many times to get it working on Xavier. I was able to build MLC with flashinfer, but it is somewhat problematic at runtime. It might have to do with the shared memory size per block; I'm digging deeper at the moment (see the quick check after this list).
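A quick way to put numbers on that suspicion, using the TVM runtime already inside the image (a sketch; note that CUDA kernels may only use 48 KB of shared memory per block unless they explicitly opt in to more via `cudaFuncSetAttribute`):

```
# reports the device name, compute capability and the per-block shared memory limit
python3 -c "import tvm; d = tvm.cuda(0); print(d.device_name, d.compute_version, 'max shared mem per block:', d.max_shared_memory_per_block)"
```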
TODO NOT READY YET
```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  mlc:r35.6.1-cp312-cu124-22.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```
Deepseek R1 0528 Distilled Qwen3 (int4 awq) Example on Xavier AGX 32GB
TODO TAKE A VIDEO RECORDING
Docker Run Command
```
docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
```
Deepseek R1 0528 Distilled Qwen3 (int4 awq) Example on Orin AGX 64GB
TODO TAKE A VIDEO RECORDING
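Until there is a recording: the container above is started detached, so the easiest way to watch the model being loaded (and then poke it on port 9000, e.g. with the curl calls shown earlier) is to follow its logs:

```
docker logs -f llm_server
```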
✅ Build MLC with flashinfer for Orin, tag it as 0.20.0

Build succeeded for Orin. flashinfer is problematic in the dustynv images (they report "falling back to TIR-based KV cache" when models are run) but works fine here. Also, the AWQ MLC DeepSeek Qwen3 model works much better than any other quant.

❌ Build MLC with flashinfer for Xavier, tag it as 0.20.0

Yeah, I built it: it compiles successfully when given a model, but some CUDA terror happens when it runs. :) It might be that the current version of the flashinfer kernels tries to use too much shared memory per block; I'm trying to validate that this is the issue at the moment.

```
CUDA Error: unspecified launch failure (719) /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/prefill.cuh: line 2466 at function cudaLaunchKernel((void*)kernel, nblks, nthrs, args, smem_size, stream)
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/usr/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "tvm/_ffi/_cython/packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
File "/usr/local/lib/python3.12/dist-packages/tvm/_ffi/base.py", line 468, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (8) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(+0x231d5e0) [0xffff8867d5e0]
[bt] (7) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(+0x231d4d4) [0xffff8867d4d4]
[bt] (6) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::AttentionWithFusedQKV(long, tvm::runtime::NDArray, tvm::runtime::Optional<tvm::runtime::NDArray>, tvm::runtime::NDArray, double)+0x77c) [0xffff886bbf2c]
[bt] (5) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::AttentionInternal(long, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, double)+0x130) [0xffff886bb500]
[bt] (4) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::MHASelfAttnInternal(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, double)+0x19c) [0xffff886badfc]
[bt] (3) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::FlashInferRaggedPrefillFunc::MHA(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, tvm::runtime::relax_vm::RoPEMode, double, double, double, tvm::runtime::NDArray, tvm::runtime::NDArray, void*)+0x1d0) [0xffff8865d680]
[bt] (2) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(+0x22bde18) [0xffff8861de18]
[bt] (1) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(TVMThrowLastError+0x4a4) [0xffff885d4eb8]
[bt] (0) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff88620390]
[bt] (8) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::AttentionWithFusedQKV(long, tvm::runtime::NDArray, tvm::runtime::Optional<tvm::runtime::NDArray>, tvm::runtime::NDArray, double)+0x77c) [0xffff886bbf2c]
[bt] (7) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::AttentionInternal(long, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, double)+0x130) [0xffff886bb500]
[bt] (6) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::PagedAttentionKVCacheObj::MHASelfAttnInternal(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, double)+0x19c) [0xffff886badfc]
[bt] (5) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::relax_vm::FlashInferRaggedPrefillFunc::MHA(tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, tvm::runtime::NDArray, bool, tvm::runtime::relax_vm::RoPEMode, double, double, double, tvm::runtime::NDArray, tvm::runtime::NDArray, void*)+0x1d0) [0xffff8865d680]
[bt] (4) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(+0x22bdd18) [0xffff8861dd18]
[bt] (3) /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu124-sm72.so(batch_prefill_with_ragged_kv_cache_run+0x5b0) [0xffff3c7678b0]
[bt] (2) /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu124-sm72.so(BatchPrefillWithRaggedKVCacheRun(DLTensor*, DLTensor*, tvm::runtime::ShapeTuple, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, DLTensor*, long, long, long, long, double, double, double, void*)+0x23fc) [0xffff3c77cc80]
[bt] (1) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff871b4d78]
[bt] (0) /usr/local/lib/python3.12/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff88620390]
File "/root/.cache/flashinfer/72_87/generated/batch_prefill_tvm_dtype_q_float16_dtype_kv_float16_dtype_o_float16_qk_head_dim_128_v_head_dim_128_enable_inline_rope_False/batch_prefill.cu", line 218
TVMError: Check failed: (status == cudaSuccess) is false: BatchPrefillWithRaggedKVCache failed with error unspecified launch failure
```
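For anyone who wants to dig further: the paths in that backtrace are real files inside the container, so you can look at what the flashinfer JIT actually generated and what the launch site passes as `smem_size`. This is only where I would start looking, based on the `CTA_TILE_Q` / shared-memory suspicion above, not a known fix:

```
# the generated CUDA source for the failing prefill kernel (path taken from the backtrace)
ls /root/.cache/flashinfer/72_87/generated/
grep -n "CTA_TILE_Q" \
  /root/.cache/flashinfer/72_87/generated/batch_prefill_tvm_dtype_q_float16_dtype_kv_float16_dtype_o_float16_qk_head_dim_128_v_head_dim_128_enable_inline_rope_False/batch_prefill.cu

# the launch site named in the error (prefill.cuh line 2466), including the smem_size it requests
sed -n '2440,2480p' /usr/local/lib/python3.12/dist-packages/flashinfer/data/include/flashinfer/attention/prefill.cuh
```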