TensorRT LLM (Orin) - corupta/jetson-containers-jp5 GitHub Wiki

Docker Images

  • Xavier (JP5): NOPE
  • Orin (JP6): READY corupta/tensorrt_llm:0.21.1-r36.4-cp312-cu128-24.04
git clone https://github.com/corupta/jetson-containers-jp5.git && cd jetson-containers-jp5 && git checkout 996ff24
LSB_RELEASE=24.04 CUDA_VERSION=12.8 PYTHON_VERSION=3.12 PYTORCH_VERSION=2.7 NUMPY_VERSION=1 TENSORRT_VERSION=10.7 jetson-containers build tensorrt_llm:0.21.1
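If you just want the image without building it yourself, the same tag is published on Docker Hub (see the Roadmap note below), so pulling the prebuilt image should work as well:

# pull the prebuilt Orin image instead of building it locally
docker pull corupta/tensorrt_llm:0.21.1-r36.4-cp312-cu128-24.04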

Old Images

corupta/tensorrt_llm:0.21.0-r36.4-cp312-cu128-24.04
git clone https://github.com/corupta/jetson-containers-jp5.git && cd jetson-containers-jp5 && git checkout 8ca4c27
LSB_RELEASE=24.04 CUDA_VERSION=12.8 PYTHON_VERSION=3.12 PYTORCH_VERSION=2.7 NUMPY_VERSION=1 TENSORRT_VERSION=10.7 jetson-containers build tensorrt_llm:0.21.0

Notes

  • 0.21.0 corresponds to the 0.21.0rc1 commit NVIDIA/TensorRT-LLM#9c012d5
  • 0.21.1 corresponds to commit NVIDIA/TensorRT-LLM#5d4ab47 (a quick way to verify the shipped version is sketched after this list)
  • I spent almost 50 hours trying to make newer versions of TensorRT work to get this one running on Xavier, but no, we can only use TensorRT 8.6 on it; libnvinfer is not open source, and we can't patch it easily. There remains a slight chance that I can patch the whole TensorRT-LLM to make it run on Xavier, but no promises.
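To verify which TensorRT-LLM build a container actually ships, a quick version check works; this is a minimal sketch that assumes python3 and the tensorrt_llm wheel are on the default path inside the image:

# print the TensorRT-LLM version baked into the image
docker run --rm --gpus all corupta/tensorrt_llm:0.21.1-r36.4-cp312-cu128-24.04 \
  python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"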

Jetson Orin Series Example

docker run deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

In the end, I decided that the unquantized version of this model is better at coding than the alternatives. For performance, one can prefer the Intel AutoRound version, which is the best quantized version I have seen (even int8 versions are worse than AutoRound int4). TensorRT-LLM seems promising for running the unquantized version fast enough.

docker run -dit --rm --gpus all -v /mnt/nvme/cache:/root/.cache -p 9000:9000 \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -e HF_TOKEN=${HUGGINGFACE_TOKEN} \
  -e TORCHINDUCTOR_CACHE_DIR=/root/.cache/torchinductor_root \
  corupta/tensorrt_llm:0.21.1-r36.4-cp312-cu128-24.04 \
  trtllm-serve \
  deepseek-ai/DeepSeek-R1-0528-Qwen3-8B \
  --host 0.0.0.0 \
  --port 9000 \
  --backend pytorch \
  --max_batch_size 1 \
  --kv_cache_free_gpu_memory_fraction 0.95
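Once the server is up, it can be queried through the OpenAI-compatible API that trtllm-serve exposes. A minimal sketch, assuming the default /v1/chat/completions route and reusing the temperature of 0.6 from the runs below (the prompt is just an example):

# ask the served model a coding question via the OpenAI-compatible endpoint
curl -s http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
        "messages": [{"role": "user", "content": "Write a Python function that reverses a string."}],
        "temperature": 0.6,
        "max_tokens": 512
      }'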

Deepseek R1 0528 Distilled Qwen3 (bf16) Example

trt_llm_orin.mp4

Results after trying 4 times with the temperature set to 0.6

trt_llm_orin2.mp4

Roadmap

❌ Build tensorrt-llm:0.21.0 for Xavier: it might not be a good idea :/
✅ Build tensorrt-llm:0.21.0 for Orin: even building it for Orin required several patches. I successfully built it once but had some runtime issues. I'm currently fixing those issues to reach a stable version, but PyPI went down while doing so; I'll continue when it is back up.
✅ Re-build tensorrt-llm:0.21.1 for Orin: I found that the TensorRT-LLM fused attention kernels were problematic because they are built for sm86 but not sm87. My patch roughly replaces almost every 86 with 87 in the custom TensorRT-LLM kernels :). Fun fact: while I was patching the latest commit, they pushed a new one, but its changes won't matter much for now (they are improving MoE, which would be super useful once Qwen3 Coder 30B A3B is released; it'd probably be the best model to run on Jetson Thor for coding :)). Fixing those kernels took several days. The build is complete and pushed to Docker Hub. Fused kernels work; I still need to check whether speculative decoding works, and if not I might need a new build for that.
🔄 Re-build a newer stable tensorrt-llm version for Orin: it might be good to implement the mmap feature from the v0.12-jetson branch. I think it will only affect the TensorRT-LLM C++ runtime (i.e., when the PyTorch backend is not used, but I'm not sure). Also, let's wait for TensorRT-LLM to reach a more stable version, as their rope YaRN implementation differs from sglang's (regarding attn_factor and mscale; it could be fixed easily, but let's let the TensorRT-LLM people do that, along with the many other things they are developing).
🔄 Build tensorrt-llm:0.22.0 😄 for Thor (when both are released): let's see :)