# TensorRT LLM (Orin) - corupta/jetson-containers-jp5 GitHub Wiki
- Xavier (JP5): NOPE
- Orin (JP6): READY `corupta/tensorrt_llm:0.21.1-r36.4-cp312-cu128-24.04`
```shell
git clone https://github.com/corupta/jetson-containers-jp5.git && cd jetson-containers-jp5 && git checkout 996ff24
LSB_RELEASE=24.04 CUDA_VERSION=12.8 PYTHON_VERSION=3.12 PYTORCH_VERSION=2.7 NUMPY_VERSION=1 TENSORRT_VERSION=10.7 jetson-containers build tensorrt_llm:0.21.1
```
`corupta/tensorrt_llm:0.21.0-r36.4-cp312-cu128-24.04`
```shell
git clone https://github.com/corupta/jetson-containers-jp5.git && cd jetson-containers-jp5 && git checkout 8ca4c27
LSB_RELEASE=24.04 CUDA_VERSION=12.8 PYTHON_VERSION=3.12 PYTORCH_VERSION=2.7 NUMPY_VERSION=1 TENSORRT_VERSION=10.7 jetson-containers build tensorrt_llm:0.21.0
```
- 0.21.0 corresponds to the 0.21.0rc1 commit NVIDIA/TensorRT-LLM@9c012d5
- 0.21.1 corresponds to commit NVIDIA/TensorRT-LLM@5d4ab47
- I spent almost 50 hours trying to make newer versions of TensorRT work so this could run on Xavier, but no: Xavier is limited to TensorRT 8.6, and because libnvinfer is not open source we can't patch it easily. There remains a slight chance that I can patch the whole of TensorRT-LLM to make it run on Xavier, but no promises.
## docker run deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
In the end, I decided that the unquantized version of this model is better at coding than the alternatives. For performance, one can prefer the Intel AutoRound version, which is the best quantized variant I have seen (even int8 versions are worse than AutoRound int4). TensorRT-LLM seems promising for running the unquantized version fast enough.
```shell
docker run -dit --rm --gpus all -v /mnt/nvme/cache:/root/.cache -p 9000:9000 \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -e HF_TOKEN=${HUGGINGFACE_TOKEN} \
  -e TORCHINDUCTOR_CACHE_DIR=/root/.cache/torchinductor_root \
  corupta/tensorrt_llm:0.21.1-r36.4-cp312-cu128-24.04 \
  trtllm-serve \
    deepseek-ai/DeepSeek-R1-0528-Qwen3-8B \
    --host 0.0.0.0 \
    --port 9000 \
    --backend pytorch \
    --max_batch_size 1 \
    --kv_cache_free_gpu_memory_fraction 0.95
```
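`trtllm-serve` exposes an OpenAI-compatible HTTP API, so once the container is up the model can be queried with a plain chat-completions request. A minimal Python sketch using only the standard library; the host, port, and model name match the `docker run` above, and the endpoint path and field names follow the OpenAI API convention:

```python
import json
import urllib.request

def build_chat_request(prompt,
                       model="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
                       temperature=0.6):
    """Build the JSON body for an OpenAI-style /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def send_chat_request(body, host="localhost", port=9000):
    """POST the request to the running trtllm-serve endpoint and
    return the assistant's reply text."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Example usage (requires the container to be running):
# print(send_chat_request(build_chat_request("Reverse a string in Python.")))
```

Temperature 0.6 matches the sampling setting used in the results below; with `--max_batch_size 1`, send one request at a time.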
## Deepseek R1 0528 Distilled Qwen3 (bf16) Example
(video: trt_llm_orin.mp4)
Results after four attempts with the temperature set to 0.6.