Local models documentation - Azure/azureml-assets GitHub Wiki

Local

Models in this category
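
Each entry below is an ONNX conversion of a base model, published for on-device inference on CPU, GPU, or NPU targets. A short note on the RTN quantization mentioned throughout, and a minimal local-inference sketch, follow the list.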


  • deepseek-r1-distill-llama-8b

    This model is an optimized version of DeepSeek-R1-Distill-Llama-8B for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of ...

  • deepseek-r1-distill-llama-8b-cuda-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Llama-8B to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R...

  • deepseek-r1-distill-llama-8b-generic-cpu

    This model is an optimized version of DeepSeek-R1-Distill-Llama-8B to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Dis...

  • deepseek-r1-distill-llama-8b-generic-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Llama-8B to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Dis...

  • deepseek-r1-distill-qwen-1.5b

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-1.5B for local inference. Optimized models are published here in ONNX format to run on CPU, GPU, and NPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to e...

  • deepseek-r1-distill-qwen-1.5b-cuda-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-1.5B to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-...

  • deepseek-r1-distill-qwen-1.5b-generic-cpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-1.5B to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Di...

  • deepseek-r1-distill-qwen-1.5b-generic-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-1.5B to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Di...

  • deepseek-r1-distill-qwen-1.5b-qnn-npu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-1.5B to enable local inference on QNN NPUs. This model uses QuaRot and GPTQ quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of th...

  • DeepSeek-R1-Distill-Qwen-1.5B-trtrtx-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-1.5B to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Distill-Qwen-1.5B for l...

  • deepseek-r1-distill-qwen-14b

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-14B for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of ...

  • deepseek-r1-distill-qwen-14b-cuda-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-14B to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R...

  • deepseek-r1-distill-qwen-14b-generic-cpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-14B to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Dis...

  • deepseek-r1-distill-qwen-14b-generic-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-14B to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Dis...

  • deepseek-r1-distill-qwen-14b-qnn-npu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-14B to enable local inference on QNN NPUs. This model uses QuaRot and GPTQ quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the...

  • deepseek-r1-distill-qwen-14b-trtrtx-gpu

    This model is an optimized version of deepseek-r1-distill-qwen-14b to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the deepseek-r1-distill-qwen-14b for loc...

  • deepseek-r1-distill-qwen-7b

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-7B to enable local inference. Optimized models are published here in ONNX format to run on CPU, GPU, and NPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited ...

  • deepseek-r1-distill-qwen-7b-cuda-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-7B to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1...

  • deepseek-r1-distill-qwen-7b-generic-cpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-7B to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Dist...

  • deepseek-r1-distill-qwen-7b-generic-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-7B to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Dist...

  • DeepSeek-R1-Distill-Qwen-7B-openvino-npu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-7B to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Distill-Qwen-7B for local infere...

  • deepseek-r1-distill-qwen-7b-qnn-npu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-7B to enable local inference on QNN NPUs. This model uses QuaRot and GPTQ quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the ...

  • DeepSeek-R1-Distill-Qwen-7B-trtrtx-gpu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-7B to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Distill-Qwen-7B for local...

  • DeepSeek-R1-Distill-Qwen-7B-vitis-npu

    This model is an optimized version of DeepSeek-R1-Distill-Qwen-7B to enable local inference on AMD NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the DeepSeek-R1-Distill-Qwen-7B for local inferenc...

  • Mistral-7B-Instruct-v0-2-openvino-npu

    This model is an optimized version of Mistral-7B-Instruct-v0.2 to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Mistral-7B-Instruct-v0.2 for local infer...

  • Mistral-7B-Instruct-v0-2-vitis-npu

    This model is an optimized version of Mistral-7B-Instruct-v0.2 to enable local inference on AMD NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Mistral-7B-Instruct-v0.2 for local inferen...

  • mistralai-Mistral-7B-Instruct-v0-2-cuda-gpu

    This model is an optimized version of Mistral-7B-Instruct-v0.2 to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Mistral...

  • mistralai-Mistral-7B-Instruct-v0-2-generic-cpu

    This model is an optimized version of Mistral-7B-Instruct-v0.2 to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Mistral-7B-In...

  • mistralai-Mistral-7B-Instruct-v0-2-generic-gpu

    This model is an optimized version of Mistral-7B-Instruct-v0.2 to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Mistral-7B-I...

  • openai-whisper-large-v3-turbo

    This model is an optimized version of Whisper Large V3 Turbo for local inference. Optimized models are published here in ONNX format to run on CPU, GPU, and NPU across devices, including server platforms, desktops, and mobile, with the precision best suited to each of these targets.

Review the...

  • openai-whisper-large-v3-turbo-cuda-gpu

    Whisper Large V3 Turbo is an advanced speech recognition model, optimized for high-performance GPU inference. It is suitable for automatic speech recognition (ASR) tasks in various domains, leveraging large-scale training data for robust multilingual transcription. This model is an optimized vers...

  • openai-whisper-large-v3-turbo-generic-cpu

    Whisper Large V3 Turbo is an advanced speech recognition model, optimized for high-performance CPU inference. It is suitable for automatic speech recognition (ASR) tasks in various domains, leveraging large-scale training data for robust multilingual transcription. This model is designed for scen...

  • openai-whisper-tiny-generic-cpu

    Whisper is an OpenAI pre-trained speech recognition model with potential applications for ASR solutions for developers. However, due to weak supervision and large-scale noisy data, it should be used with caution in high-risk domains. The model has been trained on 680k hours of audio data represen...

  • Phi-3-mini-128k-instruct-cuda-gpu

    This model is an optimized version of Phi-3-Mini-128K-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3-Mini-128...

  • Phi-3-mini-128k-instruct-generic-cpu

    This model is an optimized version of Phi-3-Mini-128K-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3-Mini-128K-Ins...

  • Phi-3-mini-128k-instruct-generic-gpu

    This model is an optimized version of Phi-3-Mini-128K-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3-Mini-128K-Ins...

  • phi-3-mini-128k-instruct-qnn-npu

    This model is an optimized version of phi-3-mini-128k-instruct to enable local inference on QNN NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the phi-3-mini-128k-instruct for local inference on Q...

  • phi-3-mini-128k-instruct-trtrtx-gpu

    This model is an optimized version of phi-3-mini-128k-instruct to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the phi-3-mini-128k-instruct for local infer...

  • phi-3-mini-128k-instruct-vitis-npu

    This model is an optimized version of Phi-3-mini-128k-instruct to enable local inference on AMD NPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3-mini-128k...

  • Phi-3-mini-4k-instruct-cuda-gpu

    This model is an optimized version of Phi-3-Mini-4K-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3-Mini-4K-In...

  • Phi-3-mini-4k-instruct-generic-cpu

    This model is an optimized version of Phi-3-Mini-4K-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3-Mini-4K-Instruc...

  • Phi-3-mini-4k-instruct-generic-gpu

    This model is an optimized version of Phi-3-Mini-4K-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3-Mini-4K-Instruc...

  • Phi-3-mini-4k-instruct-openvino-npu

    This model is an optimized version of Phi-3-Mini-4K-Instruct to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3-Mini-4K-Instruct for local inference on Int...

  • phi-3-mini-4k-instruct-qnn-npu

    This model is an optimized version of phi-3-mini-4k-instruct to enable local inference on QNN NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the phi-3-mini-4k-instruct for local inference on QNN N...

  • phi-3-mini-4k-instruct-trtrtx-gpu

    This model is an optimized version of phi-3-mini-4k-instruct to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the phi-3-mini-4k-instruct for local inference...

  • Phi-3-mini-4k-instruct-vitis-npu

    This model is an optimized version of Phi-3-Mini-4K-Instruct to enable local inference on AMD NPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3-Mini-4K-Ins...

  • phi-3.5-mini-128k-instruct-qnn-npu

    This model is an optimized version of phi-3.5-mini-128k-instruct to enable local inference on QNN NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the phi-3.5-mini-128k-instruct for local inference ...

  • phi-3.5-mini-128k-instruct-trtrtx-gpu

    This model is an optimized version of Phi-3.5-mini-instruct to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3.5-mini-instruct for local inference o...

  • Phi-3.5-mini-instruct-cuda-gpu

    This model is an optimized version of Phi-3.5-mini-instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3.5-mini-inst...

  • Phi-3.5-mini-instruct-generic-cpu

    This model is an optimized version of Phi-3.5-mini-instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3.5-mini-instruct ...

  • Phi-3.5-mini-instruct-generic-gpu

    This model is an optimized version of Phi-3.5-mini-instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-3.5-mini-instruct ...

  • Phi-4-cuda-gpu

    This model is an optimized version of Phi-4 to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4 for local inference on CUDA...

  • Phi-4-generic-cpu

    This model is an optimized version of Phi-4 to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4 for local inference on CPUs.

  • *...

  • Phi-4-generic-gpu

    This model is an optimized version of Phi-4 to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4 for local inference on GPUs.

  • *...

  • Phi-4-mini-instruct-cuda-gpu

    This model is an optimized version of Phi-4-mini-instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-instruct...

  • Phi-4-mini-instruct-generic-cpu

    This model is an optimized version of Phi-4-mini-instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-instruct for ...

  • Phi-4-mini-instruct-generic-gpu

    This model is an optimized version of Phi-4-mini-instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-instruct for ...

  • phi-4-mini-instruct-openvino-npu

    This model is an optimized version of Phi-4-mini-instruct to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-instruct for local inference on Intel NPU...

  • phi-4-mini-instruct-vitis-npu

    This model is an optimized version of Phi-4-mini-instruct to enable local inference on AMD NPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-instruct ...

  • Phi-4-mini-reasoning-cuda-gpu

    This model is an optimized version of Phi-4-mini-reasoning to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-reasoni...

  • Phi-4-mini-reasoning-generic-cpu

    This model is an optimized version of Phi-4-mini-reasoning to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-reasoning fo...

  • Phi-4-mini-reasoning-generic-gpu

    This model is an optimized version of Phi-4-mini-reasoning to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-reasoning fo...

  • Phi-4-mini-reasoning-openvino-npu

    This model is an optimized version of Phi-4-mini-reasoning to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-reasoning for local inference on Intel NP...

  • Phi-4-mini-reasoning-qnn-npu

    This model is an optimized version of Phi-4-mini-reasoning to enable local inference on QNN NPUs. This model uses QuaRot and GPTQ quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-m...

  • Phi-4-mini-reasoning-vitis-npu

    This model is an optimized version of Phi-4-mini-reasoning to enable local inference on AMD NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-mini-reasoning for local inference on AMD NPUs. ...

  • Phi-4-reasoning-cuda-gpu

    This model is an optimized version of Phi-4-reasoning to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-reasoning for loc...

  • Phi-4-reasoning-generic-cpu

    This model is an optimized version of Phi-4-reasoning to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-reasoning for local in...

  • Phi-4-reasoning-generic-gpu

    This model is an optimized version of Phi-4-reasoning to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4-reasoning for local in...

  • Phi-4-trtrtx-gpu

    This model is an optimized version of Phi-4 to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: MIT

  • Model Description: This is a conversion of the Phi-4 for local inference on TensorRT-RTX GPUs.

  • **Disclai...

  • qwen2.5-0.5b-instruct

    This model is an optimized version of Qwen2.5-0.5B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these t...

  • qwen2.5-0.5b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-0.5B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-0....

  • qwen2.5-0.5b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-0.5B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-0.5B-In...

  • qwen2.5-0.5b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-0.5B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-0.5B-In...

  • qwen2.5-0.5b-instruct-openvino-npu

    This model is an optimized version of Qwen2.5-0.5B-Instruct to enable local inference on Intel NPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-0...

  • qwen2.5-0.5b-instruct-trtrtx-gpu

    This model is an optimized version of Qwen2.5-0.5B-Instruct to enable local inference on TensorRT-RTX GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qw...

  • qwen2.5-0.5b-instruct-vitis-npu

    This model is an optimized version of Qwen2.5-0.5B-Instruct to enable local inference on AMD NPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-0.5...

  • qwen2.5-1.5b-instruct

    This model is an optimized version of Qwen2.5-1.5B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these t...

  • qwen2.5-1.5b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-1.5B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-1....

  • qwen2.5-1.5b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-1.5B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-1.5B-In...

  • qwen2.5-1.5b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-1.5B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-1.5B-In...

  • qwen2.5-1.5b-instruct-openvino-npu

    This model is an optimized version of Qwen2.5-1.5B-Instruct to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-1.5B-Instruct for local inference o...

  • qwen2.5-1.5b-instruct-qnn-npu

    This model is an optimized version of qwen2.5-1.5b-instruct to enable local inference on QNN NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the qwen2.5-1.5b-instruct for local inference on QNN NPU...

  • qwen2.5-1.5b-instruct-test-openvino-npu

    This model is an optimized version of Qwen2.5-1.5B-Instruct to enable local inference on Intel NPUs. This model uses post-training quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the...

  • qwen2.5-1.5b-instruct-test-qnn-npu

    This model is an optimized version of Qwen2.5-1.5B-Instruct to enable local inference on Qualcomm NPUs. This model uses post-training quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of ...

  • qwen2.5-1.5b-instruct-test-vitis-npu

    This model is an optimized version of Qwen2.5-1.5B-Instruct to enable local inference on AMD NPUs. This model uses post-training quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Q...

  • qwen2.5-1.5b-instruct-trtrtx-gpu

    This model is an optimized version of Qwen2.5-1.5B-Instruct to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-1.5B-Instruct for local inference o...

  • qwen2.5-14b-instruct

    This model is an optimized version of Qwen2.5-14B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these ta...

  • qwen2.5-14b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-14B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-14B...

  • qwen2.5-14b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-14B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-14B-Inst...

  • qwen2.5-14b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-14B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-14B-Inst...

  • qwen2.5-14b-instruct-trtrtx-gpu

    This model is an optimized version of Qwen2.5-14B-Instruct to enable local inference on TensorRT-RTX GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwe...

  • qwen2.5-3b-instruct

    This model is an optimized version of Qwen2.5-3B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these tar...

  • qwen2.5-3b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-3B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-3B-I...

  • qwen2.5-3b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-3B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-3B-Instru...

  • qwen2.5-3b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-3B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-3B-Instru...

  • qwen2.5-7b-instruct

    This model is an optimized version of Qwen2.5-7B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these tar...

  • qwen2.5-7b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-7B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-7B-I...

  • qwen2.5-7b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-7B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-7B-Instru...

  • qwen2.5-7b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-7B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-7B-Instru...

  • qwen2.5-7b-instruct-openvino-npu

    This model is an optimized version of Qwen2.5-7B-Instruct to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-7B-Instruct for local inference on In...

  • qwen2.5-7b-instruct-qnn-npu

    This model is an optimized version of qwen2.5-7b-instruct to enable local inference on QNN NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the qwen2.5-7b-instruct for local inference on QNN NPUs. -...

  • qwen2.5-7b-instruct-trtrtx-gpu

    This model is an optimized version of Qwen2.5-7B-Instruct to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-7B-Instruct for local inferenc...

  • qwen2.5-7b-instruct-vitis-npu

    This model is an optimized version of Qwen2.5-7B-Instruct to enable local inference on AMD NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-7B-Instruct for local inference on AMD ...

  • qwen2.5-coder-0.5b-instruct

    This model is an optimized version of Qwen2.5-Coder-0.5B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of t...

  • qwen2.5-coder-0.5b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-Coder-0.5B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen...

  • qwen2.5-coder-0.5b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-Coder-0.5B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-C...

  • qwen2.5-coder-0.5b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-Coder-0.5B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-C...

  • qwen2.5-coder-0.5b-instruct-openvino-npu

    This model is an optimized version of Qwen2.5-Coder-0.5B-Instruct to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Coder-0.5B-Instruct for local...

  • qwen2.5-coder-0.5b-instruct-trtrtx-gpu

    This model is an optimized version of Qwen2.5-Coder-0.5B-Instruct to enable local inference on TensorRT-RTX GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of ...

  • qwen2.5-coder-0.5b-instruct-vitis-npu

    This model is an optimized version of Qwen2.5-Coder-0.5B-Instruct to enable local inference on AMD NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Coder-0.5B-Instruct for local in...

  • qwen2.5-coder-1.5b-instruct

    This model is an optimized version of Qwen2.5-Coder-1.5B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of t...

  • qwen2.5-coder-1.5b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-Coder-1.5B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen...

  • qwen2.5-coder-1.5b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-Coder-1.5B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-C...

  • qwen2.5-coder-1.5b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-Coder-1.5B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-C...

  • qwen2.5-coder-1.5b-instruct-openvino-npu

    This model is an optimized version of Qwen2.5-Coder-1.5B-Instruct to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Coder-1.5B-Instruct for local...

  • qwen2.5-coder-1.5b-instruct-trtrtx-gpu

    This model is an optimized version of Qwen2.5-Coder-1.5B-Instruct to enable local inference on TensorRT-RTX GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of ...

  • qwen2.5-coder-1.5b-instruct-vitis-npu

    This model is an optimized version of Qwen2.5-Coder-1.5B-Instruct to enable local inference on AMD NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Coder-1.5B-Instruct for local i...

  • qwen2.5-coder-14b-instruct

    This model is an optimized version of Qwen2.5-Coder-14B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of th...

  • qwen2.5-coder-14b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-Coder-14B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2...

  • qwen2.5-coder-14b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-Coder-14B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Co...

  • qwen2.5-coder-14b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-Coder-14B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Co...

  • qwen2.5-coder-14b-instruct-trtrtx-gpu

    This model is an optimized version of Qwen2.5-Coder-14B-Instruct to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Coder-14B-Instruct for ...

  • qwen2.5-coder-3b-instruct

    This model is an optimized version of Qwen2.5-Coder-3B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of the...

  • qwen2.5-coder-3b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-Coder-3B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2....

  • qwen2.5-coder-3b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-Coder-3B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Cod...

  • qwen2.5-coder-3b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-Coder-3B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Cod...

  • qwen2.5-coder-7b-instruct

    This model is an optimized version of Qwen2.5-Coder-7B-Instruct for local inference. Optimized models are published here in ONNX format to run on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of the...

  • qwen2.5-coder-7b-instruct-cuda-gpu

    This model is an optimized version of Qwen2.5-Coder-7B-Instruct to enable local inference on CUDA GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2....

  • qwen2.5-coder-7b-instruct-generic-cpu

    This model is an optimized version of Qwen2.5-Coder-7B-Instruct to enable local inference on CPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Cod...

  • qwen2.5-coder-7b-instruct-generic-gpu

    This model is an optimized version of Qwen2.5-Coder-7B-Instruct to enable local inference on GPUs. This model uses RTN quantization.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Cod...

  • qwen2.5-coder-7b-instruct-openvino-npu

    This model is an optimized version of Qwen2.5-Coder-7B-Instruct to enable local inference on Intel NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Coder-7B-Instruct for local inf...

  • qwen2.5-coder-7b-instruct-trtrtx-gpu

    This model is an optimized version of Qwen2.5-Coder-7B-Instruct to enable local inference on TensorRT-RTX GPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Coder-7B-Instruct for lo...

  • qwen2.5-coder-7b-instruct-vitis-npu

    This model is an optimized version of Qwen2.5-Coder-7B-Instruct to enable local inference on AMD NPUs.

Model Description

  • Developed by: Microsoft

  • Model type: ONNX

  • License: apache-2.0

  • Model Description: This is a conversion of the Qwen2.5-Coder-7B-Instruct for local infer...
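
Quantization note

Many variants above state that they use RTN quantization. RTN (round-to-nearest) rescales each group of weights onto a low-bit integer grid and rounds, requiring no calibration data, which is why it can be applied uniformly across so many models. The sketch below illustrates symmetric per-group INT4 RTN in NumPy; the group size, signedness, and zero-point handling are assumptions for illustration, not the recipe used to produce these assets.

```python
import numpy as np

def rtn_quantize(weights: np.ndarray, bits: int = 4, group_size: int = 32):
    """Symmetric round-to-nearest (RTN) quantization over fixed-size groups.

    Illustrative sketch only: the published models were converted with
    Microsoft's own tooling, whose exact settings are not documented here.
    """
    qmax = 2 ** (bits - 1) - 1                   # 7 for signed INT4
    groups = weights.reshape(-1, group_size)     # assumes len(weights) % group_size == 0
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

# Dequantization is just q * scale; inference kernels fuse this into the matmul.
w = np.random.randn(1024).astype(np.float32)
q, s = rtn_quantize(w)
w_hat = (q * s).reshape(w.shape)
print("max abs error:", float(np.abs(w - w_hat).max()))
```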
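
Running a model locally

The model folders in this catalog follow the ONNX Runtime GenAI layout, so once a variant has been downloaded it can typically be driven with the onnxruntime-genai package. This is a minimal sketch, assuming a hypothetical local path and the CPU package (the *-cuda-gpu variants need the CUDA build); the generation API differs slightly across onnxruntime-genai releases, and each model family expects its own chat template.

```python
# pip install onnxruntime-genai   (use onnxruntime-genai-cuda for *-cuda-gpu variants)
import onnxruntime_genai as og

# Hypothetical path to a downloaded variant from this catalog.
model_dir = "./deepseek-r1-distill-qwen-1.5b-generic-cpu"

model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)

# Prompt formats are model-specific; this template is only a placeholder.
prompt = "<|User|>Why is the sky blue?<|Assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))  # append_tokens requires onnxruntime-genai >= 0.5
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```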