Qwen3 Omni

Model size

Based on the "Qwen3-Omni Technical Report," the model has a modular architecture, and the parameter count is distributed across its different components. The main version detailed in the paper is the Qwen3-Omni-30B-A3B.

Table 1 on page 6 of the report provides a clear breakdown of the parameters for each module:

  • Thinker (MoE Transformer): 30B-A3B (30 billion total parameters, ~3 billion active per token)
  • Talker (MoE Transformer): 3B-A0.3B (3 billion total parameters, ~0.3 billion active per token)
  • Audio Encoder (AuT): 650M (650 million parameters)
  • Vision Encoder (SigLIP2-So400M): 540M (540 million parameters)
  • MTP (Dense Transformer): 80M (80 million parameters)
  • Code2wav (ConvNet): 200M (200 million parameters)

The "30B" in the model's name primarily refers to the size of the Thinker module, which handles the core multimodal understanding and reasoning. The total number of parameters is the sum of all these components, but the 30 billion parameter Thinker is the largest and most central part of the system.

The paper doesn't explicitly define the "A3B" suffix, but it is a standard convention for describing the architecture of Mixture-of-Experts (MoE) models.

Here's what it means in the context of "30B-A3B":

  • 30B: This refers to the total number of parameters in the model, which is 30 billion. This is the sum of all parameters across all the different "expert" networks within the model.
  • A3B: This stands for 3 Billion Active parameters. In an MoE model, not all parameters are used for every calculation. Instead, a routing mechanism selects a small subset of "experts" to process each input token. The "A3B" indicates that the combined size of the experts used for any single token is approximately 3 billion parameters.

In short, "30B-A3B" describes a Mixture-of-Experts model that has a vast knowledge capacity stored in 30 billion total parameters, but it remains computationally efficient during inference by only activating about 3 billion of those parameters at any given time.

As shown in Table 1 on page 6, this MoE architecture is used for both the Thinker (30B-A3B) and the Talker (3B-A0.3B) modules of Qwen3-Omni.
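To make the total-versus-active distinction concrete, here is a minimal top-k MoE routing sketch in PyTorch. It is a generic illustration, not the Qwen3-Omni implementation: the expert count, hidden size, and top-k value are arbitrary placeholders. Every expert contributes to the total parameter count, but only the top-k experts selected for a token contribute to that token's compute.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Generic top-k Mixture-of-Experts layer (illustrative only)."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):                               # x: [tokens, d_model]
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)                # mix the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                   # each token's k chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                 # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TinyMoELayer()
total = sum(p.numel() for p in layer.parameters())
active = layer.top_k * sum(p.numel() for p in layer.experts[0].parameters()) \
         + sum(p.numel() for p in layer.router.parameters())
print(f"total params: {total}, active per token (router + top-k experts): {active}")
```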

New contributions

Based on the technical report, Qwen3-Omni introduces several significant contributions that distinguish it from prior studies in the literature. The central theme is achieving a truly integrated multimodal system without the performance trade-offs that have plagued previous models.

Here are the key new contributions presented in the paper:

1. Overcoming the Modality-Performance Trade-off

This is the most significant contribution claimed by the authors. Prior multimodal models often faced a "modality trade-off," where improving performance in one modality (e.g., vision) would lead to a degradation in another (e.g., text). The report provides what it calls the "first evidence that fully integrated, end-to-end multimodal training can be achieved without degrading core language capability and other modalities" (page 16). Qwen3-Omni demonstrates that it can match the performance of same-sized, specialized unimodal models in text and vision while adding state-of-the-art audio capabilities.

2. State-of-the-Art Performance in Audio Modalities

While many models have focused on text and vision, Qwen3-Omni makes a major leap in audio processing. The abstract states that it achieves open-source state-of-the-art (SOTA) performance on 32 audio and audio-visual benchmarks and overall SOTA on 22 of them. It specifically outperforms strong closed-source models such as Gemini 2.5 Pro, Seed-ASR, and GPT-4o-Transcribe on numerous audio tasks. This is largely attributed to its new Audio Transformer (AuT) encoder, which was trained from scratch on 20 million hours of supervised audio data (page 4).

3. Architectural and System-Level Innovations

Compared to its predecessor (Qwen2.5-Omni) and other models, Qwen3-Omni introduces several key upgrades for better performance, scalability, and speed (page 2):

  • Mixture-of-Experts (MoE) Architecture: Both the "Thinker" (reasoning) and "Talker" (speech generation) components use MoE designs, which allows for high concurrency and faster inference, making the large model more efficient to serve.
  • Improved Speech Generation Pipeline: The model pairs a multi-codebook speech representation with a lightweight causal ConvNet (Code2Wav) in place of the heavier block-wise diffusion decoder used in Qwen2.5-Omni (see the sketch after this list). This enables very low-latency streaming, achieving an end-to-end first-packet latency of just 234 ms in a cold-start setting (page 1).
  • Unified Positional Encoding: It uses a refined Time-aligned Multimodal Rotary Position Embedding (TM-RoPE) that directly aligns audio and video representations to absolute timestamps, allowing for more flexible and robust handling of streaming inputs of arbitrary duration (page 5).
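For intuition on why a causal ConvNet supports streaming (the Code2Wav bullet above), the sketch below shows a causal 1-D convolution block in PyTorch: the input is padded only on the left, so each output frame depends solely on current and past frames, and audio can be emitted chunk by chunk. This is a generic illustration of the design principle, not the actual Code2Wav architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose outputs depend only on current and past frames."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation          # pad the past side only
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: [batch, channels, frames]
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Streaming property: the first 10 output frames are identical whether we
# process the full sequence or only its first 10 input frames.
block = CausalConv1d(channels=4)
x = torch.randn(1, 4, 20)
print(torch.allclose(block(x)[:, :, :10], block(x[:, :, :10])))  # True
```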

4. Introduction of Novel, Specialized Models

The research didn't just produce a single model but a suite of them, addressing specific needs in the research community:

  • A "Thinking" Model for Enhanced Reasoning: The report introduces Qwen3-Omni-30B-A3B-Thinking, a variant specifically designed to "explicitly reason over inputs from any modality," showing improved performance on complex reasoning tasks across text, vision, and audiovisual inputs (page 1).
  • A Dedicated Audio Captioning Model: The authors identify a lack of general-purpose audio captioning models in the research community. To fill this gap, they created Qwen3-Omni-30B-A3B-Captioner, which produces detailed and low-hallucination captions for any audio input (page 8).

In summary, the primary contribution of the Qwen3-Omni report is the demonstration that a single, unified model can achieve excellence across text, image, audio, and video without compromise, supported by novel architectural designs, SOTA performance in audio, and the release of new models for advanced reasoning and audio captioning.

How to handle different modalities?

Based on the "Qwen3-Omni Technical Report," the model handles different modalities through a unified architecture that processes multimodal inputs and generates outputs without performance degradation compared to specialized single-modal models.

Here's a breakdown of its approach:

Core Architecture: Thinker-Talker Model

Qwen3-Omni uses a "Thinker-Talker" architecture, as shown in Figure 2 on page 3. This design separates the tasks of understanding and responding:

  • The Thinker: This is a Mixture-of-Experts (MoE) Transformer responsible for processing and understanding inputs from all modalities (text, audio, vision). It performs the core reasoning and generates the textual part of the response.
  • The Talker: This is also an MoE Transformer, but its primary function is to generate streaming speech. It takes high-level representations from the Thinker to synthesize audio output that is contextually appropriate (a minimal dataflow sketch follows below).
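The toy sketch below illustrates this division of labor. The function and class names are placeholders invented for illustration, not the Qwen3-Omni API; the point is simply that the Thinker produces text plus high-level representations, and the Talker consumes those representations to stream audio packets.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class ThinkerOutput:
    text: str                    # textual part of the response
    hidden_states: List[float]   # stand-in for high-level representations

def thinker(user_input: str) -> ThinkerOutput:
    """Stand-in for the MoE Thinker: fuses the inputs, reasons, emits text + representations."""
    return ThinkerOutput(text=f"Answer to: {user_input}", hidden_states=[0.0] * 8)

def talker(out: ThinkerOutput) -> Iterator[bytes]:
    """Stand-in for the MoE Talker: streams speech packets conditioned on the Thinker output."""
    for i in range(0, len(out.text), 4):
        yield out.text[i:i + 4].encode()   # real packets would be audio codec frames

for packet in talker(thinker("hello")):
    print(packet)
```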

Processing Different Modalities (Perception)

The model uses specialized encoders to convert each type of input into a format the Thinker can understand, as detailed in Section 2.3 (page 4):

  • Audio: Audio inputs are first resampled to 16 kHz and converted into a 128-channel mel-spectrogram (see the preprocessing sketch after this list). The spectrogram is then processed by a dedicated Audio Transformer (AuT) encoder, a powerful model trained from scratch on 20 million hours of audio data.
  • Vision (Image and Video): For visual inputs, the model employs the vision encoder from Qwen3-VL, which was initialized from SigLIP2-So400m. This allows it to handle both still images and video frames. To preserve information in videos, it samples frames at a dynamic rate.
  • Text: Text inputs are processed using Qwen's tokenizer, which uses byte-level byte-pair encoding.
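As a concrete example of the audio front end described in the list above, the sketch below resamples a waveform to 16 kHz and computes a 128-channel mel-spectrogram with torchaudio. The 16 kHz rate and 128 mel bins come from the report; the FFT size and hop length are assumed defaults, not values the paper specifies.

```python
import torch
import torchaudio

def to_mel_spectrogram(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    """Resample to 16 kHz and compute a 128-channel mel-spectrogram.

    The FFT size and hop length below are illustrative defaults,
    not values taken from the Qwen3-Omni report.
    """
    target_sr = 16_000
    if orig_sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, orig_sr, target_sr)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=target_sr,
        n_fft=400,          # 25 ms analysis window at 16 kHz
        hop_length=160,     # 10 ms hop
        n_mels=128,
    )
    return mel(waveform)    # shape: [channels, 128, frames]

# Example with one second of synthetic audio at 44.1 kHz.
wav = torch.randn(1, 44_100)
print(to_mel_spectrogram(wav, 44_100).shape)  # torch.Size([1, 128, 101])
```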

Unifying Modalities with Positional Embedding

A key challenge in multimodal models is integrating information from different sources that have different structures (e.g., a sequence of text vs. a grid of pixels vs. a time-series of audio). Qwen3-Omni solves this using a Time-aligned Multimodal Rotary Position Embedding (TM-RoPE), as described on page 5.

This technique encodes the position of information not just in one dimension (like text) but across three: temporal, height, and width. For audiovisual streams, it aligns both the audio and video components to a consistent temporal resolution of 80 ms per temporal ID. This refined positional encoding allows the model to effectively "integrate and jointly model information from diverse modalities."
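A minimal sketch of the time-alignment idea, assuming the 80 ms-per-temporal-ID grid described above (the height/width axes and the rotary embedding itself are omitted): audio frames and video frames that occur at the same absolute time receive the same temporal ID, regardless of their native frame rates.

```python
# TM-RoPE-style time alignment: one temporal position ID per 80 ms of real time.
MS_PER_TEMPORAL_ID = 80

def temporal_id(timestamp_ms: int) -> int:
    """Absolute timestamp (ms) -> shared temporal position ID."""
    return timestamp_ms // MS_PER_TEMPORAL_ID

# Audio frames arriving every 80 ms and video frames sampled at 25 fps
# (every 40 ms) are mapped onto the same grid, so co-occurring audio and
# video tokens receive matching temporal IDs.
audio_ms = [0, 80, 160, 240, 320]
video_ms = [0, 40, 80, 120, 160, 200, 240, 280, 320]

print([temporal_id(t) for t in audio_ms])  # [0, 1, 2, 3, 4]
print([temporal_id(t) for t in video_ms])  # [0, 0, 1, 1, 2, 2, 3, 3, 4]
```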

In summary, Qwen3-Omni uses a modular system with specialized encoders for each modality and a sophisticated positional embedding technique to unify them into a common representational space. This allows the central "Thinker" model to reason across text, images, audio, and video simultaneously.