Implementation: Alibaba ROLL Qwen3OmniMoeModel
| Knowledge Sources | Details |
|---|---|
| Domains | Model_Architecture, Multimodal, Vision_Language |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Qwen3 Omni multimodal MoE model implementation supporting audio, image, and video inputs for distributed training with Megatron-Core.
Description
modeling_qwen3_omni.py implements Qwen3OmniMoeModel, a multimodal Mixture-of-Experts model that extends Qwen3VLModel to support audio modality in addition to images and videos. The model is registered as qwen3_omni_moe via the @register_model decorator.
The architecture consists of the following components (a per-stage layout sketch follows the list):
- Audio encoder: A Qwen3OmniMoeAudioEncoder initialized on the first pipeline stage (pre_process) that processes raw audio features into embeddings. Uses SDPA attention and supports gradient checkpointing for memory optimization.
- Vision encoder: A Qwen3OmniMoeVisionEncoder also on the first pipeline stage, inherited from Qwen3VLModel, that processes images and videos with deepstack visual embeddings.
- Language model: The Megatron-Core GPT decoder inherited from Qwen3VLGPTModel.
- Optional talker and code2wav modules: Instantiated on the last pipeline stage (post_process) when enable_audio_output is enabled; used for text-to-speech generation.
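The per-pipeline-stage layout described above can be pictured with a minimal, self-contained sketch; the nn.Identity modules are stand-ins for the real encoders, decoder, and audio-output heads, and the flag names mirror this description rather than the actual constructor code.
import torch.nn as nn
class StageLayoutSketch(nn.Module):
    # Hypothetical stand-in: the real __init__ builds Qwen3OmniMoeAudioEncoder,
    # Qwen3OmniMoeVisionEncoder, the Megatron-Core GPT decoder, talker, and code2wav.
    def __init__(self, pre_process: bool, post_process: bool, enable_audio_output: bool):
        super().__init__()
        self.audio_encoder = nn.Identity() if pre_process else None   # first pipeline stage only
        self.vision_encoder = nn.Identity() if pre_process else None  # first pipeline stage only
        self.decoder = nn.Identity()                                   # every stage owns its decoder slice
        if post_process and enable_audio_output:                       # last pipeline stage, optional TTS path
            self.talker = nn.Identity()
            self.code2wav = nn.Identity()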
The construct_inputs_embeds method (lines 88-226) is the core multimodal fusion function; as sketched after this list, it:
- Delegates image/video processing to the parent Qwen3VLModel.construct_inputs_embeds
- Processes audio features by extracting relevant audio segments based on input_ranges (sub-sequences assigned to this pipeline/sequence parallel rank)
- Runs the audio encoder on collected features
- Scatters audio embeddings back into the combined input embedding tensor using masked_scatter
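The masked_scatter step can be illustrated with a small, self-contained torch example; the placeholder token id and shapes below are made up, and the real method scatters the audio encoder's output rather than an arange tensor.
import torch
AUDIO_TOKEN_ID = 99                                   # hypothetical placeholder id
input_ids = torch.tensor([[5, 99, 99, 7, 99, 8]])     # (batch=1, seq=6), three audio placeholders
inputs_embeds = torch.zeros(1, 6, 4)                  # (batch, seq, hidden)
audio_embeds = torch.arange(12, dtype=torch.float32).reshape(3, 4)  # one row per placeholder
# Mask the audio placeholder positions and copy the audio embeddings into them, in order.
audio_mask = (input_ids == AUDIO_TOKEN_ID).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(audio_mask, audio_embeds)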
The audio processing logic handles several complex scenarios (a simplified bookkeeping sketch follows the list):
- Multiple audio segments across batch samples
- Audio features split across sub-ranges in pipeline parallel settings
- Deduplication of audio features already processed in previous sub-ranges
- Proper index tracking for feature-to-embedding mapping
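A much-simplified sketch of that bookkeeping is shown below, assuming a made-up placeholder id and two sub-ranges; the real logic additionally handles audio segments that straddle sub-range boundaries, multiple batch samples, and per-sample feature offsets.
import torch
AUDIO_TOKEN_ID = 99                                      # hypothetical placeholder id
input_ids = torch.tensor([5, 99, 99, 99, 7, 99, 99, 8])  # one flattened sample
sub_ranges = [[0, 4], [4, 8]]                            # e.g. per-rank sub-sequences
consumed = 0  # audio placeholder tokens already covered by earlier sub-ranges
for start, end in sub_ranges:
    in_range = int((input_ids[start:end] == AUDIO_TOKEN_ID).sum())
    # Placeholders before `start` were handled in a previous sub-range, so only
    # audio feature rows [consumed, consumed + in_range) are needed here.
    print(f"range [{start}, {end}) -> feature rows {consumed}..{consumed + in_range}")
    consumed += in_range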
The forward method (lines 228-305) handles the following; the context-parallel slicing step is sketched after the list:
- Automatic rope index computation for all modalities (image, video, audio)
- Context parallelism batch slicing
- Vision and audio feature injection on the first pipeline stage
- Fallback to standard decoder forward on non-first pipeline stages
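Of these steps, the context-parallel slicing is the easiest to picture in isolation. The sketch below assumes a plain even split of the sequence dimension; Megatron-Core's actual context-parallel layout may chunk differently (e.g. for load balancing), so treat this only as an illustration of how each rank ends up with a sub-sequence and a matching entry of input_ranges.
import torch
cp_size, cp_rank = 2, 0                                 # hypothetical context-parallel world
input_ids = torch.arange(16).unsqueeze(0)               # (batch=1, seq=16)
chunk = input_ids.size(1) // cp_size
local_ids = input_ids[:, cp_rank * chunk:(cp_rank + 1) * chunk]
local_range = [cp_rank * chunk, (cp_rank + 1) * chunk]  # analogous to one input_ranges entry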
Usage
Use this model class for training multimodal MoE models that process audio, images, and video simultaneously. It is instantiated via AutoModel using the qwen3_omni_moe model type registration.
Code Reference
Source Location
- Repository: Alibaba_ROLL
- File: mcore_adapter/src/mcore_adapter/models/qwen3_omni/modeling_qwen3_omni.py
- Lines: 1-305
Signature
@register_model("qwen3_omni_moe")
class Qwen3OmniMoeModel(Qwen3VLModel):
config_class = Qwen3OmniMoeConfig
Key Methods
__init__
def __init__(self, config: "Qwen3OmniMoeConfig", **kwargs) # lines 16-86
Initializes the model with the following; the parameter tagging in the last item is sketched after the list:
- Parent Qwen3VLGPTModel initialization
- Audio encoder (Qwen3OmniMoeAudioEncoder) on the first pipeline stage with SDPA attention and gradient checkpointing for full recomputation
- Vision encoder (Qwen3OmniMoeVisionEncoder) on the first pipeline stage with optional gradient checkpointing
- Optional talker and code2wav modules on the last pipeline stage for audio output
- Rope index computation methods bound from the HF Qwen3OmniMoePreTrainedModelForConditionalGeneration
- All encoder parameters marked with the sequence_parallel attribute
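The parameter tagging in the last bullet follows the Megatron-Core convention of attaching a sequence_parallel attribute to parameters so their gradients are reduced correctly when sequence parallelism is on. A minimal, self-contained sketch of that tagging (the module and flag value are placeholders; the value the real code assigns depends on how the encoders interact with sequence parallelism):
import torch.nn as nn
encoder = nn.Linear(8, 8)                      # placeholder for the audio/vision encoder
for param in encoder.parameters():
    setattr(param, "sequence_parallel", True)  # hypothetical value; mirrors the description above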
construct_inputs_embeds
def construct_inputs_embeds(
self,
input_ids: "torch.LongTensor",
inputs_embeds: "torch.FloatTensor",
pixel_values: "torch.Tensor",
grid_thw: "torch.LongTensor",
pixel_values_videos: "torch.Tensor",
video_grid_thw: "torch.LongTensor",
input_features: "torch.Tensor",
feature_lens: "torch.Tensor",
feature_attention_mask: "torch.Tensor",
input_ranges: List[List[int]],
image_token_id: int,
video_token_id: int,
audio_token_id: int,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]] # lines 88-226
Processes the provided modalities and merges them into a single embedding tensor. Images and videos currently cannot be processed in the same call (an assertion enforces this). Returns (inputs_embeds, visual_pos_masks, deepstack_visual_embeds).
forward
def forward(
self,
input_ids: "torch.Tensor",
position_ids: Optional["torch.Tensor"] = None,
attention_mask: Optional["torch.Tensor"] = None,
decoder_input: Optional["torch.Tensor"] = None,
labels: Optional["torch.Tensor"] = None,
pixel_values: Optional["torch.Tensor"] = None,
pixel_values_videos: Optional["torch.Tensor"] = None,
image_grid_thw: Optional["torch.LongTensor"] = None,
video_grid_thw: Optional["torch.LongTensor"] = None,
input_features: Optional["torch.Tensor"] = None,
feature_attention_mask: Optional["torch.Tensor"] = None,
**kwargs,
) -> "torch.Tensor" # lines 228-305
Full forward pass. Computes rope indices for all modalities, applies context parallelism, embeds tokens, injects visual and audio features, then runs the transformer decoder.
Import
import torch
from megatron.core import mpu
from mcore_adapter.models.qwen3_omni.modeling_qwen3_omni import Qwen3OmniMoeModel
from mcore_adapter.models.qwen3_omni.config_qwen3_omni import Qwen3OmniMoeConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.LongTensor | Yes | Token IDs with special tokens for image, video, and audio placeholders |
| pixel_values | torch.Tensor | No | Image pixel values for vision encoder |
| pixel_values_videos | torch.Tensor | No | Video pixel values for vision encoder |
| image_grid_thw | torch.LongTensor | No | Grid dimensions (temporal, height, width) for images |
| video_grid_thw | torch.LongTensor | No | Grid dimensions for videos |
| input_features | torch.Tensor | No | Audio features with shape (batch, frequency, frames) |
| feature_attention_mask | torch.Tensor | No | Attention mask for audio features |
| labels | torch.Tensor | No | Target labels for language modeling loss |
| attention_mask | torch.Tensor | No | Attention mask for the sequence |
| position_ids | torch.Tensor | No | Position IDs (auto-computed from rope index if not provided) |
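The sizes below are illustrative placeholders showing how the optional inputs relate; the exact layouts (mel-bin count for input_features, flattened-patch format for pixel_values, and grid dimensions) come from the Qwen3-Omni processor and should be taken from its output rather than from this sketch.
import torch
batch, seq_len, frames, mel_bins = 1, 32, 100, 128           # placeholder sizes
input_ids = torch.randint(0, 1000, (batch, seq_len))          # would contain modality placeholder tokens
input_features = torch.randn(batch, mel_bins, frames)         # audio features: (batch, frequency, frames)
feature_attention_mask = torch.ones(batch, frames, dtype=torch.long)
labels = input_ids.clone()                                     # language-modeling targets
# pixel_values / pixel_values_videos and the matching *_grid_thw tensors are likewise
# produced by the processor; their flattened-patch layout depends on the vision config.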
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch.Tensor | Loss or logits on the last pipeline stage (depending on whether labels are provided); hidden states on earlier pipeline stages |
Usage Examples
from mcore_adapter.models import AutoModel
from mcore_adapter.training_args import TrainingArguments
# Load Qwen3-Omni MoE model with distributed training
args = TrainingArguments(
tensor_model_parallel_size=4,
pipeline_model_parallel_size=2,
expert_model_parallel_size=2,
bf16=True,
output_dir="/tmp/output",
)
model = AutoModel.from_pretrained("Qwen/Qwen3-Omni-MoE", args)
# Forward pass with multimodal inputs
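# input_ids, pixel_values, image_grid_thw, audio_features, audio_mask and labels are
# assumed to be produced by the Qwen3-Omni processor / data collator (placeholders here)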
output = model(
input_ids=input_ids,
pixel_values=pixel_values,
image_grid_thw=image_grid_thw,
input_features=audio_features,
feature_attention_mask=audio_mask,
labels=labels,
)