Implementation: Alibaba ROLL Qwen3OmniMoeConfig
| Knowledge Sources | Details |
|---|---|
| Domains | Configuration, Multimodal |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
Configuration dataclass for the Qwen3 Omni Mixture-of-Experts model, defining multimodal parameters for vision, audio, and speech output.
Description
Qwen3OmniMoeConfig extends McaModelConfig to provide model-specific configuration for the Qwen3-Omni multimodal MoE architecture. It is registered with the auto-config registry under the model type "qwen3_omni_moe" via the @register_config decorator.
The config defines special token IDs for multimodal content:
- audio_token_id (default 151675), image_token_id (default 151655), video_token_id (default 151656) for identifying multimodal tokens in sequences
- audio_start_token_id (default 151669) and vision_start_token_id (default 151652) for marking the start of audio and vision segments
- position_id_per_seconds (default 13) for audio position encoding
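As a hedged illustration of how these IDs are used downstream (the helper and sample sequence below are hypothetical, not code from the repository), a preprocessor can scan a token sequence for the placeholder IDs to locate multimodal content:

```python
# Hypothetical sketch: locating multimodal placeholder tokens in a token
# sequence using the default IDs documented above. Not part of mcore_adapter.
IMAGE_TOKEN_ID = 151655         # image_token_id default
AUDIO_TOKEN_ID = 151675         # audio_token_id default
VISION_START_TOKEN_ID = 151652  # vision_start_token_id default

def find_token_positions(input_ids, token_id):
    """Return the indices where a given placeholder token occurs."""
    return [i for i, t in enumerate(input_ids) if t == token_id]

# A made-up sequence: text tokens interleaved with a vision segment.
seq = [1, 2, VISION_START_TOKEN_ID, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 3]
print(find_token_positions(seq, IMAGE_TOKEN_ID))  # -> [3, 4]
```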
Optional sub-model configurations are stored as dictionaries:
- vision_config: Vision encoder configuration
- audio_config: Audio model configuration
- talker_config: Talker model configuration for speech output
- code2wav_config: Code-to-waveform model configuration
- rope_scaling: Rotary position embedding scaling parameters (must contain mrope_section)
The __post_init__ method performs three steps:
- Converts any PretrainedConfig sub-configs to plain dictionaries
- Instantiates a Qwen3OmniMoeVisionEncoderConfig from the vision config to compute the derived values merge_size and pixel_values_dim
- Extracts mrope_section from the rope_scaling dictionary
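The post-init normalization can be sketched with a minimal stand-in dataclass. Everything here is an assumption for illustration: FakePretrainedConfig mimics only the to_dict method of transformers' PretrainedConfig, the vision-config derivation is elided, and none of this is the repository's implementation:

```python
from dataclasses import dataclass, field
from typing import Optional


class FakePretrainedConfig:
    """Stand-in for transformers.PretrainedConfig (to_dict mirrors the real API)."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def to_dict(self):
        return dict(self.__dict__)


@dataclass
class OmniConfigSketch:
    """Hypothetical sketch of Qwen3OmniMoeConfig's __post_init__ normalization."""
    vision_config: Optional[object] = field(default=None)
    rope_scaling: Optional[dict] = field(default=None)

    def __post_init__(self):
        # Step 1: normalize PretrainedConfig sub-configs to plain dicts.
        if hasattr(self.vision_config, "to_dict"):
            self.vision_config = self.vision_config.to_dict()
        # Step 2 (elided): build Qwen3OmniMoeVisionEncoderConfig to derive
        # merge_size and pixel_values_dim from the vision config.
        # Step 3: pull the multi-RoPE section sizes out of rope_scaling.
        if self.rope_scaling is not None:
            self.mrope_section = self.rope_scaling["mrope_section"]


cfg = OmniConfigSketch(
    vision_config=FakePretrainedConfig(patch_size=16, in_channels=3),
    rope_scaling={"mrope_section": [16, 24, 24]},
)
print(type(cfg.vision_config))  # <class 'dict'>
print(cfg.mrope_section)        # [16, 24, 24]
```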
Usage
Use this config when working with Qwen3-Omni MoE models in the mcore_adapter framework. It is typically loaded via AutoConfig.from_pretrained rather than instantiated directly. The derived fields (merge_size, pixel_values_dim, mrope_section) are used by the model's vision and position encoding components.
Code Reference
Source Location
- Repository: Alibaba_ROLL
- File: mcore_adapter/src/mcore_adapter/models/qwen3_omni/config_qwen3_omni.py
- Lines: 1-69
Signature
@register_config("qwen3_omni_moe")
@dataclass
class Qwen3OmniMoeConfig(McaModelConfig):
audio_token_id: int = 151675
image_token_id: int = 151655
video_token_id: int = 151656
position_id_per_seconds: int = 13
audio_start_token_id: int = 151669
vision_start_token_id: int = 151652
vision_config: Optional[dict] = field(default=None)
audio_config: Optional[dict] = field(default=None)
enable_audio_output: bool = False
talker_config: Optional[dict] = field(default=None)
code2wav_config: Optional[dict] = field(default=None)
rope_scaling: Optional[dict] = field(default=None)
def __post_init__(self) -> None: ...
Import
from mcore_adapter.models.qwen3_omni.config_qwen3_omni import Qwen3OmniMoeConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| audio_token_id | int | No | Token ID for audio content (default 151675) |
| image_token_id | int | No | Token ID for image content (default 151655) |
| video_token_id | int | No | Token ID for video content (default 151656) |
| position_id_per_seconds | int | No | Number of position IDs allotted per second of audio (default 13) |
| audio_start_token_id | int | No | Token ID marking start of audio segment (default 151669) |
| vision_start_token_id | int | No | Token ID marking start of vision segment (default 151652) |
| vision_config | Optional[dict] | No | Vision encoder configuration dictionary |
| audio_config | Optional[dict] | No | Audio model configuration dictionary |
| enable_audio_output | bool | No | Whether audio output is enabled (default False) |
| talker_config | Optional[dict] | No | Talker model configuration dictionary |
| code2wav_config | Optional[dict] | No | Code-to-waveform model configuration dictionary |
| rope_scaling | Optional[dict] | No | RoPE scaling config; must contain "mrope_section" key |
Outputs
| Name | Type | Description |
|---|---|---|
| (instance) | Qwen3OmniMoeConfig | Config instance with all fields set and derived values computed |
| merge_size | int | Spatial merge size from vision config (computed in __post_init__) |
| pixel_values_dim | int | Pixel input dimension: patch_size^2 * in_channels * temporal_patch_size (computed in __post_init__) |
| mrope_section | list | Multi-RoPE section sizes extracted from rope_scaling (computed in __post_init__) |
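To make the pixel_values_dim formula concrete, here is a worked example. The parameter values below are illustrative picks typical of patch-based vision encoders, not values read from an actual Qwen3-Omni checkpoint:

```python
# Worked example of the pixel_values_dim formula from the Outputs table:
#   pixel_values_dim = patch_size^2 * in_channels * temporal_patch_size
# The values below are illustrative, not checkpoint defaults.
patch_size = 16          # side length of a square image patch
in_channels = 3          # RGB input channels
temporal_patch_size = 2  # frames grouped per temporal patch

pixel_values_dim = patch_size ** 2 * in_channels * temporal_patch_size
print(pixel_values_dim)  # 16*16*3*2 = 1536
```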
Usage Examples
from mcore_adapter.models.auto.config_auto import AutoConfig
# Load config from a Qwen3-Omni checkpoint
config = AutoConfig.from_pretrained("/path/to/qwen3-omni-moe")
# Access multimodal token IDs
print(f"Image token: {config.image_token_id}")
print(f"Audio token: {config.audio_token_id}")
# Access derived vision parameters
print(f"Pixel values dim: {config.pixel_values_dim}")
print(f"Merge size: {config.merge_size}")
print(f"mRoPE section: {config.mrope_section}")