
Implementation:Alibaba ROLL Qwen3OmniMoeConfig

From Leeroopedia


Knowledge Sources
Domains Configuration, Multimodal
Last Updated 2026-02-07 20:00 GMT

Overview

Configuration dataclass for the Qwen3-Omni Mixture-of-Experts (MoE) model, defining multimodal parameters for vision, audio, and speech output.

Description

Qwen3OmniMoeConfig extends McaModelConfig to provide model-specific configuration for the Qwen3-Omni multimodal MoE architecture. It is registered with the auto-config registry under the model type "qwen3_omni_moe" via the @register_config decorator.

The config defines special token IDs for multimodal content:

  • audio_token_id (default 151675), image_token_id (default 151655), video_token_id (default 151656) for identifying multimodal tokens in sequences
  • audio_start_token_id (default 151669) and vision_start_token_id (default 151652) for marking the start of audio and vision segments
  • position_id_per_seconds (default 13) for audio position encoding
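The token IDs above let the model locate multimodal content inside a token sequence. A minimal illustration, using a hypothetical input sequence (the IDs are the documented defaults; the scanning logic is a sketch, not the library's actual implementation):

```python
# Defaults documented above.
VISION_START_TOKEN_ID = 151652
IMAGE_TOKEN_ID = 151655

# Hypothetical token sequence: vision-start marker followed by image tokens.
input_ids = [1, 2, VISION_START_TOKEN_ID, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 7, 8]

# Positions occupied by image placeholder tokens.
image_positions = [i for i, t in enumerate(input_ids) if t == IMAGE_TOKEN_ID]
print(image_positions)  # [3, 4, 5]
```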

Optional sub-model configurations are stored as dictionaries:

  • vision_config: Vision encoder configuration
  • audio_config: Audio model configuration
  • talker_config: Talker model configuration for speech output
  • code2wav_config: Code-to-waveform model configuration
  • rope_scaling: Rotary position embedding scaling parameters (must contain mrope_section)

The __post_init__ method converts any PretrainedConfig sub-configs to dictionaries, instantiates a Qwen3OmniMoeVisionEncoderConfig from the vision config to compute derived values (merge_size, pixel_values_dim), and extracts mrope_section from the rope_scaling dictionary.
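The derived-value computation can be sketched as follows. The vision-config key names and values here are illustrative assumptions; in the real class this work is delegated to Qwen3OmniMoeVisionEncoderConfig inside __post_init__:

```python
# Hypothetical vision config; key names and values are assumptions.
vision_config = {
    "patch_size": 14,
    "in_channels": 3,
    "temporal_patch_size": 2,
    "spatial_merge_size": 2,
}

# merge_size is taken from the vision config's spatial merge size.
merge_size = vision_config["spatial_merge_size"]

# pixel_values_dim = patch_size^2 * in_channels * temporal_patch_size
pixel_values_dim = (
    vision_config["patch_size"] ** 2
    * vision_config["in_channels"]
    * vision_config["temporal_patch_size"]
)
print(merge_size, pixel_values_dim)  # 2 1176
```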

Usage

Use this config when working with Qwen3-Omni MoE models in the mcore_adapter framework. It is typically loaded via AutoConfig.from_pretrained rather than instantiated directly. The derived fields (merge_size, pixel_values_dim, mrope_section) are used by the model's vision and position encoding components.

Code Reference

Source Location

Signature

@register_config("qwen3_omni_moe")
@dataclass
class Qwen3OmniMoeConfig(McaModelConfig):
    audio_token_id: int = 151675
    image_token_id: int = 151655
    video_token_id: int = 151656
    position_id_per_seconds: int = 13
    audio_start_token_id: int = 151669
    vision_start_token_id: int = 151652
    vision_config: Optional[dict] = field(default=None)
    audio_config: Optional[dict] = field(default=None)
    enable_audio_output: bool = False
    talker_config: Optional[dict] = field(default=None)
    code2wav_config: Optional[dict] = field(default=None)
    rope_scaling: Optional[dict] = field(default=None)

    def __post_init__(self) -> None: ...

Import

from mcore_adapter.models.qwen3_omni.config_qwen3_omni import Qwen3OmniMoeConfig

I/O Contract

Inputs

Name Type Required Description
audio_token_id int No Token ID for audio content (default 151675)
image_token_id int No Token ID for image content (default 151655)
video_token_id int No Token ID for video content (default 151656)
position_id_per_seconds int No Number of position IDs per second of audio (default 13)
audio_start_token_id int No Token ID marking start of audio segment (default 151669)
vision_start_token_id int No Token ID marking start of vision segment (default 151652)
vision_config Optional[dict] No Vision encoder configuration dictionary
audio_config Optional[dict] No Audio model configuration dictionary
enable_audio_output bool No Whether audio output is enabled (default False)
talker_config Optional[dict] No Talker model configuration dictionary
code2wav_config Optional[dict] No Code-to-waveform model configuration dictionary
rope_scaling Optional[dict] No RoPE scaling config; must contain "mrope_section" key

Outputs

Name Type Description
(instance) Qwen3OmniMoeConfig Config instance with all fields set and derived values computed
merge_size int Spatial merge size from vision config (computed in __post_init__)
pixel_values_dim int Pixel input dimension: patch_size^2 * in_channels * temporal_patch_size (computed in __post_init__)
mrope_section list Multi-RoPE section sizes extracted from rope_scaling (computed in __post_init__)
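The last row's relationship between rope_scaling and mrope_section can be sketched as below; the three-way section split is a hypothetical example, and the check mirrors the "must contain mrope_section" requirement stated above:

```python
# Hypothetical rope_scaling dict; the section sizes are illustrative.
rope_scaling = {"mrope_section": [16, 24, 24]}

# Extraction as described: rope_scaling, when provided, must carry
# the "mrope_section" key used by multi-RoPE position encoding.
if "mrope_section" not in rope_scaling:
    raise ValueError('rope_scaling must contain "mrope_section"')
mrope_section = rope_scaling["mrope_section"]
print(mrope_section)  # [16, 24, 24]
```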

Usage Examples

from mcore_adapter.models.auto.config_auto import AutoConfig

# Load config from a Qwen3-Omni checkpoint
config = AutoConfig.from_pretrained("/path/to/qwen3-omni-moe")

# Access multimodal token IDs
print(f"Image token: {config.image_token_id}")
print(f"Audio token: {config.audio_token_id}")

# Access derived vision parameters
print(f"Pixel values dim: {config.pixel_values_dim}")
print(f"Merge size: {config.merge_size}")
print(f"mRoPE section: {config.mrope_section}")
