Implementation:Mlc ai Mlc llm Qwen3 Model

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	Machine Learning, Large Language Models, Model Architecture
Last Updated	2026-02-09 19:00 GMT

Overview

Implements the Alibaba Qwen3 transformer-based language model architecture for deployment through the MLC-LLM compilation pipeline using TVM Relax.

Description

This module defines the complete Qwen3 model architecture, a significant evolution from the original QWen. Qwen3 introduces grouped-query attention (GQA), QK normalization via per-head RMSNorm, configurable attention bias, configurable activation functions, optional weight tying, and support for FP8 block-wise quantization.
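As a rough illustration of the FP8 block-wise support mentioned above, the sketch below shows how a block size might be read from a quantization_config dictionary and how many scale entries a weight matrix would then need. The helper names (parse_weight_block_size, scale_grid_shape) are hypothetical, and the dictionary keys are an assumption about the checkpoint's JSON layout; this is not the module's actual parsing code.

```python
from typing import Any, Dict, Optional, Tuple


def parse_weight_block_size(
    quantization_config: Optional[Dict[str, Any]],
) -> Optional[Tuple[int, int]]:
    """Return (block_rows, block_cols) for a dynamic e4m3 FP8 config, else None.

    The key names below are assumptions about the JSON layout.
    """
    if not isinstance(quantization_config, dict):
        return None
    is_fp8_block = (
        quantization_config.get("quant_method") == "fp8"
        and quantization_config.get("fmt") == "e4m3"
        and quantization_config.get("activation_scheme") == "dynamic"
        and "weight_block_size" in quantization_config
    )
    if not is_fp8_block:
        return None
    block = quantization_config["weight_block_size"]
    if not isinstance(block, (list, tuple)) or len(block) != 2:
        raise ValueError(f"weight_block_size must have two entries, got {block!r}")
    return int(block[0]), int(block[1])


def scale_grid_shape(
    weight_shape: Tuple[int, int], block: Tuple[int, int]
) -> Tuple[int, int]:
    """One scale per block, so each dimension is ceil(dim / block_dim)."""
    rows, cols = weight_shape
    return -(-rows // block[0]), -(-cols // block[1])


example_config = {
    "quant_method": "fp8",
    "fmt": "e4m3",
    "activation_scheme": "dynamic",
    "weight_block_size": [128, 128],
}
block_size = parse_weight_block_size(example_config)
# A 4096 x 11008 weight with 128 x 128 blocks needs a 32 x 86 grid of scales.
grid = scale_grid_shape((4096, 11008), block_size)
```

Configurations that do not match the FP8 pattern yield None in this sketch, which corresponds to leaving weight_block_size unset.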

The module contains the following key classes:

Qwen3Config -- Configuration dataclass with Qwen3-specific parameters including hidden_act (activation function name), attention_bias flag, tie_word_embeddings, and optional weight_block_size for FP8 quantization. Supports parsing quantization_config from model JSON for dynamic FP8 quantization with block scaling.
Qwen3Attention -- Multi-head attention with GQA, fused QKV projection (c_attn), separate q_norm and k_norm (per-head RMSNorm applied to query and key tensors after projection), and configurable attention bias.
Qwen3Embedding -- A specialized embedding class that extends nn.Embedding with an lm_head_forward method for weight tying; it performs a transposed matmul against the embedding weights and returns float32 output.
Qwen3MLP -- A gated MLP with configurable activation function (via the ACT2FN dictionary mapping names to functions: gelu, relu, silu, swish, gelu_new) and a fused gate-up projection.
Qwen3DecoderLayer -- A standard pre-norm transformer decoder layer with RMSNorm, sequential attention-then-MLP, and tensor-parallel residual connections.
Qwen3Model -- The backbone consisting of Qwen3Embedding, decoder layers, and final RMSNorm.
Qwen3LMHeadModel -- The top-level model supporting optional weight tying (using Qwen3Embedding.lm_head_forward when tied), FP8 block quantization metadata, and the standard inference interface.
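The gated MLP described for Qwen3MLP can be sketched in plain Python as follows. This is a minimal illustration, not the TVM Relax implementation: the helper names (matvec, gated_mlp) are hypothetical, and the assumption that the gate half precedes the up half in the fused projection output is labeled in the code.

```python
import math
from typing import Callable, Dict, List


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_new(x: float) -> float:
    # Tanh approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Name-to-function map; "swish" is an alias of "silu".
ACT2FN: Dict[str, Callable[[float], float]] = {
    "gelu": gelu,
    "relu": relu,
    "silu": silu,
    "swish": silu,
    "gelu_new": gelu_new,
}


def matvec(weight: List[List[float]], x: List[float]) -> List[float]:
    """Multiply a [out, in] weight matrix by a vector of length in."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def gated_mlp(x, gate_up_weight, down_weight, hidden_act: str) -> List[float]:
    act = ACT2FN[hidden_act]
    fused = matvec(gate_up_weight, x)  # length 2 * intermediate_size
    half = len(fused) // 2
    # Assumption: the first half is the gate, the second half is the up branch.
    gate, up = fused[:half], fused[half:]
    hidden = [act(g) * u for g, u in zip(gate, up)]
    return matvec(down_weight, hidden)


# Toy sizes: hidden_size = 2, intermediate_size = 2.
x = [1.0, 2.0]
gate_up_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
down_weight = [[1.0, 0.0], [0.0, 1.0]]
out_relu = gated_mlp(x, gate_up_weight, down_weight, "relu")
out_silu = gated_mlp(x, gate_up_weight, down_weight, "silu")
```

Fusing the gate and up projections into one matrix means a single matmul produces both branches, which are then split before the activation is applied.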

Usage

Use this module when compiling and deploying Qwen3 models of any size through MLC-LLM. The module supports tensor parallelism, paged KV cache, and optional FP8 block-wise quantized weight loading.
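To make the tensor-parallel constraint concrete, the sketch below computes how many query and key/value heads each shard would hold and how wide the fused QKV projection becomes per shard. It assumes heads are split evenly across shards and that num_key_value_heads must be divisible by the shard count; the function names are hypothetical and not part of the module.

```python
from typing import Tuple


def heads_per_shard(
    num_attention_heads: int,
    num_key_value_heads: int,
    tensor_parallel_shards: int,
) -> Tuple[int, int]:
    """Return (query_heads, kv_heads) held by each shard."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be a multiple of num_key_value_heads")
    if num_key_value_heads % tensor_parallel_shards != 0:
        raise ValueError(
            f"num_key_value_heads ({num_key_value_heads}) must be divisible by "
            f"tensor_parallel_shards ({tensor_parallel_shards})"
        )
    return (
        num_attention_heads // tensor_parallel_shards,
        num_key_value_heads // tensor_parallel_shards,
    )


def fused_qkv_width(q_heads: int, kv_heads: int, head_dim: int) -> int:
    """Output width of a fused QKV projection: Q heads plus K heads plus V heads."""
    return (q_heads + 2 * kv_heads) * head_dim


# 32 query heads and 8 KV heads split over 2 shards.
q_heads, kv_heads = heads_per_shard(32, 8, 2)
width = fused_qkv_width(q_heads, kv_heads, head_dim=128)
```

With these toy numbers each shard holds 16 query heads and 4 KV heads, so every KV head is still shared by 4 query heads on the shard that owns it.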

Code Reference

Source Location

Repository: Mlc_ai_Mlc_llm
File: python/mlc_llm/model/qwen3/qwen3_model.py

Signature

@dataclasses.dataclass
class Qwen3Config(ConfigBase):
    hidden_act: str
    hidden_size: int
    intermediate_size: int
    attention_bias: bool
    num_attention_heads: int
    num_hidden_layers: int
    num_key_value_heads: int
    rms_norm_eps: float
    rope_theta: int
    vocab_size: int
    tie_word_embeddings: bool = False
    weight_block_size: Optional[Tuple[int, int]] = None
    ...

class Qwen3Attention(nn.Module):
    def __init__(self, config: Qwen3Config): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class Qwen3Embedding(nn.Embedding):
    def lm_head_forward(self, x: nn.Tensor): ...

class Qwen3MLP(nn.Module):
    def __init__(self, config: Qwen3Config): ...
    def forward(self, x: Tensor): ...

class Qwen3DecoderLayer(nn.Module):
    def __init__(self, config: Qwen3Config): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class Qwen3LMHeadModel(nn.Module):
    def __init__(self, config: Qwen3Config): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, input_embeds, paged_kv_cache, logit_positions=None): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.qwen3.qwen3_model import Qwen3Config, Qwen3LMHeadModel

I/O Contract

Method	Input	Output	Description
embed	input_ids: Tensor[seq_len] (int32)	Tensor[1, seq_len, hidden_size]	Converts token IDs to embeddings via Qwen3Embedding
prefill	input_embed: Tensor[1, seq_len, hidden_size], paged_kv_cache	(logits: Tensor[1, 1, vocab_size], paged_kv_cache)	Full prompt processing with last-token logit extraction; uses tied or separate LM head
decode	input_embed: Tensor[1, 1, hidden_size], paged_kv_cache	(logits, paged_kv_cache)	Single-token autoregressive decoding
batch_prefill	input_embeds, logit_positions, paged_kv_cache	(logits, paged_kv_cache)	Batched prefill with selective logit extraction
batch_decode	input_embeds: Tensor[batch_size, 1, hidden_size], paged_kv_cache	(logits, paged_kv_cache)	Batched single-token decoding
batch_verify	input_embeds, paged_kv_cache	(logits, paged_kv_cache)	Batched speculative verification
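The prefill row above combines two steps: selecting the hidden state of the last token and projecting it to vocabulary logits. A minimal sketch with a tied LM head, using nested lists instead of tensors, might look like this. The function prefill_logits is a hypothetical helper, and dtype handling (the float32 output) is not modeled.

```python
from typing import List


def lm_head_forward(hidden: List[float], embedding: List[List[float]]) -> List[float]:
    """Tied LM head: logits[v] = dot(hidden, embedding[v]), i.e. hidden @ embedding^T."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]


def prefill_logits(
    hidden_states: List[List[float]], embedding: List[List[float]]
) -> List[float]:
    """Project only the final position, since prefill needs just the next-token logits."""
    last_hidden = hidden_states[-1]
    return lm_head_forward(last_hidden, embedding)


# Toy sizes: vocab_size = 3, hidden_size = 2, seq_len = 3.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states = [[0.5, 0.5], [1.0, 2.0], [2.0, -1.0]]
logits = prefill_logits(hidden_states, embedding)
```

Projecting only the last position avoids computing seq_len rows of vocab_size logits that prefill would discard.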

Architectural Feature	Details
QK Normalization	Per-head RMSNorm on query (q_norm) and key (k_norm) tensors after projection and before attention
Grouped-Query Attention	Separate num_attention_heads and num_key_value_heads; num_key_value_heads must be divisible by tensor_parallel_shards
Configurable Activation	ACT2FN map supporting gelu, relu, silu/swish, gelu_new
Weight Tying	Optional tie_word_embeddings using Qwen3Embedding.lm_head_forward
FP8 Quantization	Optional weight_block_size for e4m3 dynamic FP8 block-wise quantization
Configurable Bias	attention_bias parameter controls whether QKV and output projections include bias
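The QK normalization row can be illustrated with a small pure-Python sketch. It assumes the common RMSNorm formulation, x / sqrt(mean(x^2) + eps) * weight, with one weight vector of length head_dim shared by all heads; the helper names are hypothetical and the actual module operates on TVM Relax tensors.

```python
import math
from typing import List


def rms_norm(vec: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """RMSNorm over a single head vector."""
    mean_sq = sum(v * v for v in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(vec, weight)]


def per_head_norm(
    heads: List[List[float]], weight: List[float], eps: float = 1e-6
) -> List[List[float]]:
    """Normalize each head independently with the same head_dim-sized weight."""
    return [rms_norm(head, weight, eps) for head in heads]


# One token, two query heads with head_dim = 4; the second head is 10x the first.
q_heads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
q_norm_weight = [1.0, 1.0, 1.0, 1.0]
q_normed = per_head_norm(q_heads, q_norm_weight)

# Root-mean-square of each normalized head (close to 1 for unit weights).
rms_after = [math.sqrt(sum(v * v for v in head) / len(head)) for head in q_normed]
```

Because each head is normalized on its own, the two heads above end up nearly identical despite the 10x difference in scale, which keeps the magnitude of query-key dot products bounded before attention.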

Usage Examples

from mlc_llm.model.qwen3.qwen3_model import Qwen3Config, Qwen3LMHeadModel

# Qwen3 model configuration
config = Qwen3Config(
    hidden_act="silu",
    hidden_size=4096,
    intermediate_size=11008,
    attention_bias=True,
    num_attention_heads=32,
    num_hidden_layers=32,
    num_key_value_heads=8,
    rms_norm_eps=1e-6,
    rope_theta=1000000,
    vocab_size=151936,
    context_window_size=32768,
)

model = Qwen3LMHeadModel(config)
model.to("float16")
spec = model.get_default_spec()

# With weight tying enabled
config_tied = Qwen3Config(
    hidden_act="silu",
    hidden_size=2048,
    intermediate_size=5504,
    attention_bias=False,
    num_attention_heads=16,
    num_hidden_layers=24,
    num_key_value_heads=4,
    rms_norm_eps=1e-6,
    rope_theta=1000000,
    vocab_size=151936,
    tie_word_embeddings=True,
    context_window_size=32768,
)
model_tied = Qwen3LMHeadModel(config_tied)
