Implementation:Mlc ai Mlc llm Qwen3 Model
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Large Language Models, Model Architecture |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Implements the Alibaba Qwen3 transformer-based language model architecture for deployment through the MLC-LLM compilation pipeline using TVM Relax.
Description
This module defines the complete Qwen3 model architecture. Relative to the original Qwen architecture, Qwen3 adds grouped-query attention (GQA), QK normalization via per-head RMSNorm, configurable attention bias, a configurable activation function, optional weight tying, and support for FP8 block-wise quantization.
The module contains the following key classes:
- Qwen3Config -- Configuration dataclass with Qwen3-specific parameters including hidden_act (activation function name), attention_bias flag, tie_word_embeddings, and optional weight_block_size for FP8 quantization. Supports parsing quantization_config from model JSON for dynamic FP8 quantization with block scaling.
- Qwen3Attention -- Multi-head attention with GQA, fused QKV projection (c_attn), separate q_norm and k_norm (per-head RMSNorm applied to query and key tensors after projection), and configurable attention bias.
- Qwen3Embedding -- A specialized embedding class that extends nn.Embedding with a lm_head_forward method for weight tying, performing a transposed matmul with float32 output.
- Qwen3MLP -- A gated MLP with configurable activation function (via the ACT2FN dictionary mapping names to functions: gelu, relu, silu, swish, gelu_new) and a fused gate-up projection.
- Qwen3DecoderLayer -- A standard pre-norm transformer decoder layer with RMSNorm, sequential attention-then-MLP, and tensor-parallel residual connections.
- Qwen3Model -- The backbone consisting of Qwen3Embedding, decoder layers, and final RMSNorm.
- Qwen3LMHeadModel -- The top-level model supporting optional weight tying (using Qwen3Embedding.lm_head_forward when tied), FP8 block quantization metadata, and the standard inference interface.
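To make the QK normalization step concrete, here is a minimal pure-Python sketch of per-head RMSNorm. It illustrates the idea and is not the module's TVM Relax code: the helper names `rms_norm` and `per_head_rms_norm` are hypothetical, and the sketch assumes a single learned weight vector of length head_dim shared by all heads.

```python
import math


def rms_norm(vec, weight, eps=1e-6):
    """Scale one vector so its root-mean-square is ~1, then apply the weight."""
    mean_sq = sum(v * v for v in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(vec, weight)]


def per_head_rms_norm(heads, weight, eps=1e-6):
    """Normalize each head independently (hypothetical helper).

    heads:  list of per-head vectors, each of length head_dim
    weight: assumed to be one vector of length head_dim shared by all heads
    """
    return [rms_norm(head, weight, eps) for head in heads]


# Two query heads with head_dim = 2 and very different magnitudes.
q_heads = [[3.0, 4.0], [0.5, 0.5]]
weight = [1.0, 1.0]
q_normed = per_head_rms_norm(q_heads, weight)
```

After normalization every head has a root-mean-square close to 1 regardless of its original magnitude. The same operation would be applied to the key heads through k_norm before the attention scores are computed.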
Usage
Use this module when compiling and deploying Qwen3 models through MLC-LLM. The same module covers the different Qwen3 model sizes. It supports tensor parallelism, a paged KV cache, and optional loading of FP8 block-wise quantized weights.
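As a rough illustration of what block-wise scaling means, the sketch below computes one scale per tile of a 2-D weight matrix. It is a simplified, pure-Python illustration and not MLC-LLM's quantization code: the function name `block_scales` is hypothetical, the convention `scale = max_abs / 448` is an assumption (448 is the largest finite e4m3 value), and no actual FP8 rounding is performed.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3


def block_scales(weight, block_size):
    """Return one scale per (block_rows x block_cols) tile of a 2-D weight.

    Hypothetical helper; the scale convention is an assumption for illustration.
    """
    block_rows, block_cols = block_size
    n_rows, n_cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row_scales = []
        for c0 in range(0, n_cols, block_cols):
            # Largest magnitude inside this tile (edge tiles may be smaller).
            max_abs = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            # Avoid a zero scale for an all-zero tile.
            row_scales.append(max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0)
        scales.append(row_scales)
    return scales


weight = [
    [0.5, -1.0, 8.0, 2.0],
    [0.25, 0.75, -4.0, 1.0],
]
# One scale per 2x2 tile, analogous to a weight_block_size of (2, 2).
scales = block_scales(weight, block_size=(2, 2))
```

Dividing each weight by the scale of its tile maps the tile into the representable FP8 range, and multiplying by the scale after the matmul recovers the original magnitude. Per-tile scales keep a single large value from costing precision in the rest of the matrix.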
Code Reference
Source Location
- Repository: Mlc_ai_Mlc_llm
- File: python/mlc_llm/model/qwen3/qwen3_model.py
Signature
@dataclasses.dataclass
class Qwen3Config(ConfigBase):
    hidden_act: str
    hidden_size: int
    intermediate_size: int
    attention_bias: bool
    num_attention_heads: int
    num_hidden_layers: int
    num_key_value_heads: int
    rms_norm_eps: float
    rope_theta: int
    vocab_size: int
    tie_word_embeddings: bool = False
    weight_block_size: Optional[Tuple[int, int]] = None
    ...

class Qwen3Attention(nn.Module):
    def __init__(self, config: Qwen3Config): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class Qwen3Embedding(nn.Embedding):
    def lm_head_forward(self, x: nn.Tensor): ...

class Qwen3MLP(nn.Module):
    def __init__(self, config: Qwen3Config): ...
    def forward(self, x: Tensor): ...

class Qwen3DecoderLayer(nn.Module):
    def __init__(self, config: Qwen3Config): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class Qwen3LMHeadModel(nn.Module):
    def __init__(self, config: Qwen3Config): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, input_embeds, paged_kv_cache, logit_positions=None): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...
Import
from mlc_llm.model.qwen3.qwen3_model import Qwen3Config, Qwen3LMHeadModel
I/O Contract
| Method | Input | Output | Description |
|---|---|---|---|
| embed | input_ids: Tensor[seq_len] (int32) | Tensor[1, seq_len, hidden_size] | Converts token IDs to embeddings via Qwen3Embedding |
| prefill | input_embed: Tensor[1, seq_len, hidden_size], paged_kv_cache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Full prompt processing with last-token logit extraction; uses tied or separate LM head |
| decode | input_embed: Tensor[1, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Single-token autoregressive decoding |
| batch_prefill | input_embeds, logit_positions, paged_kv_cache | (logits, paged_kv_cache) | Batched prefill with selective logit extraction |
| batch_decode | input_embeds: Tensor[batch_size, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Batched single-token decoding |
| batch_verify | input_embeds, paged_kv_cache | (logits, paged_kv_cache) | Batched speculative verification |
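The prefill contract above (last-token logit extraction, with a tied LM head when weight tying is on) can be sketched in plain Python. This is an illustration only: `tied_lm_head` and `prefill_logits` are hypothetical names, nested lists stand in for tensors, and the batch dimension is omitted.

```python
def tied_lm_head(hidden, embedding):
    """Compute logits[v] = dot(hidden, embedding[v]), i.e. hidden @ embedding^T.

    With weight tying, the embedding table doubles as the output projection.
    """
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]


def prefill_logits(hidden_states, embedding):
    """Keep only the last position's hidden state, then project to the vocabulary."""
    last_hidden = hidden_states[-1]
    return tied_lm_head(last_hidden, embedding)


# Toy sizes: vocab_size = 3, hidden_size = 2, seq_len = 2.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states = [[0.5, 0.5], [2.0, -1.0]]
logits = prefill_logits(hidden_states, embedding)  # one logit per vocabulary entry
```

Only the final position is projected because, during prefill, the next-token prediction depends on the last token's hidden state alone. That is why the logits shape in the table is [1, 1, vocab_size] rather than [1, seq_len, vocab_size].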
| Architectural Feature | Details |
|---|---|
| QK Normalization | Per-head RMSNorm on query (q_norm) and key (k_norm) tensors after projection and before attention |
| Grouped-Query Attention | Separate num_attention_heads and num_key_value_heads; num_key_value_heads must be divisible by tensor_parallel_shards |
| Configurable Activation | ACT2FN map supporting gelu, relu, silu/swish, gelu_new |
| Weight Tying | Optional tie_word_embeddings using Qwen3Embedding.lm_head_forward |
| FP8 Quantization | Optional weight_block_size for e4m3 dynamic FP8 block-wise quantization |
| Configurable Bias | attention_bias parameter controls whether QKV and output projections include bias |
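The interaction between grouped-query attention and tensor parallelism can be sketched as a small head-partitioning check. The helper `heads_per_shard` is hypothetical, and the sketch assumes the constraint is that tensor_parallel_shards evenly divides num_key_value_heads, so that every shard owns whole KV heads.

```python
def heads_per_shard(num_attention_heads, num_key_value_heads, tensor_parallel_shards):
    """Return (query heads, KV heads) owned by each shard (hypothetical helper)."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError(
            "num_attention_heads must be a multiple of num_key_value_heads"
        )
    if num_key_value_heads % tensor_parallel_shards != 0:
        raise ValueError(
            "num_key_value_heads must be divisible by tensor_parallel_shards"
        )
    return (
        num_attention_heads // tensor_parallel_shards,
        num_key_value_heads // tensor_parallel_shards,
    )


# 32 query heads sharing 8 KV heads, split across 4 shards.
q_per_shard, kv_per_shard = heads_per_shard(32, 8, 4)
```

With these numbers each shard holds 8 query heads and 2 KV heads, so the 4:1 query-to-KV grouping is preserved inside every shard. A shard count of 3 would be rejected because 8 KV heads cannot be split evenly.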
Usage Examples
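# Qwen3 model configuration
from mlc_llm.model.qwen3.qwen3_model import Qwen3Config, Qwen3LMHeadModel

config = Qwen3Config(
    hidden_act="silu",
    hidden_size=4096,
    intermediate_size=11008,
    attention_bias=True,
    num_attention_heads=32,
    num_hidden_layers=32,
    num_key_value_heads=8,  # GQA: 4 query heads share each KV head
    rms_norm_eps=1e-6,
    rope_theta=1000000,
    vocab_size=151936,
    context_window_size=32768,
)
model = Qwen3LMHeadModel(config)
model.to("float16")  # cast parameters to float16
spec = model.get_default_spec()  # module spec used when exporting for compilation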
# With weight tying enabled
config_tied = Qwen3Config(
    hidden_act="silu",
    hidden_size=2048,
    intermediate_size=5504,
    attention_bias=False,
    num_attention_heads=16,
    num_hidden_layers=24,
    num_key_value_heads=4,
    rms_norm_eps=1e-6,
    rope_theta=1000000,
    vocab_size=151936,
    tie_word_embeddings=True,
    context_window_size=32768,
)
model_tied = Qwen3LMHeadModel(config_tied)