Implementation:Mlc ai Mlc llm Qwen Model
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Large Language Models, Model Architecture |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Implements the Alibaba QWen (Qwen-1) transformer-based language model architecture for deployment through the MLC-LLM compilation pipeline using TVM Relax.
Description
This module defines the complete QWen (first-generation Qwen) model architecture. QWen uses standard multi-head attention (MHA) rather than grouped-query attention, so the number of key/value heads equals the number of query heads. Two features are distinctive: attention applies a bias on the fused QKV projection but not on the output projection, and the MLP uses a gate-up projection whose output features are split in half for gating.
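The gating step described above can be sketched in plain Python. This is a minimal illustration, not the MLC-LLM implementation: the helper names `silu` and `gated_activation` are hypothetical, and the choice of which half acts as the gate is an assumption made for the example.

```python
import math
from typing import List


def silu(x: float) -> float:
    """SiLU (swish) activation, x * sigmoid(x), written to avoid overflow."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def gated_activation(gate_up: List[float]) -> List[float]:
    """Split a gate_up_proj output in half and gate one half with SiLU of the other.

    Assumption: the first half is the value and the second half is the gate.
    """
    if len(gate_up) % 2 != 0:
        raise ValueError("gate_up_proj output must have an even number of features")
    half = len(gate_up) // 2
    value, gate = gate_up[:half], gate_up[half:]
    return [v * silu(g) for v, g in zip(value, gate)]


# A toy gate_up_proj output with 4 features yields 2 gated features,
# which is the width the down projection (c_proj) would then consume.
out = gated_activation([1.0, 2.0, 0.0, 3.0])
print(len(out))
```

The output has half as many features as the input, which is why the down projection operates on half of the gate-up width.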
The module contains the following key classes:
- QWenConfig -- Configuration dataclass with QWen-specific parameters including rotary_emb_base (instead of the more common rope_theta), scale_attn_weights flag, and kv_channels. Derives head_dim and validates configuration constraints.
- QWenAttention -- Multi-head attention with fused Q/K/V projection (c_attn) using bias=True and output projection (c_proj) using bias=False. Uses standard multi-head attention (num_kv_heads == num_q_heads) with PagedKVCache.
- QWenMLP -- A gated MLP with a gate_up_proj linear layer that outputs intermediate_size features, which are then split in half. One half is gated by SiLU of the other half, followed by a down projection (c_proj).
- QWenBlock -- A standard pre-norm transformer block: RMSNorm is applied before both self-attention and the MLP, the two sublayers run one after the other, and each residual connection is combined with a tensor-parallel all-reduce.
- QWenModel -- The backbone consisting of token embeddings (wte), a stack of QWenBlocks, and a final RMSNorm (ln_f).
- QWenLMHeadModel -- The top-level causal language model. Notably, the LM head is initialized with dtype="float32" explicitly, and the KV cache uses num_attention_heads for both query and key/value heads.
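The head_dim derivation mentioned for QWenConfig can be illustrated with a standalone dataclass. This is a sketch under assumptions, not the module's code: `ToyQWenConfig` is a hypothetical stand-in, and both the default rule (head_dim = hidden_size // num_attention_heads when left at 0) and the divisibility check are assumed here rather than confirmed by this page.

```python
import dataclasses


@dataclasses.dataclass
class ToyQWenConfig:
    """Hypothetical stand-in for QWenConfig with only the fields needed here."""

    hidden_size: int
    num_attention_heads: int
    head_dim: int = 0  # 0 means "derive from the other fields"

    def __post_init__(self) -> None:
        # Assumed rule: derive head_dim when it is left at its default of 0.
        if self.head_dim == 0:
            self.head_dim = self.hidden_size // self.num_attention_heads
        # Assumed constraint: the heads must tile the hidden dimension exactly.
        if self.head_dim * self.num_attention_heads != self.hidden_size:
            raise ValueError("hidden_size must equal num_attention_heads * head_dim")


cfg = ToyQWenConfig(hidden_size=4096, num_attention_heads=32)
print(cfg.head_dim)  # 4096 // 32
```

Doing the derivation in `__post_init__` keeps the check next to the fields it depends on, so an inconsistent configuration fails at construction time rather than during compilation.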
Usage
Use this module when compiling and deploying first-generation QWen models through MLC-LLM. This covers QWen-7B, QWen-14B, and QWen-72B. The module supports tensor parallelism and efficient paged KV cache inference.
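To make the tensor-parallelism point concrete, the sketch below computes per-shard sizes for a given tensor_parallel_shards value. The function `per_shard_sizes` is hypothetical, and the even-split rule (attention heads and MLP width divided equally across shards) is a common convention assumed here, not something this page confirms about the module.

```python
from typing import Dict


def per_shard_sizes(
    num_attention_heads: int,
    intermediate_size: int,
    tensor_parallel_shards: int,
) -> Dict[str, int]:
    """Return the share of heads and MLP width each shard would hold.

    Assumption: both quantities are split evenly across shards.
    """
    if tensor_parallel_shards < 1:
        raise ValueError("tensor_parallel_shards must be at least 1")
    if num_attention_heads % tensor_parallel_shards != 0:
        raise ValueError("num_attention_heads must be divisible by the shard count")
    if intermediate_size % tensor_parallel_shards != 0:
        raise ValueError("intermediate_size must be divisible by the shard count")
    return {
        "heads_per_shard": num_attention_heads // tensor_parallel_shards,
        "intermediate_per_shard": intermediate_size // tensor_parallel_shards,
    }


# Example with 32 heads and an MLP width of 22016 split across 2 shards.
sizes = per_shard_sizes(32, 22016, 2)
print(sizes)
```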
Code Reference
Source Location
- Repository: Mlc_ai_Mlc_llm
- File: python/mlc_llm/model/qwen/qwen_model.py
Signature
@dataclasses.dataclass
class QWenConfig(ConfigBase):
    vocab_size: int
    hidden_size: int
    num_hidden_layers: int
    num_attention_heads: int
    layer_norm_epsilon: float
    scale_attn_weights: bool
    kv_channels: int
    rotary_emb_base: int
    intermediate_size: int
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    tensor_parallel_shards: int = 1
    max_batch_size: int = 1
    head_dim: int = 0
    ...

class QWenAttention(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class QWenMLP(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def forward(self, x: Tensor): ...

class QWenBlock(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class QWenModel(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def forward(self, inputs: Tensor, paged_kv_cache: PagedKVCache): ...

class QWenLMHeadModel(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, inputs: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, inputs: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, inputs, paged_kv_cache, logit_positions=None): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...
Import
from mlc_llm.model.qwen.qwen_model import QWenConfig, QWenLMHeadModel
I/O Contract
| Method | Input | Output | Description |
|---|---|---|---|
| embed | input_ids: Tensor[seq_len] (int32) | Tensor[1, seq_len, hidden_size] | Converts token IDs to embeddings via wte |
| prefill | inputs: Tensor[1, seq_len, hidden_size], paged_kv_cache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Full prompt processing; returns last-token logits |
| decode | inputs: Tensor[1, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Single-token autoregressive decoding |
| batch_prefill | inputs, logit_positions, paged_kv_cache | (logits, paged_kv_cache) | Batched prefill with selective logit extraction |
| batch_decode | inputs: Tensor[batch_size, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Batched single-token decoding |
| batch_verify | inputs, paged_kv_cache | (logits, paged_kv_cache) | Batched speculative verification |

| Architectural Feature | Details |
|---|---|
| Attention Type | Standard MHA (no grouped-query attention); num_kv_heads == num_q_heads |
| Attention Bias | c_attn (QKV) has bias=True; c_proj (output) has bias=False |
| MLP Structure | gate_up_proj outputs intermediate_size features, split in half for SiLU gating; c_proj takes the resulting intermediate_size / 2 features as input |
| Normalization | RMSNorm without bias |
| RoPE | Uses rotary_emb_base (equivalent to rope_theta) |
| LM Head | Separate linear layer (bias=False) initialized with float32 dtype |
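Because the number of key/value heads equals the number of query heads, the KV cache grows with the full head count. The arithmetic can be sketched as follows. This is a rough estimate under stated assumptions (float16 storage at 2 bytes per element, one sequence, no paging or allocator overhead), and `kv_cache_bytes` is a hypothetical helper, not part of MLC-LLM.

```python
def kv_cache_bytes(
    num_hidden_layers: int,
    num_attention_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_element: int = 2,  # assumption: float16 cache entries
) -> int:
    """Estimate KV cache size for one sequence under standard MHA.

    Each layer stores one key and one value tensor, each holding
    seq_len * num_attention_heads * head_dim elements.
    """
    per_layer_elements = 2 * seq_len * num_attention_heads * head_dim
    return num_hidden_layers * per_layer_elements * bytes_per_element


# Values matching the QWen-7B usage example: 32 layers, 32 heads,
# head dimension 128, and an 8192-token context.
total = kv_cache_bytes(32, 32, 128, 8192)
print(total / 2**30, "GiB")
```

With grouped-query attention the head count in this formula would be the smaller number of key/value heads; under MHA there is no such reduction.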
Usage Examples
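from mlc_llm.model.qwen.qwen_model import QWenConfig, QWenLMHeadModel

# QWen-7B configuration
config = QWenConfig(
    vocab_size=151936,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    layer_norm_epsilon=1e-6,
    scale_attn_weights=True,
    kv_channels=128,
    rotary_emb_base=10000,
    intermediate_size=22016,
    context_window_size=8192,
)

# Build the top-level causal LM and cast its parameters to float16
model = QWenLMHeadModel(config)
model.to("float16")

# Retrieve the default module spec used when exporting the model
spec = model.get_default_spec()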