Implementation:Mlc ai Mlc llm Qwen Model

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Large Language Models, Model Architecture
Last Updated: 2026-02-09 19:00 GMT

Overview

Implements the Alibaba QWen (Qwen-1) transformer-based language model architecture for deployment through the MLC-LLM compilation pipeline using TVM Relax.

Description

This module defines the complete QWen (first-generation Qwen) model architecture. QWen uses standard multi-head attention (MHA) rather than grouped-query attention: the number of key/value heads equals the number of query heads. Attention applies a bias on the fused QKV projection but not on the output projection. The MLP uses a gate-up projection whose output features are split in half for gating.
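
The fused, biased QKV projection described above can be sketched in plain Python. This is a minimal illustration, not the MLC-LLM code: `linear` and `fused_qkv` are hypothetical helpers, the toy sizes are made up, and the Q-then-K-then-V ordering of the fused output is an assumption.

```python
# Illustrative sketch only: a plain-Python stand-in for a fused QKV projection.
# linear() and fused_qkv() are hypothetical helpers, not MLC-LLM functions.

def linear(x, weight, bias=None):
    """Return weight @ x (+ bias) for a vector x and a [out][in] weight."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def fused_qkv(x, c_attn_weight, c_attn_bias, num_heads, head_dim):
    """Project once with bias, then split into per-head Q, K and V."""
    qkv = linear(x, c_attn_weight, c_attn_bias)
    width = num_heads * head_dim  # MHA: Q, K and V all have num_heads heads
    assert len(qkv) == 3 * width

    def to_heads(flat):
        return [flat[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

    # Assumed layout: all Q features, then all K features, then all V features.
    q = to_heads(qkv[0:width])
    k = to_heads(qkv[width:2 * width])
    v = to_heads(qkv[2 * width:3 * width])
    return q, k, v


# Toy sizes: hidden_size=2, num_heads=2, head_dim=1.
x = [1.0, 2.0]
c_attn_weight = [[1, 0], [0, 1], [1, 1], [1, -1], [2, 0], [0, 2]]
c_attn_bias = [0.5] * 6
q, k, v = fused_qkv(x, c_attn_weight, c_attn_bias, num_heads=2, head_dim=1)
print(q, k, v)
```

Because the key/value head count equals the query head count, the fused output is simply three equally sized slices; a grouped-query model would need a narrower K/V slice.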

The module contains the following key classes:

  • QWenConfig -- Configuration dataclass with QWen-specific parameters including rotary_emb_base (instead of the more common rope_theta), scale_attn_weights flag, and kv_channels. Derives head_dim and validates configuration constraints.
  • QWenAttention -- Multi-head attention with fused Q/K/V projection (c_attn) using bias=True and output projection (c_proj) using bias=False. Uses standard multi-head attention (num_kv_heads == num_q_heads) with PagedKVCache.
  • QWenMLP -- A gated MLP with a gate_up_proj linear layer that outputs intermediate_size features, which are then split in half. One half is gated by SiLU of the other half, followed by a down projection (c_proj).
  • QWenBlock -- A standard pre-norm transformer block: RMSNorm is applied before both self-attention and the MLP, the two sublayers run sequentially, and each is followed by a residual connection with tensor-parallel all-reduce.
  • QWenModel -- The backbone consisting of token embeddings (wte), a stack of QWenBlocks, and a final RMSNorm (ln_f).
  • QWenLMHeadModel -- The top-level causal language model. Notably, the LM head is initialized with dtype="float32" explicitly, and the KV cache uses num_attention_heads for both query and key/value heads.
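
The gated MLP in the list above can be sketched as follows. This is a rough plain-Python illustration under stated assumptions: `silu`, `dot`, and `gated_mlp` are hypothetical helpers, the projections are taken to be bias-free, and the choice of which half passes through SiLU (here the second) is an assumption rather than something this page specifies.

```python
import math

# Illustrative sketch only; helper names are hypothetical, not MLC-LLM APIs.

def silu(value):
    """SiLU activation: value * sigmoid(value)."""
    return value / (1.0 + math.exp(-value))


def dot(row, vec):
    return sum(w * v for w, v in zip(row, vec))


def gated_mlp(x, gate_up_weight, c_proj_weight):
    """gate_up_weight: [intermediate][hidden]; c_proj_weight: [hidden][intermediate // 2]."""
    gate_up = [dot(row, x) for row in gate_up_weight]  # intermediate_size features
    half = len(gate_up) // 2
    x1, x2 = gate_up[:half], gate_up[half:]            # split in half
    gated = [a * silu(b) for a, b in zip(x1, x2)]      # assumed: SiLU on second half
    return [dot(row, gated) for row in c_proj_weight]  # down projection


# Toy sizes: hidden_size=2, intermediate_size=4.
x = [1.0, 2.0]
gate_up_weight = [[1, 0], [0, 1], [1, 1], [0, 0]]
c_proj_weight = [[1, 0], [0, 1]]
mlp_out = gated_mlp(x, gate_up_weight, c_proj_weight)
print(mlp_out)
```

Note that the down projection only sees half of `intermediate_size` features, since the other half is consumed as the gate.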

Usage

Use this module when compiling and deploying first-generation QWen models, such as QWen-7B, QWen-14B, and QWen-72B, through MLC-LLM. The module supports tensor parallelism and paged KV cache inference.

Code Reference

Source Location

Signature

@dataclasses.dataclass
class QWenConfig(ConfigBase):
    vocab_size: int
    hidden_size: int
    num_hidden_layers: int
    num_attention_heads: int
    layer_norm_epsilon: float
    scale_attn_weights: bool
    kv_channels: int
    rotary_emb_base: int
    intermediate_size: int
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    tensor_parallel_shards: int = 1
    max_batch_size: int = 1
    head_dim: int = 0
    ...

class QWenAttention(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class QWenMLP(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def forward(self, x: Tensor): ...

class QWenBlock(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class QWenModel(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def forward(self, inputs: Tensor, paged_kv_cache: PagedKVCache): ...

class QWenLMHeadModel(nn.Module):
    def __init__(self, config: QWenConfig): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, inputs: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, inputs: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, inputs, paged_kv_cache, logit_positions=None): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.qwen.qwen_model import QWenConfig, QWenLMHeadModel

I/O Contract

| Method | Input | Output | Description |
| --- | --- | --- | --- |
| embed | input_ids: Tensor[seq_len] (int32) | Tensor[1, seq_len, hidden_size] | Converts token IDs to embeddings via wte |
| prefill | inputs: Tensor[1, seq_len, hidden_size], paged_kv_cache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Full prompt processing; returns last-token logits |
| decode | inputs: Tensor[1, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Single-token autoregressive decoding |
| batch_prefill | inputs, logit_positions, paged_kv_cache | (logits, paged_kv_cache) | Batched prefill with selective logit extraction |
| batch_decode | inputs: Tensor[batch_size, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Batched single-token decoding |
| batch_verify | inputs, paged_kv_cache | (logits, paged_kv_cache) | Batched speculative verification |

| Architectural Feature | Details |
| --- | --- |
| Attention Type | Standard MHA (no grouped-query attention); num_kv_heads == num_q_heads |
| Attention Bias | c_attn (QKV) has bias=True; c_proj (output) has bias=False |
| MLP Structure | gate_up_proj outputs intermediate_size features, split in half for SiLU gating; c_proj takes the gated half (intermediate_size / 2 features) as input |
| Normalization | RMSNorm without bias |
| RoPE | Uses rotary_emb_base (equivalent to rope_theta) |
| LM Head | Separate linear layer (bias=False) initialized with float32 dtype |
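
To make the prefill shape contract concrete, the sketch below shows one way last-token logits could be produced from a `[1, seq_len, hidden_size]` input. It is only an illustration: `last_token_logits` is a hypothetical helper working on nested lists, and it omits the transformer blocks and final norm that sit between the embeddings and the LM head.

```python
# Illustrative sketch only; last_token_logits() is a hypothetical helper.

def last_token_logits(hidden_states, lm_head_weight):
    """hidden_states: [1][seq_len][hidden]; lm_head_weight: [vocab][hidden].

    Returns logits shaped [1][1][vocab] for the final position only.
    """
    last = hidden_states[0][-1]  # keep only the last token's hidden state
    logits = [
        float(sum(w * h for w, h in zip(row, last)))  # bias-free LM head
        for row in lm_head_weight
    ]
    return [[logits]]


# Toy sizes: seq_len=3, hidden_size=2, vocab_size=3.
hidden_states = [[[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]]
lm_head_weight = [[1, 0], [0, 1], [1, 1]]
logits = last_token_logits(hidden_states, lm_head_weight)
print(logits)
```

Selecting the last position before the LM head avoids computing a `vocab_size`-wide projection for every prompt token, which is presumably why prefill returns a single row of logits.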

Usage Examples

from mlc_llm.model.qwen.qwen_model import QWenConfig, QWenLMHeadModel

# QWen-7B configuration
config = QWenConfig(
    vocab_size=151936,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    layer_norm_epsilon=1e-6,
    scale_attn_weights=True,
    kv_channels=128,
    rotary_emb_base=10000,
    intermediate_size=22016,
    context_window_size=8192,
)

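# Instantiate the model from the config and cast its parameters to float16
model = QWenLMHeadModel(config)
model.to("float16")
# Retrieve the default module spec declared by the model
spec = model.get_default_spec()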
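
As a quick sanity check on the QWen-7B values above, the arithmetic below derives the per-head dimension and the width of each MLP half. `derive_head_dim` is a hypothetical helper, and the rule it applies (`hidden_size // num_attention_heads` when `head_dim` is left at 0) is an assumption about how the config derives `head_dim`, not quoted source code.

```python
# Worked arithmetic only; derive_head_dim() is a hypothetical helper and the
# derivation rule is an assumption, not code taken from MLC-LLM.

def derive_head_dim(hidden_size, num_attention_heads, head_dim=0):
    if head_dim == 0:
        head_dim = hidden_size // num_attention_heads
    if head_dim * num_attention_heads != hidden_size:
        raise ValueError("hidden_size must equal head_dim * num_attention_heads")
    return head_dim


head_dim = derive_head_dim(hidden_size=4096, num_attention_heads=32)
mlp_half = 22016 // 2  # features in each half of the gate_up_proj output
print(head_dim, mlp_half)  # 128 11008
```

The derived value of 128 matches `kv_channels=128` in the example configuration.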
