Implementation:Mlc ai Mlc llm Phi3 Model

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	Machine Learning, Large Language Models, Model Architecture
Last Updated	2026-02-09 19:00 GMT

Overview

Implements the Microsoft Phi-3 transformer-based language model architecture for deployment through the MLC-LLM compilation pipeline using TVM Relax.

Description

This module defines the complete Phi-3 model architecture. Compared with Phi-2, Phi-3 moves to a more standard Llama-like design with RMSNorm, a SiLU-gated MLP, and support for LongRoPE scaling. Word embeddings can optionally be tied: the embedding weight matrix is then reused as the LM head by multiplying the hidden states with its transpose.
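Weight tying can be illustrated with a toy, pure-Python sketch. This is not the MLC-LLM code: `TiedEmbedding` is a hypothetical name, and plain nested lists stand in for tensors. It only shows that the same `[vocab_size, hidden_size]` matrix serves both the lookup and the output projection.

```python
from typing import List

Matrix = List[List[float]]


class TiedEmbedding:
    """Hypothetical stand-in for an embedding whose weight doubles as the LM head."""

    def __init__(self, weight: Matrix):
        # weight has shape [vocab_size, hidden_size]
        self.weight = weight

    def forward(self, token_ids: List[int]) -> Matrix:
        # Input side: look up one row of the matrix per token ID.
        return [self.weight[t] for t in token_ids]

    def lm_head_forward(self, hidden: Matrix) -> Matrix:
        # Output side: hidden @ weight.T, one logit per vocabulary entry.
        return [
            [sum(h * w for h, w in zip(row, vocab_row)) for vocab_row in self.weight]
            for row in hidden
        ]


emb = TiedEmbedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # vocab_size=3, hidden_size=2
hidden_states = emb.forward([2, 0])
logits = emb.lm_head_forward(hidden_states)
```

Because no separate LM head matrix is stored, tying saves `vocab_size * hidden_size` parameters.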

The module contains the following key classes:

Phi3Config -- Configuration dataclass supporting RoPE scaling (LongRoPE with "su"/"longrope" type), tied word embeddings, partial rotary factor, and standard transformer hyperparameters.
Phi3Embedding -- A specialized embedding class that extends nn.Embedding with a lm_head_forward method enabling weight tying between the input embedding and output projection.
Phi3MLP -- A gated MLP using a fused gate-up projection with SiLU activation and a down projection, with no bias terms.
PhiMHA -- Multi-head attention with grouped-query attention (GQA) support, fused QKV projection without bias, and PagedKVCache integration.
Phi3ParallelBlock -- A transformer block that, despite its name, runs attention and then the MLP sequentially rather than in parallel. Each sub-layer is preceded by RMSNorm and followed by a residual connection, with all-reduce support for tensor parallelism.
Phi3Model -- The backbone model with Phi3Embedding, a stack of Phi3ParallelBlock layers, and a final RMSNorm.
Phi3ForCausalLM -- The top-level causal LM supporting optional weight tying, LongRoPE scaling with extension factors, and partial rotary embeddings.
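The gated MLP described for Phi3MLP can be sketched in plain Python as follows. This is a minimal illustration on a single vector, not the module's code: the helper names (`silu`, `matvec`, `gated_mlp`) are hypothetical, and the split order of the fused projection (gate half first, then up half) is an assumption.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def silu(x: float) -> float:
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def matvec(weight: Matrix, x: Vector) -> Vector:
    # weight has shape [out_features, in_features]; no bias term.
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def gated_mlp(x: Vector, gate_up_weight: Matrix, down_weight: Matrix) -> Vector:
    # A single fused projection yields 2 * intermediate_size values.
    fused = matvec(gate_up_weight, x)
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]  # assumed order: gate, then up
    activated = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(down_weight, activated)


# hidden_size=2, intermediate_size=2
gate_up_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
down_weight = [[1.0, 0.0], [0.0, 1.0]]
out = gated_mlp([0.0, 2.0], gate_up_weight, down_weight)
```

Fusing the gate and up projections replaces two matrix multiplications with one wider one, which is why the fused weight has `2 * intermediate_size` output rows.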

Usage

Use this module when compiling and deploying Phi-3 models (including Phi-3-mini, Phi-3-small, and Phi-3-medium variants) through MLC-LLM. LongRoPE scaling, tied embeddings, and the partial rotary factor are all driven by the model configuration, so they need no separate setup.

Code Reference

Source Location

Repository: Mlc_ai_Mlc_llm
File: python/mlc_llm/model/phi3/phi3_model.py

Signature

@dataclasses.dataclass
class Phi3Config(ConfigBase):
    model_type: str
    hidden_size: int
    vocab_size: int
    num_hidden_layers: int
    num_attention_heads: int
    intermediate_size: int
    rms_norm_eps: float
    num_key_value_heads: int
    max_position_embeddings: int
    rope_scaling: Optional[Dict[str, Any]] = None
    original_max_position_embeddings: int = 0
    tie_word_embeddings: bool = False
    partial_rotary_factor: float = 1.0
    ...

class Phi3Embedding(nn.Embedding):
    def lm_head_forward(self, x: nn.Tensor): ...

class Phi3MLP(nn.Module):
    def __init__(self, config: Phi3Config): ...
    def forward(self, hidden_states: Tensor): ...

class PhiMHA(nn.Module):
    def __init__(self, config: Phi3Config): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class Phi3ForCausalLM(nn.Module):
    def __init__(self, config: Phi3Config): ...
    def get_logits(self, hidden_states: Tensor): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.phi3.phi3_model import Phi3Config, Phi3ForCausalLM

I/O Contract

Method	Input	Output	Description
embed	input_ids: Tensor[seq_len] (int32)	Tensor[1, seq_len, hidden_size]	Converts token IDs to embeddings via Phi3Embedding
prefill	input_embed: Tensor[1, seq_len, hidden_size], paged_kv_cache	(logits: Tensor[1, 1, vocab_size], paged_kv_cache)	Processes full prompt; returns last-token logits (uses tied or separate LM head)
decode	input_embed: Tensor[1, 1, hidden_size], paged_kv_cache	(logits, paged_kv_cache)	Single-token autoregressive decoding
batch_prefill	input_embeds, logit_positions, paged_kv_cache	(logits, paged_kv_cache)	Batched prefill with selective logit extraction
batch_decode	input_embeds: Tensor[batch_size, 1, hidden_size], paged_kv_cache	(logits, paged_kv_cache)	Batched single-token decoding
batch_verify	input_embeds, paged_kv_cache	(logits, paged_kv_cache)	Batched speculative verification
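The single-sequence rows of the table can be written down as a small shape helper. `expected_shapes` is a hypothetical function for illustration only. The `decode` logits shape is an assumption, since the table does not state it, and the batched methods are left out for the same reason.

```python
from typing import Dict, Tuple

Shape = Tuple[int, ...]


def expected_shapes(seq_len: int, hidden_size: int, vocab_size: int) -> Dict[str, Dict[str, Shape]]:
    """Tensor shapes for the single-sequence entry points listed in the table."""
    return {
        "embed": {"input": (seq_len,), "output": (1, seq_len, hidden_size)},
        # prefill returns logits for the last token only.
        "prefill": {"input": (1, seq_len, hidden_size), "output": (1, 1, vocab_size)},
        # Assumption: decode logits have the same shape as prefill logits.
        "decode": {"input": (1, 1, hidden_size), "output": (1, 1, vocab_size)},
    }


shapes = expected_shapes(seq_len=5, hidden_size=3072, vocab_size=32064)
```

The output of `embed` has the same shape as the input of `prefill`, which reflects how the two calls chain together.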

Architectural Feature	Details
Normalization	RMSNorm (no bias)
Activation	SiLU gate in the MLP
Weight Tying	Optional tie_word_embeddings: embedding weight shared with LM head via transposed matmul
RoPE Scaling	LongRoPE with short_factor and long_factor extension arrays; "su" type auto-converted to "longrope"
Partial Rotary	Configurable partial_rotary_factor (default 1.0, meaning full rotary)
Bias	No bias on attention or MLP projections
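Two of the rows above can be sketched in a few lines: the "su" to "longrope" conversion and the effect of the partial rotary factor. Both helpers are hypothetical and only show the idea. In particular, computing the rotary dimension as `int(head_dim * partial_rotary_factor)` is an assumption about how the factor is applied.

```python
from typing import Any, Dict, Optional


def normalize_rope_scaling(rope_scaling: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return a copy of rope_scaling with the legacy "su" type renamed to "longrope"."""
    if rope_scaling is None:
        return None
    normalized = dict(rope_scaling)  # shallow copy; the caller's dict is left untouched
    if normalized.get("type") == "su":
        normalized["type"] = "longrope"
    return normalized


def rotary_dim(hidden_size: int, num_attention_heads: int, partial_rotary_factor: float = 1.0) -> int:
    """Number of head dimensions that receive rotary embeddings (assumed formula)."""
    head_dim = hidden_size // num_attention_heads
    return int(head_dim * partial_rotary_factor)


legacy = {"type": "su", "short_factor": [1.0], "long_factor": [1.0]}
converted = normalize_rope_scaling(legacy)
full = rotary_dim(3072, 32)          # default factor 1.0: every dimension is rotated
partial = rotary_dim(3072, 32, 0.5)  # only half of each head is rotated
```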

Usage Examples

from mlc_llm.model.phi3.phi3_model import Phi3Config, Phi3ForCausalLM

# Phi-3 Mini configuration with LongRoPE
config = Phi3Config(
    model_type="phi3",
    hidden_size=3072,
    vocab_size=32064,
    num_hidden_layers=32,
    num_attention_heads=32,
    intermediate_size=8192,
    rms_norm_eps=1e-5,
    num_key_value_heads=32,
    max_position_embeddings=131072,
    rope_scaling={
        "type": "longrope",
        # Placeholder factors; a real checkpoint's config supplies fitted per-dimension values.
        "short_factor": [1.0] * 48,
        "long_factor": [1.0] * 48,
    },
    original_max_position_embeddings=4096,
    tie_word_embeddings=False,
)

model = Phi3ForCausalLM(config)
model.to("float16")
spec = model.get_default_spec()
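Several sizes follow directly from the configuration values above. The sketch below derives them with plain arithmetic. `derived_dims` is a hypothetical helper, and the formulas assume the fused QKV and fused gate-up layouts described earlier, plus one LongRoPE factor per pair of rotary dimensions.

```python
from typing import Dict


def derived_dims(
    hidden_size: int,
    num_attention_heads: int,
    num_key_value_heads: int,
    intermediate_size: int,
    partial_rotary_factor: float = 1.0,
) -> Dict[str, int]:
    head_dim = hidden_size // num_attention_heads
    return {
        "head_dim": head_dim,
        # Fused QKV: all query heads plus one key and one value head per KV head.
        "qkv_out_features": (num_attention_heads + 2 * num_key_value_heads) * head_dim,
        # Fused gate-up: gate and up halves side by side.
        "gate_up_out_features": 2 * intermediate_size,
        # Assumed: one LongRoPE factor per rotated dimension pair.
        "rope_factor_len": int(head_dim * partial_rotary_factor) // 2,
    }


dims = derived_dims(
    hidden_size=3072,
    num_attention_heads=32,
    num_key_value_heads=32,
    intermediate_size=8192,
)
# dims == {"head_dim": 96, "qkv_out_features": 9216,
#          "gate_up_out_features": 16384, "rope_factor_len": 48}
```

The `rope_factor_len` of 48 matches the length of the `short_factor` and `long_factor` lists in the configuration.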
