Implementation:Mlc ai Mlc llm Phi3 Model
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Large Language Models, Model Architecture |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Implements the Microsoft Phi-3 transformer-based language model architecture for deployment through the MLC-LLM compilation pipeline using TVM Relax.
Description
This module defines the complete Phi-3 model architecture. Unlike Phi-2, Phi-3 adopts a Llama-like design: RMSNorm, a SiLU-gated MLP, and support for LongRoPE scaling. It also supports optional tied word embeddings, where the embedding weight matrix doubles as the LM head and is applied through a transposed matrix multiplication.
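The weight-tying idea described above can be illustrated with a minimal NumPy sketch. It is not the MLC-LLM implementation: the toy sizes, the module-level `embedding_weight`, and the free functions `embed` and `lm_head_forward` are assumptions made for illustration, mirroring only the shape behavior of the described `Phi3Embedding`.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 8, 4  # toy sizes (assumption), not Phi-3's real dimensions

# One weight matrix of shape [vocab_size, hidden_size], used on both ends of the model.
embedding_weight = rng.standard_normal((vocab_size, hidden_size)).astype(np.float32)


def embed(token_ids: np.ndarray) -> np.ndarray:
    """Input side: row lookup, [seq_len] -> [seq_len, hidden_size]."""
    return embedding_weight[token_ids]


def lm_head_forward(hidden_states: np.ndarray) -> np.ndarray:
    """Output side: multiply by the transposed matrix, [..., hidden_size] -> [..., vocab_size]."""
    return hidden_states @ embedding_weight.T


token_ids = np.array([3, 1, 5])
hidden = embed(token_ids)        # shape (3, 4)
logits = lm_head_forward(hidden)  # shape (3, 8)
```

Because both directions read the same array, tying removes one `vocab_size x hidden_size` parameter matrix from the model.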
The module contains the following key classes:
- Phi3Config -- Configuration dataclass supporting RoPE scaling (LongRoPE with "su"/"longrope" type), tied word embeddings, partial rotary factor, and standard transformer hyperparameters.
- Phi3Embedding -- A specialized embedding class that extends nn.Embedding with a lm_head_forward method enabling weight tying between the input embedding and output projection.
- Phi3MLP -- A gated MLP using a fused gate-up projection with SiLU activation and a down projection, with no bias terms.
- PhiMHA -- Multi-head attention with grouped-query attention (GQA) support, fused QKV projection without bias, and PagedKVCache integration.
- Phi3ParallelBlock -- A transformer block with pre-norm RMSNorm on both attention and MLP sub-layers, sequential (not parallel) attention-then-MLP computation, and residual connections with tensor-parallel all-reduce support.
- Phi3Model -- The backbone model with Phi3Embedding, a stack of Phi3ParallelBlock layers, and a final RMSNorm.
- Phi3ForCausalLM -- The top-level causal LM supporting optional weight tying, LongRoPE scaling with extension factors, and partial rotary embeddings.
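As a rough illustration of the Phi3MLP and Phi3ParallelBlock entries above, the following NumPy sketch shows a fused gate-up MLP with SiLU gating inside a sequential pre-norm block. It is a sketch under stated assumptions, not the module's code: the parameter names (`input_norm`, `post_attn_norm`, `w_gate_up`, `w_down`), the gate-then-up ordering of the fused projection, and the stand-in `attention_fn` are all hypothetical.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: scale by the reciprocal root-mean-square over the last axis (no bias)."""
    variance = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * weight


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def gated_mlp(x, w_gate_up, w_down):
    """Fused gate-up projection, SiLU gating, then down projection (no bias terms).

    w_gate_up: [hidden, 2 * intermediate]; w_down: [intermediate, hidden].
    Assumption: the first half of the fused output is the gate, the second half is "up".
    """
    gate, up = np.split(x @ w_gate_up, 2, axis=-1)
    return (silu(gate) * up) @ w_down


def block_forward(x, attention_fn, p):
    """Sequential pre-norm block: attention sub-layer first, then the MLP sub-layer."""
    x = x + attention_fn(rms_norm(x, p["input_norm"]))
    x = x + gated_mlp(rms_norm(x, p["post_attn_norm"]), p["w_gate_up"], p["w_down"])
    return x


rng = np.random.default_rng(0)
hidden_size, intermediate_size = 4, 6  # toy sizes (assumption)
params = {
    "input_norm": np.ones(hidden_size),
    "post_attn_norm": np.ones(hidden_size),
    "w_gate_up": rng.standard_normal((hidden_size, 2 * intermediate_size)),
    "w_down": rng.standard_normal((intermediate_size, hidden_size)),
}
x = rng.standard_normal((3, hidden_size))
# Stand-in attention that contributes nothing, so only the MLP path is exercised.
out = block_forward(x, lambda h: np.zeros_like(h), params)
```

Fusing the gate and up projections into one matrix means a single matmul produces both halves, which are then split along the last axis.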
Usage
Use this module when compiling and deploying Phi-3 models (including Phi-3-mini, Phi-3-small, and Phi-3-medium variants) through MLC-LLM. The module reads the LongRoPE configuration, tied-embedding flag, and partial rotary factor from the model config and applies them automatically across hardware targets.
Code Reference
Source Location
- Repository: Mlc_ai_Mlc_llm
- File: python/mlc_llm/model/phi3/phi3_model.py
Signature
@dataclasses.dataclass
class Phi3Config(ConfigBase):
    model_type: str
    hidden_size: int
    vocab_size: int
    num_hidden_layers: int
    num_attention_heads: int
    intermediate_size: int
    rms_norm_eps: float
    num_key_value_heads: int
    max_position_embeddings: int
    rope_scaling: Optional[Dict[str, Any]] = None
    original_max_position_embeddings: int = 0
    tie_word_embeddings: bool = False
    partial_rotary_factor: float = 1.0
    ...

class Phi3Embedding(nn.Embedding):
    def lm_head_forward(self, x: nn.Tensor): ...

class Phi3MLP(nn.Module):
    def __init__(self, config: Phi3Config): ...
    def forward(self, hidden_states: Tensor): ...

class PhiMHA(nn.Module):
    def __init__(self, config: Phi3Config): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class Phi3ForCausalLM(nn.Module):
    def __init__(self, config: Phi3Config): ...
    def get_logits(self, hidden_states: Tensor): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...
Import
from mlc_llm.model.phi3.phi3_model import Phi3Config, Phi3ForCausalLM
I/O Contract
| Method | Input | Output | Description |
|---|---|---|---|
| embed | input_ids: Tensor[seq_len] (int32) | Tensor[1, seq_len, hidden_size] | Converts token IDs to embeddings via Phi3Embedding |
| prefill | input_embed: Tensor[1, seq_len, hidden_size], paged_kv_cache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Processes full prompt; returns last-token logits (uses tied or separate LM head) |
| decode | input_embed: Tensor[1, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Single-token autoregressive decoding |
| batch_prefill | input_embeds, logit_positions, paged_kv_cache | (logits, paged_kv_cache) | Batched prefill with selective logit extraction |
| batch_decode | input_embeds: Tensor[batch_size, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Batched single-token decoding |
| batch_verify | input_embeds, paged_kv_cache | (logits, paged_kv_cache) | Batched speculative verification |
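The difference between the `prefill` and `batch_prefill` rows above comes down to which positions are projected to logits. The NumPy sketch below illustrates that shape contract only; the `lm_head` matrix, the toy sizes, and the helper names `prefill_logits` and `batch_prefill_logits` are assumptions for illustration, not functions of the module.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size, vocab_size = 5, 4, 8  # toy sizes (assumption)
lm_head = rng.standard_normal((hidden_size, vocab_size))


def prefill_logits(hidden_states: np.ndarray) -> np.ndarray:
    """[1, seq_len, hidden] -> [1, 1, vocab]: project only the final position."""
    last = hidden_states[:, -1:, :]
    return last @ lm_head


def batch_prefill_logits(hidden_states: np.ndarray, logit_positions: np.ndarray) -> np.ndarray:
    """[1, total_len, hidden] -> [1, num_positions, vocab]: project only the selected positions."""
    selected = np.take(hidden_states, logit_positions, axis=1)
    return selected @ lm_head


hidden_states = rng.standard_normal((1, seq_len, hidden_size))
single = prefill_logits(hidden_states)                            # shape (1, 1, 8)
picked = batch_prefill_logits(hidden_states, np.array([1, 4]))    # shape (1, 2, 8)
```

Selecting positions before the LM head avoids a `vocab_size`-wide projection for every prompt token, which is where most of the saving comes from on long prompts.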

| Architectural Feature | Details |
|---|---|
| Normalization | RMSNorm (no bias) |
| Activation | SiLU gate in the MLP |
| Weight Tying | Optional tie_word_embeddings: embedding weight shared with LM head via transposed matmul |
| RoPE Scaling | LongRoPE with short_factor and long_factor extension arrays; "su" type auto-converted to "longrope" |
| Partial Rotary | Configurable partial_rotary_factor (default 1.0, meaning full rotary) |
| Bias | No bias on attention or MLP projections |
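To make the RoPE Scaling and Partial Rotary rows more concrete, here is a small pure-Python sketch. It follows the commonly used LongRoPE formulation (per-pair extension factors dividing the base frequencies, plus a logarithmic attention scale) and is not taken from this module: the helper names, the `theta=10000.0` default, and the exact scale formula are assumptions.

```python
import math


def rotary_dim(hidden_size: int, num_attention_heads: int, partial_rotary_factor: float = 1.0) -> int:
    """Number of head dimensions that receive rotary embeddings."""
    head_dim = hidden_size // num_attention_heads
    return int(head_dim * partial_rotary_factor)


def normalize_rope_type(rope_scaling: dict) -> dict:
    """Return a copy with the legacy "su" type renamed to "longrope"."""
    out = dict(rope_scaling)
    if out.get("type") == "su":
        out["type"] = "longrope"
    return out


def longrope_inv_freq(ext_factors: list, dim: int, theta: float = 10000.0) -> list:
    """One inverse frequency per rotary pair, each divided by its extension factor."""
    return [1.0 / (f * theta ** (2 * i / dim)) for i, f in enumerate(ext_factors)]


def longrope_attention_scale(max_pos: int, original_max_pos: int) -> float:
    """Scale applied to cos/sin when the context is extended (assumed formulation)."""
    scale = max_pos / original_max_pos
    if scale <= 1.0:
        return 1.0
    return math.sqrt(1.0 + math.log(scale) / math.log(original_max_pos))


dim = rotary_dim(hidden_size=3072, num_attention_heads=32)  # 96 with the default factor of 1.0
scaling = normalize_rope_type({"type": "su", "short_factor": [1.0] * (dim // 2)})
inv_freq = longrope_inv_freq(scaling["short_factor"], dim)
attn_scale = longrope_attention_scale(max_pos=131072, original_max_pos=4096)
```

With all factors set to 1.0 the inverse frequencies reduce to plain RoPE, which is a quick sanity check when wiring up `short_factor` and `long_factor`.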
Usage Examples
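from mlc_llm.model.phi3.phi3_model import Phi3Config, Phi3ForCausalLM

# Phi-3 Mini configuration with LongRoPE
config = Phi3Config(
    model_type="phi3",
    hidden_size=3072,
    vocab_size=32064,
    num_hidden_layers=32,
    num_attention_heads=32,
    intermediate_size=8192,
    rms_norm_eps=1e-5,
    num_key_value_heads=32,
    max_position_embeddings=131072,
    rope_scaling={
        "type": "longrope",
        # head_dim = 3072 / 32 = 96, so each factor array has 96 / 2 = 48 entries.
        # The 1.0 values are placeholders; real checkpoints ship tuned factors.
        "short_factor": [1.0] * 48,
        "long_factor": [1.0] * 48,
    },
    original_max_position_embeddings=4096,
    tie_word_embeddings=False,
)

model = Phi3ForCausalLM(config)
model.to("float16")  # cast parameters to float16
spec = model.get_default_spec()  # spec of the functions exported for compilation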