Implementation:Mlc ai Mlc llm Phi Model
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Large Language Models, Model Architecture |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Implements the transformer-based language model architectures of the Microsoft Phi family (Phi-1, Phi-1.5, and Phi-2) for deployment through the MLC-LLM compilation pipeline using TVM Relax.
Description
This module implements the Microsoft Phi family of small language models. Phi-1 and Phi-1.5 are configured through Phi1Config, and Phi-2 through PhiConfig. The key architectural distinction of the Phi models is the parallel attention-MLP block: the attention and MLP sub-layers are computed in parallel from the same normalized input rather than sequentially, and their outputs are then summed with the residual.
The module contains the following key classes:
- Phi1Config -- Configuration for Phi-1 and Phi-1.5 models, using standard naming conventions (hidden_size, num_attention_heads, etc.). Supports partial rotary factor for RoPE.
- PhiConfig -- Configuration for Phi-2, using Microsoft-specific naming conventions (n_embd, n_head, n_layer, etc.). Includes a static method from_phi1 to convert Phi1Config into PhiConfig for unified model handling.
- PhiMLP -- A two-layer MLP using GELU activation with tanh approximation, with bias on both layers.
- PhiMHA -- Multi-head attention with grouped-query attention support, fused QKV projection with bias, and PagedKVCache integration.
- PhiParallelBlock -- The parallel transformer block where LayerNorm is applied once, then attention and MLP are computed in parallel. The outputs are summed with the residual in a single step. Supports tensor-parallel bias sharding.
- PhiCausalLMHead -- An LM head with a LayerNorm followed by a linear projection to vocabulary size.
- PhiModel -- The transformer backbone with embedding and a stack of parallel blocks.
- PhiForCausalLM -- The top-level model providing standard inference methods (embed, prefill, decode, batch operations) and paged KV cache creation with partial rotary dimension support.
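The parallel block described for PhiParallelBlock can be sketched in plain Python. This is a minimal illustration, not the module's code: `toy_attention` and `toy_mlp` are hypothetical stand-ins for PhiMHA and PhiMLP, the LayerNorm has no learned scale or shift, and a single hidden vector is used instead of a tensor. It contrasts the parallel form with a conventional sequential pre-norm block.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_attention(h):
    # Hypothetical stand-in for PhiMHA: an element-wise ReLU.
    return [max(v, 0.0) for v in h]

def toy_mlp(h):
    # Hypothetical stand-in for PhiMLP: an element-wise square.
    return [v * v for v in h]

def parallel_block(x):
    h = layer_norm(x)            # LayerNorm is applied once
    attn_out = toy_attention(h)  # both branches read the same normalized input
    mlp_out = toy_mlp(h)
    # residual + attention + MLP, summed in a single step
    return [r + a + m for r, a, m in zip(x, attn_out, mlp_out)]

def sequential_block(x):
    # Conventional pre-norm layout: the MLP sees the attention-updated residual.
    x = [r + a for r, a in zip(x, toy_attention(layer_norm(x)))]
    return [r + m for r, m in zip(x, toy_mlp(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
y_parallel = parallel_block(x)
y_sequential = sequential_block(x)
```

In the parallel form the two branches have no data dependency on each other, which is what allows them to be computed side by side; the two layouts generally produce different outputs for the same weights.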
Usage
Use this module when compiling and deploying Phi-1, Phi-1.5, or Phi-2 models through MLC-LLM. The module automatically handles both Phi1Config and PhiConfig formats, converting Phi1Config to PhiConfig internally. It supports tensor parallelism and paged KV cache for efficient inference.
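To make the Phi1Config to PhiConfig conversion concrete, the sketch below works through the arithmetic for the partial rotary dimension. The function `phi1_to_phi_fields` is hypothetical, and the field mapping (hidden_size to n_embd, and so on) is an assumption inferred from the field names; the actual from_phi1 method may differ in detail.

```python
def phi1_to_phi_fields(hidden_size, intermediate_size, num_hidden_layers,
                       num_attention_heads, partial_rotary_factor):
    """Hypothetical mapping from Phi1Config-style fields to PhiConfig-style fields."""
    head_dim = hidden_size // num_attention_heads
    return {
        "n_embd": hidden_size,             # assumed correspondence by name
        "n_inner": intermediate_size,      # assumed correspondence by name
        "n_layer": num_hidden_layers,      # assumed correspondence by name
        "n_head": num_attention_heads,     # assumed correspondence by name
        # Only a fraction of each head's dimensions receives rotary embedding.
        "rotary_dim": int(partial_rotary_factor * head_dim),
    }

# Phi1Config defaults: 2048 hidden units, 32 heads, partial_rotary_factor 0.5
phi_fields = phi1_to_phi_fields(2048, 8192, 24, 32, 0.5)
print(phi_fields["rotary_dim"])
```

With the Phi1Config defaults, head_dim is 2048 / 32 = 64 and half of it is 32, which equals the default rotary_dim listed for PhiConfig.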
Code Reference
Source Location
- Repository: Mlc_ai_Mlc_llm
- File: python/mlc_llm/model/phi/phi_model.py
Signature
@dataclasses.dataclass
class Phi1Config(ConfigBase):
    vocab_size: int = 51200
    hidden_size: int = 2048
    intermediate_size: int = 8192
    num_hidden_layers: int = 24
    num_attention_heads: int = 32
    layer_norm_eps: float = 1e-5
    partial_rotary_factor: float = 0.5
    ...

@dataclasses.dataclass
class PhiConfig(ConfigBase):
    model_type: str
    vocab_size: int = 51200
    n_positions: int = 2048
    n_embd: int = 2560
    n_layer: int = 32
    n_inner: int = 0
    n_head: int = 32
    rotary_dim: int = 32
    ...

    @staticmethod
    def from_phi1(config: Phi1Config) -> "PhiConfig": ...

class PhiMLP(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor): ...

class PhiMHA(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class PhiParallelBlock(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class PhiCausalLMHead(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor): ...

class PhiForCausalLM(nn.Module):
    def __init__(self, config: Union[PhiConfig, Phi1Config]): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, input_embeds, paged_kv_cache, logit_positions=None): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...
Import
from mlc_llm.model.phi.phi_model import PhiConfig, Phi1Config, PhiForCausalLM
I/O Contract
| Method | Input | Output | Description |
|---|---|---|---|
| embed | input_ids: Tensor[seq_len] (int32) | Tensor[1, seq_len, n_embd] | Converts token IDs to embeddings |
| prefill | input_embed: Tensor[1, seq_len, n_embd], paged_kv_cache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Full prompt processing; extracts last-token logits |
| decode | input_embed: Tensor[1, 1, n_embd], paged_kv_cache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Single-token autoregressive decoding |
| batch_prefill | input_embeds, logit_positions, paged_kv_cache | (logits, paged_kv_cache) | Batched prefill with selective logit extraction |
| batch_decode | input_embeds: Tensor[batch_size, 1, n_embd], paged_kv_cache | (logits, paged_kv_cache) | Batched single-token decoding |
| batch_verify | input_embeds, paged_kv_cache | (logits, paged_kv_cache) | Batched speculative verification |
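
The shape contract for last-token and selective logit extraction can be illustrated with nested lists standing in for tensors. Both helper names are hypothetical, and the assumption that logit_positions holds indices along the sequence axis is not stated in the table above; treat this as a sketch of the idea rather than the module's implementation.

```python
def last_token(hidden):
    """[1, seq_len, n_embd] -> [1, 1, n_embd]: keep only the final position (prefill)."""
    return [[hidden[0][-1]]]

def take_positions(hidden, logit_positions):
    """[1, total_len, n_embd] -> [1, len(logit_positions), n_embd] (batched prefill).

    Assumes logit_positions indexes the sequence axis.
    """
    return [[hidden[0][p] for p in logit_positions]]

# Toy hidden states: 5 positions, n_embd = 2, position t filled with the value t.
hidden = [[[float(t)] * 2 for t in range(5)]]

single = last_token(hidden)                # shape [1, 1, 2]
selected = take_positions(hidden, [1, 4])  # shape [1, 2, 2]
```

Selecting positions before the vocabulary projection means the LM head runs only on the positions whose logits are needed, which is why prefill returns logits of shape [1, 1, vocab_size] rather than one row per prompt token.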

| Architectural Feature | Details |
|---|---|
| Parallel Block | Attention and MLP are computed in parallel from the same LayerNorm output, then combined with the residual |
| Activation Function | GELU with tanh approximation |
| Bias | Both attention (QKV and output) and MLP (fc1 and fc2) projections include bias terms |
| Rotary Embedding | Partial rotary (default 50% of head_dim for Phi-1/1.5; configurable rotary_dim for Phi-2) |
| Normalization | LayerNorm (not RMSNorm) |
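
Two of the features in the table, the tanh-approximated GELU and partial rotary embedding, can be sketched in plain Python. The GELU formula is the standard tanh approximation. For the rotary part, the function name, the base of 10000, and the half-split pairing of dimensions are assumptions made for illustration; the module may pair dimensions differently.

```python
import math

def gelu_tanh(x):
    """GELU using the standard tanh approximation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def apply_partial_rotary(vec, position, rotary_dim, base=10000.0):
    """Rotate the first rotary_dim entries of one head vector; pass the rest through."""
    half = rotary_dim // 2
    out = list(vec)  # entries at index >= rotary_dim are copied unchanged
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dim)
        a, b = vec[i], vec[i + half]  # half-split pairing (an assumption)
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

# One head with head_dim = 8 where only the first 4 dimensions are rotated.
head = [1.0] * 8
rotated = apply_partial_rotary(head, position=3, rotary_dim=4)
```

Because each pair of dimensions undergoes a pure rotation, the norm of the rotated slice is preserved, and the trailing head_dim - rotary_dim entries carry no positional information.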
Usage Examples
from mlc_llm.model.phi.phi_model import PhiConfig, Phi1Config, PhiForCausalLM

# Using Phi-2 configuration
config = PhiConfig(
    model_type="phi",
    vocab_size=51200,
    n_embd=2560,
    n_layer=32,
    n_head=32,
    rotary_dim=32,
    context_window_size=2048,
)
model = PhiForCausalLM(config)
model.to("float16")  # cast model parameters to half precision

# Using Phi-1.5 configuration (auto-converted to PhiConfig internally)
phi1_config = Phi1Config(
    vocab_size=51200,
    hidden_size=2048,
    num_hidden_layers=24,
    num_attention_heads=32,
    context_window_size=2048,
)
model = PhiForCausalLM(phi1_config)