Implementation:Mlc ai Mlc llm Phi Model

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	Machine Learning, Large Language Models, Model Architecture
Last Updated	2026-02-09 19:00 GMT

Overview

Implements the Microsoft Phi family (Phi-1, Phi-1.5, and Phi-2) transformer-based language model architectures for deployment through the MLC-LLM compilation pipeline using TVM Relax.

Description

This module provides the complete implementation for the Microsoft Phi family of small language models. It covers three model variants through two configuration classes: Phi-1 and Phi-1.5 (via Phi1Config) and Phi-2 (via PhiConfig). A key architectural distinction of the Phi models is the parallel attention-MLP block: the attention and MLP (feed-forward) sub-layers both read the same LayerNorm output and are computed in parallel rather than sequentially, and their outputs are then summed with the residual.
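The parallel block can be illustrated with a minimal pure-Python sketch on a single hidden vector. This is not the module's code: `layer_norm`, `parallel_block`, `sequential_block`, `toy_attn`, and `toy_mlp` are hypothetical names, the learned LayerNorm scale and shift are omitted, and the two sub-layers are stand-in functions chosen only to make the difference visible.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance (no learned scale/shift).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def parallel_block(x, attn, mlp):
    # Phi-style block: one LayerNorm feeds both sub-layers,
    # and both outputs are added to the residual in a single step.
    normed = layer_norm(x)
    return [xi + ai + mi for xi, ai, mi in zip(x, attn(normed), mlp(normed))]

def sequential_block(x, attn, mlp):
    # Conventional pre-norm block, shown for contrast:
    # the MLP sees the output of the attention residual step.
    h = [xi + ai for xi, ai in zip(x, attn(layer_norm(x)))]
    return [hi + mi for hi, mi in zip(h, mlp(layer_norm(h)))]

# Stand-in sub-layers (assumptions for illustration only).
toy_attn = lambda h: [abs(v) for v in h]
toy_mlp = lambda h: [2.0 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
print(parallel_block(x, toy_attn, toy_mlp))
print(sequential_block(x, toy_attn, toy_mlp))
```

Because neither sub-layer depends on the other's output in the parallel form, the two branches can in principle be scheduled independently, and only one normalization is needed per block.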

The module contains the following key classes:

Phi1Config -- Configuration for Phi-1 and Phi-1.5 models, using standard naming conventions (hidden_size, num_attention_heads, etc.). Supports partial rotary factor for RoPE.
PhiConfig -- Configuration for Phi-2, using Microsoft-specific naming conventions (n_embd, n_head, n_layer, etc.). Includes a static method from_phi1 to convert Phi1Config into PhiConfig for unified model handling.
PhiMLP -- A two-layer MLP using GELU activation with tanh approximation, with bias on both layers.
PhiMHA -- Multi-head attention with grouped-query attention support, fused QKV projection with bias, and PagedKVCache integration.
PhiParallelBlock -- The parallel transformer block where LayerNorm is applied once, then attention and MLP are computed in parallel. The outputs are summed with the residual in a single step. Supports tensor-parallel bias sharding.
PhiCausalLMHead -- An LM head with a LayerNorm followed by a linear projection to vocabulary size.
PhiModel -- The transformer backbone with embedding and a stack of parallel blocks.
PhiForCausalLM -- The top-level model providing standard inference methods (embed, prefill, decode, batch operations) and paged KV cache creation with partial rotary dimension support.
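The conversion performed by from_phi1 can be sketched as a field-by-field mapping between the two naming schemes. The sketch below uses stand-in dataclasses (`ToyPhi1Config`, `ToyPhiConfig`) and a hypothetical `toy_from_phi1` rather than the real classes. The mapping is inferred from the field names listed on this page, and deriving rotary_dim as partial_rotary_factor times the per-head dimension is an assumption.

```python
import dataclasses

@dataclasses.dataclass
class ToyPhi1Config:
    # Stand-in for Phi1Config: standard (Hugging Face style) field names.
    vocab_size: int = 51200
    hidden_size: int = 2048
    intermediate_size: int = 8192
    num_hidden_layers: int = 24
    num_attention_heads: int = 32
    partial_rotary_factor: float = 0.5
    context_window_size: int = 2048

@dataclasses.dataclass
class ToyPhiConfig:
    # Stand-in for PhiConfig: Microsoft-style field names.
    vocab_size: int
    n_positions: int
    n_embd: int
    n_layer: int
    n_inner: int
    n_head: int
    rotary_dim: int

def toy_from_phi1(cfg: ToyPhi1Config) -> ToyPhiConfig:
    # Assumed rule: rotary_dim is the rotated fraction of each head's dimension.
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    return ToyPhiConfig(
        vocab_size=cfg.vocab_size,
        n_positions=cfg.context_window_size,
        n_embd=cfg.hidden_size,
        n_layer=cfg.num_hidden_layers,
        n_inner=cfg.intermediate_size,
        n_head=cfg.num_attention_heads,
        rotary_dim=int(cfg.partial_rotary_factor * head_dim),
    )

converted = toy_from_phi1(ToyPhi1Config())
print(converted)
```

With the defaults above, the head dimension is 2048 // 32 = 64, so a partial rotary factor of 0.5 yields a rotary_dim of 32 under this assumed rule.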

Usage

Use this module when compiling and deploying Phi-1, Phi-1.5, or Phi-2 models through MLC-LLM. PhiForCausalLM accepts either a Phi1Config or a PhiConfig; a Phi1Config is converted to a PhiConfig internally, so the rest of the model code handles a single configuration format. The module supports tensor parallelism and a paged KV cache for efficient inference.

Code Reference

Source Location

Repository: Mlc_ai_Mlc_llm
File: python/mlc_llm/model/phi/phi_model.py

Signature

@dataclasses.dataclass
class Phi1Config(ConfigBase):
    vocab_size: int = 51200
    hidden_size: int = 2048
    intermediate_size: int = 8192
    num_hidden_layers: int = 24
    num_attention_heads: int = 32
    layer_norm_eps: float = 1e-5
    partial_rotary_factor: float = 0.5
    ...

@dataclasses.dataclass
class PhiConfig(ConfigBase):
    model_type: str
    vocab_size: int = 51200
    n_positions: int = 2048
    n_embd: int = 2560
    n_layer: int = 32
    n_inner: int = 0
    n_head: int = 32
    rotary_dim: int = 32
    ...
    @staticmethod
    def from_phi1(config: Phi1Config) -> "PhiConfig": ...

class PhiMLP(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor): ...

class PhiMHA(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class PhiParallelBlock(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class PhiCausalLMHead(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor): ...

class PhiForCausalLM(nn.Module):
    def __init__(self, config: Union[PhiConfig, Phi1Config]): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, input_embeds, paged_kv_cache, logit_positions=None): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.phi.phi_model import PhiConfig, Phi1Config, PhiForCausalLM

I/O Contract

Method	Input	Output	Description
embed	input_ids: Tensor[seq_len] (int32)	Tensor[1, seq_len, n_embd]	Converts token IDs to embeddings
prefill	input_embed: Tensor[1, seq_len, n_embd], paged_kv_cache	(logits: Tensor[1, 1, vocab_size], paged_kv_cache)	Full prompt processing; extracts last-token logits
decode	input_embed: Tensor[1, 1, n_embd], paged_kv_cache	(logits: Tensor[1, 1, vocab_size], paged_kv_cache)	Single-token autoregressive decoding
batch_prefill	input_embeds, logit_positions, paged_kv_cache	(logits, paged_kv_cache)	Batched prefill with selective logit extraction
batch_decode	input_embeds: Tensor[batch_size, 1, n_embd], paged_kv_cache	(logits, paged_kv_cache)	Batched single-token decoding
batch_verify	input_embeds, paged_kv_cache	(logits, paged_kv_cache)	Batched speculative verification
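The shape changes in the table can be traced with nested Python lists. This is a sketch of the indexing only, not of the model: `shape`, `take_last_token`, and `take_positions` are hypothetical helpers, and the assumption that batch_prefill receives all sequences concatenated along axis 1 (with logit_positions indexing into that axis) is not stated on this page.

```python
def shape(t):
    # Report the nesting depth of a list-of-lists as a shape tuple.
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

def take_last_token(hidden):
    # prefill-style selection: [1, seq_len, n] -> [1, 1, n]
    return [[seq[-1]] for seq in hidden]

def take_positions(hidden, logit_positions):
    # batch_prefill-style selection (assumed layout):
    # [1, total_len, n] -> [1, len(logit_positions), n]
    return [[seq[p] for p in logit_positions] for seq in hidden]

seq_len, n_embd = 5, 4
# Fake hidden states with distinct values so the selected rows are recognizable.
hidden = [[[float(t * n_embd + d) for d in range(n_embd)]
           for t in range(seq_len)]]

last = take_last_token(hidden)
picked = take_positions(hidden, [1, 4])

print(shape(hidden), shape(last), shape(picked))
```

Selecting only the positions whose logits are needed keeps the vocabulary projection small: for a prompt of seq_len tokens, a single row is projected to vocab_size instead of seq_len rows.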

Architectural Feature	Details
Parallel Block	Attention and MLP are computed in parallel from the same LayerNorm output, then combined with the residual
Activation Function	GELU with tanh approximation
Bias	Both attention (QKV and output) and MLP (fc1 and fc2) projections include bias terms
Rotary Embedding	Partial rotary (default 50% of head_dim for Phi-1/1.5; configurable rotary_dim for Phi-2)
Normalization	LayerNorm (not RMSNorm)
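Two rows of the table are easy to pin down numerically. The sketch below shows the standard tanh approximation of GELU and a partial rotary embedding that rotates only the first rotary_dim entries of one head vector. The function names are hypothetical, and the rotate-half pairing (entry i paired with entry i + rotary_dim / 2) and the base of 10000 are assumptions; the module's actual RoPE layout is handled inside the KV cache and may differ.

```python
import math

def gelu_tanh(x):
    # GELU, tanh approximation:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def partial_rope(vec, position, rotary_dim, base=10000.0):
    # Rotate the first `rotary_dim` entries of one head vector;
    # entries from `rotary_dim` onward pass through unchanged.
    half = rotary_dim // 2
    out = list(vec)
    for i in range(half):
        angle = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # head_dim = 8
rotated = partial_rope(head, position=3, rotary_dim=4)
print(rotated[:4])   # rotated part
print(rotated[4:])   # pass-through part, identical to head[4:]
print(gelu_tanh(1.0))
```

Since each pair is transformed by a plane rotation, the norm of the rotated slice is preserved, and position 0 leaves the vector unchanged.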

Usage Examples

from mlc_llm.model.phi.phi_model import PhiConfig, Phi1Config, PhiForCausalLM

# Using Phi-2 configuration
config = PhiConfig(
    model_type="phi",
    vocab_size=51200,
    n_embd=2560,
    n_layer=32,
    n_head=32,
    rotary_dim=32,
    context_window_size=2048,
)
model = PhiForCausalLM(config)
model.to("float16")

# Using Phi-1.5 configuration (auto-converted to PhiConfig internally)
phi1_config = Phi1Config(
    vocab_size=51200,
    hidden_size=2048,
    num_hidden_layers=24,
    num_attention_heads=32,
    context_window_size=2048,
)
model = PhiForCausalLM(phi1_config)
