Implementation:Mlc ai Mlc llm Phi Model

From Leeroopedia
Revision as of 15:51, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Mlc_ai_Mlc_llm_Phi_Model.md)


Knowledge Sources
Domains: Machine Learning, Large Language Models, Model Architecture
Last Updated: 2026-02-09 19:00 GMT

Overview

Implements the Microsoft Phi family (Phi-1, Phi-1.5, and Phi-2) transformer-based language model architectures for deployment through the MLC-LLM compilation pipeline using TVM Relax.

Description

This module implements the Microsoft Phi family of small language models through two configuration classes: Phi1Config for Phi-1 and Phi-1.5, and PhiConfig for Phi-2. The defining architectural feature of the Phi models is the parallel attention-MLP block: the attention and MLP (feed-forward) sub-layers both read the same normalized input and are computed side by side rather than one after the other, and their outputs are then summed together with the residual.
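
The difference between the parallel and the sequential layout can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the module's code: `layer_norm` omits the learned scale and shift, and `toy_attn` and `toy_mlp` are hypothetical stand-ins for the real attention and MLP sub-layers.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """LayerNorm over one vector, without learned scale/shift (illustration only)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def parallel_block(x: Vector,
                   attn: Callable[[Vector], Vector],
                   mlp: Callable[[Vector], Vector]) -> Vector:
    """Phi-style block: one LayerNorm feeds both branches, summed with the residual."""
    normed = layer_norm(x)
    attn_out = attn(normed)
    mlp_out = mlp(normed)  # reads the same normalized input as attention
    return [a + m + r for a, m, r in zip(attn_out, mlp_out, x)]


def sequential_block(x: Vector,
                     attn: Callable[[Vector], Vector],
                     mlp: Callable[[Vector], Vector]) -> Vector:
    """Conventional pre-norm block, for contrast: the MLP sees the attention result."""
    h = [a + r for a, r in zip(attn(layer_norm(x)), x)]
    return [m + r for m, r in zip(mlp(layer_norm(h)), h)]


def toy_attn(v: Vector) -> Vector:
    """Hypothetical stand-in for the attention sub-layer."""
    return [abs(e) for e in v]


def toy_mlp(v: Vector) -> Vector:
    """Hypothetical stand-in for the MLP sub-layer."""
    return [e * e for e in v]


x = [1.0, 2.0, 3.0, 4.0]
print("parallel:  ", parallel_block(x, toy_attn, toy_mlp))
print("sequential:", sequential_block(x, toy_attn, toy_mlp))
```

Because the two branches in `parallel_block` do not depend on each other, they can be scheduled independently, and only one normalization is needed per block.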

The module contains the following key classes:

  • Phi1Config -- Configuration for Phi-1 and Phi-1.5 models, using standard naming conventions (hidden_size, num_attention_heads, etc.). Supports partial rotary factor for RoPE.
  • PhiConfig -- Configuration for Phi-2, using Microsoft-specific naming conventions (n_embd, n_head, n_layer, etc.). Includes a static method from_phi1 to convert Phi1Config into PhiConfig for unified model handling.
  • PhiMLP -- A two-layer MLP using GELU activation with tanh approximation, with bias on both layers.
  • PhiMHA -- Multi-head attention with grouped-query attention support, fused QKV projection with bias, and PagedKVCache integration.
  • PhiParallelBlock -- The parallel transformer block where LayerNorm is applied once, then attention and MLP are computed in parallel. The outputs are summed with the residual in a single step. Supports tensor-parallel bias sharding.
  • PhiCausalLMHead -- An LM head with a LayerNorm followed by a linear projection to vocabulary size.
  • PhiModel -- The transformer backbone with embedding and a stack of parallel blocks.
  • PhiForCausalLM -- The top-level model providing standard inference methods (embed, prefill, decode, batch operations) and paged KV cache creation with partial rotary dimension support.
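
The PhiMLP entry above (two linear layers with bias and a tanh-approximated GELU between them) can be sketched with the standard library alone. The helper names `gelu_tanh`, `linear`, and `mlp_forward` are hypothetical and chosen for this illustration; the real module builds the layers with TVM Relax `nn` primitives.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def gelu_tanh(x: float) -> float:
    """GELU using the tanh approximation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x: float) -> float:
    """Exact GELU via the error function, shown only for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """y = W x + b, with one weight row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_forward(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Two-layer MLP: expand, apply GELU (tanh approximation), project back."""
    hidden = [gelu_tanh(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


# Toy sizes: 2 input features, 3 hidden features.
w1 = [[0.5, -0.25], [1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.1, -0.1]
w2 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
b2 = [0.0, 0.0]
print(mlp_forward([1.0, 2.0], w1, b1, w2, b2))
print(abs(gelu_tanh(1.0) - gelu_exact(1.0)))  # the approximation error is small
```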

Usage

Use this module when compiling and deploying Phi-1, Phi-1.5, or Phi-2 models through MLC-LLM. The module automatically handles both Phi1Config and PhiConfig formats, converting Phi1Config to PhiConfig internally. It supports tensor parallelism and paged KV cache for efficient inference.
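
To make the internal conversion concrete, the sketch below maps Phi-1-style field names onto Phi-2-style ones. It is an assumption-laden illustration rather than the body of `from_phi1`: the dataclasses `Phi1Fields` and `PhiFields` and the function `convert_phi1_fields` are hypothetical, and the rule `rotary_dim = partial_rotary_factor * head_dim` is inferred from the partial-rotary description on this page.

```python
from dataclasses import dataclass


@dataclass
class Phi1Fields:
    """Hypothetical subset of Phi1Config, using the defaults listed on this page."""
    vocab_size: int = 51200
    hidden_size: int = 2048
    intermediate_size: int = 8192
    num_hidden_layers: int = 24
    num_attention_heads: int = 32
    partial_rotary_factor: float = 0.5


@dataclass
class PhiFields:
    """Hypothetical subset of PhiConfig, using the Phi-2 naming convention."""
    vocab_size: int
    n_embd: int
    n_inner: int
    n_layer: int
    n_head: int
    rotary_dim: int


def convert_phi1_fields(cfg: Phi1Fields) -> PhiFields:
    """Rename Phi-1 fields to Phi-2 names (assumed mapping, for illustration)."""
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    return PhiFields(
        vocab_size=cfg.vocab_size,
        n_embd=cfg.hidden_size,
        n_inner=cfg.intermediate_size,
        n_layer=cfg.num_hidden_layers,
        n_head=cfg.num_attention_heads,
        # Assumption: the rotary dimension is the rotary fraction of each head.
        rotary_dim=int(cfg.partial_rotary_factor * head_dim),
    )


converted = convert_phi1_fields(Phi1Fields())
print(converted)
```

With the defaults above, each head has 2048 / 32 = 64 dimensions, so a factor of 0.5 yields a rotary dimension of 32.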

Code Reference

Source Location

Signature

@dataclasses.dataclass
class Phi1Config(ConfigBase):
    vocab_size: int = 51200
    hidden_size: int = 2048
    intermediate_size: int = 8192
    num_hidden_layers: int = 24
    num_attention_heads: int = 32
    layer_norm_eps: float = 1e-5
    partial_rotary_factor: float = 0.5
    ...

@dataclasses.dataclass
class PhiConfig(ConfigBase):
    model_type: str
    vocab_size: int = 51200
    n_positions: int = 2048
    n_embd: int = 2560
    n_layer: int = 32
    n_inner: int = 0
    n_head: int = 32
    rotary_dim: int = 32
    ...
    @staticmethod
    def from_phi1(config: Phi1Config) -> "PhiConfig": ...

class PhiMLP(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor): ...

class PhiMHA(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class PhiParallelBlock(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class PhiCausalLMHead(nn.Module):
    def __init__(self, config: PhiConfig): ...
    def forward(self, hidden_states: Tensor): ...

class PhiForCausalLM(nn.Module):
    def __init__(self, config: Union[PhiConfig, Phi1Config]): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, input_embeds, paged_kv_cache, logit_positions=None): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.phi.phi_model import PhiConfig, Phi1Config, PhiForCausalLM

I/O Contract

| Method | Input | Output | Description |
|---|---|---|---|
| embed | input_ids: Tensor[seq_len] (int32) | Tensor[1, seq_len, n_embd] | Converts token IDs to embeddings |
| prefill | input_embed: Tensor[1, seq_len, n_embd], paged_kv_cache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Full prompt processing; extracts last-token logits |
| decode | input_embed: Tensor[1, 1, n_embd], paged_kv_cache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Single-token autoregressive decoding |
| batch_prefill | input_embeds, logit_positions, paged_kv_cache | (logits, paged_kv_cache) | Batched prefill with selective logit extraction |
| batch_decode | input_embeds: Tensor[batch_size, 1, n_embd], paged_kv_cache | (logits, paged_kv_cache) | Batched single-token decoding |
| batch_verify | input_embeds, paged_kv_cache | (logits, paged_kv_cache) | Batched speculative verification |
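
The prefill row states that only last-token logits are returned, which is why the output has sequence length 1 regardless of the prompt length. A minimal shape-level sketch with nested lists follows; `shape`, `take_last_token`, and `project_to_vocab` are hypothetical helpers, and the LayerNorm that PhiCausalLMHead applies before the projection is omitted for brevity.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # [batch, seq_len, features]


def shape(t) -> tuple:
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def take_last_token(hidden: Tensor3) -> Tensor3:
    """Keep only the final position of each sequence: [b, s, d] -> [b, 1, d]."""
    return [[seq[-1]] for seq in hidden]


def project_to_vocab(hidden: Tensor3, weight: List[List[float]]) -> Tensor3:
    """Linear projection to vocabulary size; weight has one row per vocab entry."""
    return [[[sum(w * h for w, h in zip(row, tok)) for row in weight]
             for tok in seq] for seq in hidden]


seq_len, n_embd, vocab_size = 3, 4, 5  # toy sizes
hidden = [[[float(t + d) for d in range(n_embd)] for t in range(seq_len)]]
weight = [[0.1 * (v + 1)] * n_embd for v in range(vocab_size)]

logits = project_to_vocab(take_last_token(hidden), weight)
print(shape(hidden), "->", shape(logits))
```

Slicing before the projection means the vocabulary-sized matrix multiply runs once per sequence instead of once per prompt token.
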

| Architectural Feature | Details |
|---|---|
| Parallel Block | Attention and MLP are computed in parallel from the same LayerNorm output, then combined with the residual |
| Activation Function | GELU with tanh approximation |
| Bias | Both attention (QKV and output) and MLP (fc1 and fc2) projections include bias terms |
| Rotary Embedding | Partial rotary (default 50% of head_dim for Phi-1/1.5; configurable rotary_dim for Phi-2) |
| Normalization | LayerNorm (not RMSNorm) |
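
Partial rotary embedding means that only the first `rotary_dim` entries of each head vector are rotated by position, while the remaining entries pass through unchanged. The sketch below illustrates this on a single head vector. It assumes the common "rotate-half" pairing (entry `i` is paired with entry `i + rotary_dim / 2`) and a base of 10000; neither detail is confirmed by this page, and `apply_partial_rotary` is a hypothetical name.

```python
import math
from typing import List


def apply_partial_rotary(vec: List[float], position: int, rotary_dim: int,
                         theta: float = 10000.0) -> List[float]:
    """Rotate the first rotary_dim entries of vec; leave the tail untouched."""
    half = rotary_dim // 2
    out = list(vec)
    for i in range(half):
        # Each pair (i, i + half) is rotated by its own position-dependent angle.
        angle = position * theta ** (-2.0 * i / rotary_dim)
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * cos - b * sin
        out[i + half] = a * sin + b * cos
    return out


head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # head_dim = 8
rotated = apply_partial_rotary(head, position=3, rotary_dim=4)
print(rotated[:4])  # rotated part
print(rotated[4:])  # pass-through part, identical to head[4:]
```

Each pair undergoes a pure rotation, so the vector norm is preserved, and position 0 leaves the vector unchanged.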

Usage Examples

from mlc_llm.model.phi.phi_model import PhiConfig, Phi1Config, PhiForCausalLM

# Using Phi-2 configuration
config = PhiConfig(
    model_type="phi",
    vocab_size=51200,
    n_embd=2560,
    n_layer=32,
    n_head=32,
    rotary_dim=32,
    context_window_size=2048,
)
model = PhiForCausalLM(config)
model.to("float16")

# Using Phi-1.5 configuration (auto-converted to PhiConfig internally)
phi1_config = Phi1Config(
    vocab_size=51200,
    hidden_size=2048,
    num_hidden_layers=24,
    num_attention_heads=32,
    context_window_size=2048,
)
model = PhiForCausalLM(phi1_config)
