Implementation:OpenGVLab InternVL Phi3 Model
| Knowledge Sources | |
|---|---|
| Domains | Language Model, Transformer Architecture, Causal LM |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Full PyTorch implementation of the Phi-3 transformer model providing causal language modeling, sequence classification, and token classification heads, with support for eager, Flash Attention 2, and SDPA attention backends plus SU and YaRN RoPE scaling for extended context lengths.
Description
This module implements the complete Phi-3 decoder-only transformer architecture, adapted from HuggingFace's LLaMA/Mistral implementations, with the following Phi-3-specific design choices:
Normalization: Uses Phi3RMSNorm (equivalent to T5LayerNorm) for pre-layer normalization in each decoder layer.
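The normalization step is small enough to sketch in plain Python. This is an illustrative version for a single hidden vector, not the module's tensor code; the helper name `rms_norm` and the default `eps` are assumptions made for the example.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Illustrative RMSNorm over one hidden vector (hypothetical helper)."""
    # Mean of squares over the hidden dimension; no mean subtraction, unlike LayerNorm.
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Rescale each channel, then apply the learned per-channel weight.
    return [w * v * inv_rms for v, w in zip(x, weight)]

normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0], eps=0.0)
```

With unit weights and `eps=0.0`, the output vector has a root mean square of 1, which is the property the layer relies on.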
Positional Encoding: Implements three Rotary Position Embedding (RoPE) variants: base (Phi3RotaryEmbedding); SU scaling (Phi3SuScaledRotaryEmbedding), which applies short or long frequency factors together with a logarithm-based scaling factor; and YaRN scaling (Phi3YarnScaledRotaryEmbedding), whose scaling factor grows linearly with the logarithm of the context-extension ratio. The scaled variants extend the context window beyond the original training length.
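The two scaling factors can be sketched as below. The formulas are written from memory of the upstream HuggingFace Phi-3 code that this file is adapted from, so treat the constants (the `0.1` slope in particular) as assumptions to verify against modeling_phi3.py. All function names are hypothetical.

```python
import math

def su_scaling_factor(max_pos, original_max_pos):
    """SU variant (assumed form): square root of a log-ratio term."""
    scale = max_pos / original_max_pos
    if scale <= 1.0:
        return 1.0  # no context extension, so no rescaling
    return math.sqrt(1.0 + math.log(scale) / math.log(original_max_pos))

def yarn_scaling_factor(max_pos, original_max_pos):
    """YaRN variant (assumed form): linear in the log of the extension ratio."""
    scale = max_pos / original_max_pos
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

def scaled_inv_freq(dim, base, ext_factors):
    """One inverse frequency per channel pair, divided by its rescale factor."""
    return [1.0 / (f * base ** (2 * i / dim)) for i, f in enumerate(ext_factors)]

su = su_scaling_factor(131072, 4096)
yarn = yarn_scaling_factor(131072, 4096)
freqs = scaled_inv_freq(dim=4, base=10000.0, ext_factors=[1.0, 2.0])
```

In this sketch `ext_factors` stands for whichever of the short or long factor lists applies to the current sequence length, and the scaling factor would multiply the resulting cosine and sine tables.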
MLP: A gated SiLU feedforward network (Phi3MLP) using a fused gate-up projection whose output is twice the intermediate size; this output is split into a gate half and an up-projection half.
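The split-and-gate step can be illustrated in plain Python. This sketch covers only what happens after the fused projection and before the down projection; it assumes the gate occupies the first half of the fused output, and the helper names are hypothetical.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gated_activation(fused):
    """Split a fused gate-up output in half and gate the up path (illustrative)."""
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]
    # Each up value is modulated by the SiLU of its matching gate value.
    return [u * silu(g) for g, u in zip(gate, up)]

closed = gated_activation([0.0, 0.0, 5.0, 7.0])  # zero gates suppress the up path
mixed = gated_activation([1.0, 2.0])
```

Fusing the two projections into one matrix multiply saves a kernel launch; the result is the same as running separate gate and up projections.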
Attention Backends: Three implementations registered in PHI3_ATTENTION_CLASSES: eager (Phi3Attention) for standard softmax attention, Flash Attention 2 (Phi3FlashAttention2) with sliding window attention support and variable-length handling, and SDPA (Phi3SdpaAttention) using PyTorch's native scaled dot-product attention.
Decoder Layer: Each Phi3DecoderLayer applies, in order, input layer normalization, self-attention, and residual dropout followed by a residual add; then post-attention layer normalization, the MLP, and a second residual dropout and residual add.
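The pre-norm residual wiring can be sketched with plain callables standing in for the sublayers. Dropout is omitted here, which matches inference, where it is the identity. The function and parameter names are hypothetical.

```python
def decoder_layer(hidden, input_norm, self_attn, post_attn_norm, mlp):
    """Illustrative pre-norm decoder layer over one hidden vector."""
    # Attention sub-block: normalize, attend, add back the unnormalized input.
    residual = hidden
    attn_out = self_attn(input_norm(hidden))
    hidden = [r + a for r, a in zip(residual, attn_out)]
    # MLP sub-block: same pre-norm pattern around the feedforward network.
    residual = hidden
    mlp_out = mlp(post_attn_norm(hidden))
    return [r + m for r, m in zip(residual, mlp_out)]

identity = lambda v: v
out = decoder_layer([1.0, -2.0], identity, identity, identity, identity)
```

With identity sublayers each residual add doubles the input, so the example makes the two skip connections visible in the result.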
Model Classes: Four variants: Phi3Model (the base transformer, which uses DynamicCache for KV caching), Phi3ForCausalLM (adds a language-modeling head and computes cross-entropy loss on labels shifted by one position), Phi3ForSequenceClassification, and Phi3ForTokenClassification.
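The shifted loss pairs the logits at position t with the label at position t+1 and skips labels equal to -100. A minimal plain-Python sketch for one sequence follows; the helper name `shifted_cross_entropy` is hypothetical, and the real module operates on batched tensors.

```python
import math

IGNORE_INDEX = -100

def shifted_cross_entropy(logits, labels):
    """Mean next-token cross-entropy for one sequence (illustrative).

    logits: list of per-position score lists; labels: list of token ids.
    """
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first token is never a prediction target
    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else float("nan")

loss = shifted_cross_entropy([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0, 1, IGNORE_INDEX])
```

Here only one shifted position contributes, and with uniform logits over two tokens its loss is ln 2.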
Usage
Use this module as an alternative language model backbone within InternVL. The Phi3ForCausalLM class is instantiated via from_pretrained and processes combined visual and textual embeddings. The model automatically falls back to eager attention when flash-attn is unavailable.
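A caller who prefers to choose the backend explicitly, rather than rely on the automatic fallback, could probe for the flash_attn package first. The helper `pick_attn_implementation` is hypothetical. The returned strings are the backend names used by the `attn_implementation` argument of `from_pretrained` in recent transformers releases, and it is an assumption that this class accepts the argument in the same way.

```python
import importlib.util

def pick_attn_implementation(prefer_flash=True):
    """Return a backend name, using Flash Attention 2 only if flash_attn is installed."""
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"

backend = pick_attn_implementation()
# Assumed usage: Phi3ForCausalLM.from_pretrained(path, attn_implementation=backend)
```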
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat/internvl/model/phi3/modeling_phi3.py
- Lines: 1-1610
Signature
class Phi3Model(Phi3PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, use_cache=None,
                output_attentions=None, output_hidden_states=None,
                return_dict=None) -> Union[Tuple, BaseModelOutputWithPast]:
        ...

class Phi3ForCausalLM(Phi3PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, labels=None,
                use_cache=None, output_attentions=None,
                output_hidden_states=None, return_dict=None
                ) -> Union[Tuple, CausalLMOutputWithPast]:
        ...

class Phi3ForSequenceClassification(Phi3PreTrainedModel):
    ...

class Phi3ForTokenClassification(Phi3PreTrainedModel):
    ...
Import
from internvl.model.phi3.modeling_phi3 import Phi3ForCausalLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.LongTensor | No | Token indices of shape (batch_size, seq_length) |
| attention_mask | torch.Tensor | No | Attention mask; 1 for non-masked, 0 for masked |
| position_ids | torch.LongTensor | No | Position indices for RoPE; auto-generated if not provided |
| past_key_values | Cache or Tuple | No | Cached key-value states; supports both Cache and legacy tuple format |
| inputs_embeds | torch.FloatTensor | No | Pre-computed embeddings from multimodal pipeline |
| labels | torch.LongTensor | No | Labels for loss computation; -100 for ignored positions |
| use_cache | bool | No | Return cached key-value states for autoregressive decoding |
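When position_ids is not provided, positions for the new tokens continue from the length of the cached prefix. The sketch below shows that rule in plain Python, assuming the behavior matches upstream HuggingFace decoder models; `default_position_ids` is a hypothetical helper, not a function in the module.

```python
def default_position_ids(seq_length, past_length=0):
    """Positions for the new tokens, continuing after any cached prefix."""
    return list(range(past_length, past_length + seq_length))

prefill = default_position_ids(4)                # prompt of 4 tokens, empty cache
decode = default_position_ids(1, past_length=4)  # one new token after the prompt
```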
Outputs
| Name | Type | Description |
|---|---|---|
| loss | torch.FloatTensor | Cross-entropy loss when labels provided |
| logits | torch.FloatTensor | Token prediction logits of shape (batch_size, seq_length, vocab_size) |
| past_key_values | Cache or Tuple | Cached key-value states for next step |
| hidden_states | Tuple[torch.FloatTensor] | All layer hidden states (optional) |
| attentions | Tuple[torch.FloatTensor] | All layer attention weights (optional) |
Usage Examples
Basic Usage
from transformers import AutoTokenizer
from internvl.model.phi3.modeling_phi3 import Phi3ForCausalLM

# Load the pretrained weights and the matching tokenizer.
model = Phi3ForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Tokenize the prompt into PyTorch tensors.
prompt = "This is an example script."
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 30 tokens in total (prompt included) and decode the first sequence.
generate_ids = model.generate(inputs.input_ids, max_length=30)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]