Implementation:OpenGVLab InternVL Phi3 Model

Knowledge Sources	OpenGVLab_InternVL
Domains	Language Model, Transformer Architecture, Causal LM
Last Updated	2026-02-07 14:00 GMT

Overview

Full PyTorch implementation of the Phi-3 transformer model providing causal language modeling, sequence classification, and token classification heads, with support for eager, Flash Attention 2, and SDPA attention backends plus SU and YaRN RoPE scaling for extended context lengths.

Description

This module implements the complete Phi-3 decoder-only transformer architecture adapted from HuggingFace's LLaMA/Mistral implementations with Phi-3-specific design choices:

Normalization: Uses Phi3RMSNorm (equivalent to T5LayerNorm) for pre-layer normalization in each decoder layer.
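
As an illustration of what RMS normalization computes, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). The function name rms_norm and the eps default are assumptions made for this example, not names taken from the module.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide a vector by its root mean square, then apply a learned per-element weight."""
    mean_sq = sum(v * v for v in x) / len(x)   # mean of squares; no mean subtraction
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)   # eps guards against division by zero
    return [w * v * inv_rms for v, w in zip(x, weight)]

# After normalization with unit weights, the mean square of the output is close to 1.
normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
```

Unlike standard LayerNorm, this form has no mean-centering step and no bias term, which is why it is described as equivalent to T5LayerNorm.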

Positional Encoding: Implements three Rotary Position Embedding (RoPE) variants: the base Phi3RotaryEmbedding; SU scaling (Phi3SuScaledRotaryEmbedding), which applies short or long per-dimension frequency factors together with a logarithmic scaling factor; and YaRN scaling (Phi3YarnScaledRotaryEmbedding), whose scaling factor is linear in the logarithm of the context extension ratio. The scaled variants extend the context window beyond the original training length.
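
The sketch below shows one way the two scaling factors and the rescaled inverse frequencies could be computed. The exact formulas are an assumption based on the forms commonly used for SU and YaRN scaling in Phi-3 style code and should be checked against modeling_phi3.py; the function names are hypothetical.

```python
import math

def su_scaling_factor(max_pos, original_max_pos):
    """Assumed SU form: sqrt(1 + log(scale) / log(original_max_pos)) when extending."""
    scale = max_pos / original_max_pos
    if scale <= 1.0:
        return 1.0
    return math.sqrt(1.0 + math.log(scale) / math.log(original_max_pos))

def yarn_scaling_factor(max_pos, original_max_pos):
    """Assumed YaRN form: 0.1 * log(scale) + 1, i.e. linear in log(scale)."""
    scale = max_pos / original_max_pos
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0

def inv_freq(dim, base=10000.0, ext_factors=None):
    """RoPE inverse frequencies, optionally divided by per-dimension rescale factors."""
    ext_factors = ext_factors or [1.0] * (dim // 2)
    return [1.0 / (f * base ** (2 * i / dim)) for i, f in enumerate(ext_factors)]

# Extending a 4096-token model to 131072 tokens gives a scale of 32.
su = su_scaling_factor(131072, 4096)
yarn = yarn_scaling_factor(131072, 4096)
```

In this sketch the short or long factors would be passed as ext_factors, and the scaling factor would multiply the resulting cosine and sine tables.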

MLP: A gated SiLU feedforward network (Phi3MLP) using a fused gate-up projection that produces a 2x intermediate-sized output, which is split into gate and up-projection paths.
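
To make the fused gate-up split concrete, here is a small plain-Python sketch on a single vector. The helper names (silu, matvec, gated_mlp) and the toy weights are illustrative assumptions; the real module operates on batched tensors with learned linear layers.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def gated_mlp(x, gate_up_weight, down_weight):
    fused = matvec(gate_up_weight, x)        # one projection, length 2 * intermediate
    half = len(fused) // 2
    gate, up = fused[:half], fused[half:]    # first half gates, second half is the up path
    hidden = [u * silu(g) for g, u in zip(gate, up)]
    return matvec(down_weight, hidden)       # project back to the hidden size

# Toy weights: hidden size 2, intermediate size 2.
gate_up_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
down_weight = [[1.0, 0.0], [0.0, 1.0]]
out = gated_mlp([1.0, 2.0], gate_up_weight, down_weight)
```

Fusing the gate and up projections into one matrix replaces two matrix multiplications with a single larger one, at the cost of a split afterwards.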

Attention Backends: Three implementations registered in PHI3_ATTENTION_CLASSES: eager (Phi3Attention) for standard softmax attention, Flash Attention 2 (Phi3FlashAttention2) with sliding window attention support and variable-length handling, and SDPA (Phi3SdpaAttention) using PyTorch's native scaled dot-product attention.
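
A registry of this kind is typically a dictionary from an implementation name to a class. The sketch below uses placeholder classes and a hypothetical build_attention helper to show the lookup pattern; the key strings follow the usual HuggingFace naming and are assumed here rather than confirmed from the file.

```python
class EagerAttention:
    """Placeholder for the standard softmax attention class."""

class FlashAttention2:
    """Placeholder for the Flash Attention 2 class."""

class SdpaAttention:
    """Placeholder for the PyTorch SDPA class."""

# Hypothetical registry mirroring the shape of PHI3_ATTENTION_CLASSES.
ATTENTION_CLASSES = {
    "eager": EagerAttention,
    "flash_attention_2": FlashAttention2,
    "sdpa": SdpaAttention,
}

def build_attention(attn_implementation):
    """Instantiate the attention class registered under the given name."""
    try:
        cls = ATTENTION_CLASSES[attn_implementation]
    except KeyError:
        known = ", ".join(sorted(ATTENTION_CLASSES))
        raise ValueError(f"unknown attention backend {attn_implementation!r}; expected one of: {known}") from None
    return cls()

attn = build_attention("sdpa")
```

Keeping the choice in a lookup table means each decoder layer can select its backend from the config without conditional branches in the layer code.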

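Decoder Layer: Each Phi3DecoderLayer is a pre-norm block. It applies input layer normalization and self-attention, passes the attention output through residual dropout, and adds it back to the layer input; it then applies post-attention layer normalization and the MLP, again followed by residual dropout and a residual connection.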
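
The order of operations in such a pre-norm layer can be sketched with plain callables. The function decoder_layer and its parameters are hypothetical stand-ins for the real submodules, and dropout defaults to the identity, as it would at inference time.

```python
def decoder_layer(x, norm1, attn, norm2, mlp, dropout=lambda v: v):
    """Pre-norm transformer block: normalize, transform, dropout, add residual (twice)."""
    residual = x
    h = attn(norm1(x))                                    # attention sees the normalized input
    x = [r + d for r, d in zip(residual, dropout(h))]     # first residual connection

    residual = x
    h = mlp(norm2(x))                                     # MLP sees the re-normalized stream
    return [r + d for r, d in zip(residual, dropout(h))]  # second residual connection

# Toy stand-ins: identity norms, an "attention" that doubles, an "MLP" that returns ones.
identity = lambda v: v
out = decoder_layer(
    [1.0, 2.0],
    norm1=identity,
    attn=lambda v: [2.0 * a for a in v],
    norm2=identity,
    mlp=lambda v: [1.0 for _ in v],
)
```

Because normalization happens inside each branch, the residual stream itself is never normalized between layers.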

Model Classes: Four classes are provided: Phi3Model (the base transformer, which uses DynamicCache for KV caching) and three task models built on it: Phi3ForCausalLM (language model head with shifted cross-entropy loss), Phi3ForSequenceClassification, and Phi3ForTokenClassification.
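
"Shifted" cross-entropy means the logits at position t are scored against the label at position t + 1. The following plain-Python sketch for a single sequence illustrates the idea; the function name is hypothetical, and the real implementation works on batched tensors.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss

def shifted_cross_entropy(logits, labels):
    """logits: one list of scores per position; labels: one target id per position."""
    total, count = 0.0, 0
    # Drop the last logit and the first label so position t predicts token t + 1.
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else float("nan")

# Three positions, vocabulary of two tokens, uniform logits; the last target is ignored.
loss = shifted_cross_entropy([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0, 1, IGNORE_INDEX])
```

With uniform logits over two tokens, each scored position contributes log 2, so the averaged loss is also log 2.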

Usage

Use this module as an alternative language model backbone within InternVL. The Phi3ForCausalLM class is instantiated via from_pretrained and processes combined visual and textual embeddings. The model automatically falls back to eager attention when flash-attn is unavailable.
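
One way such a fallback could be implemented is to check whether the flash_attn package can be located before honoring a request for it. This is a sketch under assumptions: the helper name and its has_flash_attn parameter are hypothetical, and the source may detect availability differently or apply the fallback more broadly than this example does.

```python
import importlib.util

def resolve_attn_implementation(requested, has_flash_attn=None):
    """Return the backend to use, downgrading flash_attention_2 when flash-attn is missing."""
    if has_flash_attn is None:
        # find_spec returns None when the package cannot be located.
        has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if requested == "flash_attention_2" and not has_flash_attn:
        return "eager"
    return requested

# Simulate an environment without flash-attn installed.
chosen = resolve_attn_implementation("flash_attention_2", has_flash_attn=False)
```

Passing has_flash_attn explicitly makes the decision easy to test without depending on what is installed.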

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat/internvl/model/phi3/modeling_phi3.py
Lines: 1-1610

Signature

class Phi3Model(Phi3PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, use_cache=None,
                output_attentions=None, output_hidden_states=None,
                return_dict=None) -> Union[Tuple, BaseModelOutputWithPast]:
        ...

class Phi3ForCausalLM(Phi3PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, labels=None,
                use_cache=None, output_attentions=None,
                output_hidden_states=None, return_dict=None
                ) -> Union[Tuple, CausalLMOutputWithPast]:
        ...

class Phi3ForSequenceClassification(Phi3PreTrainedModel):
    ...

class Phi3ForTokenClassification(Phi3PreTrainedModel):
    ...

Import

from internvl.model.phi3.modeling_phi3 import Phi3ForCausalLM

I/O Contract

Inputs

Name	Type	Required	Description
input_ids	torch.LongTensor	No	Token indices of shape (batch_size, seq_length)
attention_mask	torch.Tensor	No	Attention mask; 1 for non-masked, 0 for masked
position_ids	torch.LongTensor	No	Position indices for RoPE; auto-generated if not provided
past_key_values	Cache or Tuple	No	Cached key-value states; supports both Cache and legacy tuple format
inputs_embeds	torch.FloatTensor	No	Pre-computed embeddings from multimodal pipeline
labels	torch.LongTensor	No	Labels for loss computation; -100 for ignored positions
use_cache	bool	No	Return cached key-value states for autoregressive decoding

Outputs

Name	Type	Description
loss	torch.FloatTensor	Cross-entropy loss; returned only when labels are provided
logits	torch.FloatTensor	Token prediction logits of shape (batch_size, seq_length, vocab_size)
past_key_values	Cache or Tuple	Cached key-value states for next step
hidden_states	Tuple[torch.FloatTensor]	All layer hidden states (optional)
attentions	Tuple[torch.FloatTensor]	All layer attention weights (optional)

Usage Examples

Basic Usage

from transformers import AutoTokenizer
from internvl.model.phi3.modeling_phi3 import Phi3ForCausalLM

# Load the checkpoint and its matching tokenizer.
model = Phi3ForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Tokenize the prompt into PyTorch tensors.
prompt = "This is an example script."
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 30 tokens in total (prompt included), then decode the first sequence.
generate_ids = model.generate(inputs.input_ids, max_length=30)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
