Implementation:OpenGVLab InternVL Phi3 Model

From Leeroopedia


Knowledge Sources
Domains: Language Model, Transformer Architecture, Causal LM
Last Updated: 2026-02-07 14:00 GMT

Overview

A full PyTorch implementation of the Phi-3 transformer model. It provides causal language modeling, sequence classification, and token classification heads; supports eager, Flash Attention 2, and SDPA attention backends; and implements SU and YaRN RoPE scaling for extended context lengths.

Description

This module implements the complete Phi-3 decoder-only transformer architecture adapted from HuggingFace's LLaMA/Mistral implementations with Phi-3-specific design choices:

Normalization: Uses Phi3RMSNorm (equivalent to T5LayerNorm) for pre-layer normalization in each decoder layer.
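
As a rough illustration of the normalization step, the sketch below applies RMS normalization to a single hidden vector using plain Python lists. The function name rms_norm, the eps default, and the list-based representation are assumptions chosen for readability; the module itself operates on PyTorch tensors.

```python
import math


def rms_norm(hidden, weight, eps=1e-6):
    """Scale `hidden` by the reciprocal of its root mean square, then by `weight`.

    Hypothetical helper for illustration; not the module's own API.
    """
    # Mean of squares over the hidden dimension (no mean subtraction, unlike LayerNorm).
    variance = sum(x * x for x in hidden) / len(hidden)
    inv_rms = 1.0 / math.sqrt(variance + eps)
    # Per-channel gain only; RMSNorm has no bias term.
    return [w * x * inv_rms for x, w in zip(hidden, weight)]


normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
```

With unit weights, the output vector has a mean square of approximately one, which is the property that keeps activations at a stable scale before attention and the MLP.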

Positional Encoding: Implements three Rotary Position Embedding (RoPE) variants: the base embedding (Phi3RotaryEmbedding); SU scaling (Phi3SuScaledRotaryEmbedding), which applies short or long frequency factors together with a logarithmic scaling factor; and YaRN scaling (Phi3YarnScaledRotaryEmbedding), whose scaling factor grows linearly with the logarithm of the context extension ratio. The scaled variants extend the context window beyond the original training length.
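
The following sketch shows how the inverse frequencies and the two scaling factors might be computed. The formulas are assumed to match the upstream HuggingFace Phi-3 code and should be checked against the source file; the function names, the base default, and the use of plain Python lists are illustrative assumptions.

```python
import math


def rope_inv_freq(dim, base=10000.0, ext_factors=None):
    """Inverse rotary frequencies, optionally rescaled per frequency pair.

    `ext_factors` stands in for the short/long factor lists (assumption).
    """
    exponents = [i / dim for i in range(0, dim, 2)]
    if ext_factors is None:
        ext_factors = [1.0] * len(exponents)
    return [1.0 / (f * base ** e) for f, e in zip(ext_factors, exponents)]


def su_scaling_factor(max_pos, original_max_pos):
    """Assumed SU factor: sqrt(1 + log(scale) / log(original_max_pos))."""
    scale = max_pos / original_max_pos
    if scale <= 1.0:
        return 1.0
    return math.sqrt(1.0 + math.log(scale) / math.log(original_max_pos))


def yarn_scaling_factor(max_pos, original_max_pos):
    """Assumed YaRN factor: 0.1 * log(scale) + 1, linear in log(scale)."""
    scale = max_pos / original_max_pos
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


# Example: extending a 4096-token model to 131072 positions (a 32x ratio).
su = su_scaling_factor(131072, 4096)
yarn = yarn_scaling_factor(131072, 4096)
```

In this sketch the scaling factor would multiply the cosine and sine tables, while the per-frequency factors stretch individual rotation periods.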

MLP: A gated SiLU feedforward network (Phi3MLP) using a fused gate-up projection that produces a 2x intermediate-sized output, which is split into gate and up-projection paths.
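
To make the split concrete, the sketch below takes the output of a fused gate-up projection for one token and applies the gating step. The helper names and the ordering of the halves (gate first, then up) are assumptions, and plain lists stand in for tensors.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gated_activation(fused, intermediate_size):
    """Split a fused gate-up output in half and gate the up path with SiLU.

    Hypothetical helper; assumes the first half is the gate path.
    """
    if len(fused) != 2 * intermediate_size:
        raise ValueError("fused output must be twice the intermediate size")
    gate = fused[:intermediate_size]
    up = fused[intermediate_size:]
    return [u * silu(g) for g, u in zip(gate, up)]


# Fused output of width 4 for an intermediate size of 2.
activated = gated_activation([0.0, 2.0, 5.0, 7.0], intermediate_size=2)
```

The gated result has the intermediate size again and would then pass through the down projection back to the hidden size. Fusing the two projections trades two matrix multiplications for one larger one.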

Attention Backends: Three implementations registered in PHI3_ATTENTION_CLASSES: eager (Phi3Attention) for standard softmax attention, Flash Attention 2 (Phi3FlashAttention2) with sliding window attention support and variable-length handling, and SDPA (Phi3SdpaAttention) using PyTorch's native scaled dot-product attention.
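
A registry of this kind is typically a dictionary keyed by the configured implementation name. The sketch below uses empty stand-in classes and a hypothetical build_attention helper; the key strings are assumed to follow the usual HuggingFace naming.

```python
class EagerAttention:
    """Stand-in for the eager softmax attention class (hypothetical)."""


class FlashAttention2:
    """Stand-in for the Flash Attention 2 class (hypothetical)."""


class SdpaAttention:
    """Stand-in for the SDPA-based attention class (hypothetical)."""


# Assumed key names; the real registry maps them to the Phi3* classes.
ATTENTION_CLASSES = {
    "eager": EagerAttention,
    "flash_attention_2": FlashAttention2,
    "sdpa": SdpaAttention,
}


def build_attention(attn_implementation):
    """Look up and instantiate the attention class for the given name."""
    try:
        attn_cls = ATTENTION_CLASSES[attn_implementation]
    except KeyError:
        known = ", ".join(sorted(ATTENTION_CLASSES))
        raise ValueError(
            f"unknown attention implementation {attn_implementation!r}; "
            f"expected one of: {known}"
        ) from None
    return attn_cls()


attention = build_attention("sdpa")
```

Keeping the choice in a lookup table means each decoder layer can be built the same way regardless of backend, with the selection driven entirely by configuration.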

Decoder Layer: Each Phi3DecoderLayer applies input layer normalization and self-attention, followed by residual dropout and a residual connection, then post-attention layer normalization and the MLP, again followed by residual dropout and a residual connection.
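
The data flow through one layer can be sketched as two pre-norm residual sub-blocks. The function below is a simplified illustration: the sub-modules are passed in as plain callables, dropout is omitted (it is the identity at inference time), and all names are hypothetical.

```python
def add(a, b):
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def decoder_layer(hidden, input_norm, self_attn, post_attn_norm, mlp):
    """Apply one pre-norm decoder layer to a single hidden vector."""
    # Attention sub-block: normalize, attend, add back to the residual stream.
    residual = hidden
    hidden = add(residual, self_attn(input_norm(hidden)))
    # Feed-forward sub-block: normalize, run the MLP, add back again.
    residual = hidden
    hidden = add(residual, mlp(post_attn_norm(hidden)))
    return hidden


def identity(vector):
    return vector


def double(vector):
    return [2.0 * x for x in vector]


# Toy sub-modules: identity norms, with "attention" and "MLP" that double their input.
out = decoder_layer([1.0, -1.0], identity, double, identity, double)
```

Because normalization sits inside each branch rather than on the residual stream itself, the stream carries unnormalized activations from layer to layer.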

Model Classes: Four variants: Phi3Model (the base transformer, which uses DynamicCache for KV caching), Phi3ForCausalLM (adds a language modeling head with shifted cross-entropy loss), Phi3ForSequenceClassification, and Phi3ForTokenClassification.
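
The shifted cross-entropy loss pairs the logits at position t with the label at position t + 1, skipping labels equal to -100. The sketch below computes it for a single unbatched sequence with plain Python lists; the function name and the list-based layout are assumptions, and the module itself relies on PyTorch.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss


def shifted_cross_entropy(logits, labels):
    """Mean next-token cross-entropy for one sequence (illustrative helper)."""
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first token is never a prediction target
    total, count = 0.0, 0
    for row, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else float("nan")


# Three positions, vocabulary of four, uniform logits; one target is ignored.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
labels = [1, IGNORE_INDEX, 3]
loss = shifted_cross_entropy(logits, labels)
```

With uniform logits, every counted position contributes the log of the vocabulary size, so the loss here equals log(4).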

Usage

Use this module as an alternative language model backbone within InternVL. The Phi3ForCausalLM class is instantiated via from_pretrained and processes combined visual and textual embeddings. The model automatically falls back to eager attention when flash-attn is unavailable.
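
A fallback of this kind could be sketched as follows. The helper names are hypothetical, the "flash_attention_2" and "eager" strings are assumed to match the usual HuggingFace implementation names, and the actual module may detect the package differently (for example with a guarded import).

```python
import importlib.util


def flash_attn_installed():
    """Return True if the flash_attn package can be found, without importing it."""
    return importlib.util.find_spec("flash_attn") is not None


def resolve_attn_implementation(requested, flash_available):
    """Downgrade a Flash Attention 2 request to eager when the package is missing."""
    if requested == "flash_attention_2" and not flash_available:
        return "eager"
    return requested


attn_implementation = resolve_attn_implementation(
    "flash_attention_2", flash_attn_installed()
)
```

Passing availability in as an argument keeps the decision logic testable on machines without a GPU or the flash-attn wheel.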

Code Reference

Source Location

Signature

class Phi3Model(Phi3PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, use_cache=None,
                output_attentions=None, output_hidden_states=None,
                return_dict=None) -> Union[Tuple, BaseModelOutputWithPast]:
        ...

class Phi3ForCausalLM(Phi3PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, labels=None,
                use_cache=None, output_attentions=None,
                output_hidden_states=None, return_dict=None
                ) -> Union[Tuple, CausalLMOutputWithPast]:
        ...

class Phi3ForSequenceClassification(Phi3PreTrainedModel):
    ...

class Phi3ForTokenClassification(Phi3PreTrainedModel):
    ...

Import

from internvl.model.phi3.modeling_phi3 import Phi3ForCausalLM

I/O Contract

Inputs

Name             Type               Required  Description
input_ids        torch.LongTensor   No        Token indices of shape (batch_size, seq_length)
attention_mask   torch.Tensor       No        Attention mask; 1 for non-masked, 0 for masked
position_ids     torch.LongTensor   No        Position indices for RoPE; auto-generated if not provided
past_key_values  Cache or Tuple     No        Cached key-value states; supports both Cache and legacy tuple format
inputs_embeds    torch.FloatTensor  No        Pre-computed embeddings from multimodal pipeline
labels           torch.LongTensor   No        Labels for loss computation; -100 for ignored positions
use_cache        bool               No        Return cached key-value states for autoregressive decoding
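
When position_ids is omitted, positions are typically derived from the current sequence length and the number of tokens already held in the cache. The sketch below assumes that convention; the helper name is hypothetical and a plain list stands in for a LongTensor.

```python
def default_position_ids(seq_length, past_length=0):
    """Positions for the new tokens, continuing after any cached tokens."""
    return list(range(past_length, past_length + seq_length))


prefill_positions = default_position_ids(4)               # full prompt, empty cache
decode_positions = default_position_ids(1, past_length=4)  # one new token, 4 cached
```

During autoregressive decoding with use_cache enabled, only the newest token is fed in, so its position must be offset by the cached length for RoPE to rotate it correctly.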

Outputs

Name             Type                      Description
loss             torch.FloatTensor         Cross-entropy loss when labels provided
logits           torch.FloatTensor         Token prediction logits of shape (batch_size, seq_length, vocab_size)
past_key_values  Cache or Tuple            Cached key-value states for next step
hidden_states    Tuple[torch.FloatTensor]  All layer hidden states (optional)
attentions       Tuple[torch.FloatTensor]  All layer attention weights (optional)

Usage Examples

Basic Usage

from transformers import AutoTokenizer
from internvl.model.phi3.modeling_phi3 import Phi3ForCausalLM

# Load the checkpoint and its matching tokenizer.
model = Phi3ForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Tokenize the prompt into PyTorch tensors.
prompt = "This is an example script."
inputs = tokenizer(prompt, return_tensors="pt")

# Generate up to 30 tokens in total (prompt included), then decode the first sequence.
generate_ids = model.generate(
    inputs.input_ids, attention_mask=inputs.attention_mask, max_length=30
)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
