
Workflow:OpenGVLab InternVL Supervised Finetuning

From Leeroopedia


Knowledge Sources
Domains VLMs, Fine_Tuning, Multimodal
Last Updated 2026-02-07 14:00 GMT

Overview

End-to-end process for full-parameter supervised fine-tuning of InternVL multimodal vision-language models on custom datasets using DeepSpeed distributed training.

Description

This workflow covers the standard procedure for adapting a pre-trained InternVL model to domain-specific tasks by fine-tuning both the language model (LLM) and the MLP projector while keeping the vision encoder (InternViT) frozen. The process starts from a HuggingFace-hosted InternVL checkpoint, formats custom data into the required JSONL conversation structure, configures DeepSpeed ZeRO for distributed training, and produces a fine-tuned model checkpoint. This is the 2nd finetune path documented in the repository, intended for users who want to adapt an already instruction-tuned InternVL model to their specific domain.

Usage

Execute this workflow when you have a domain-specific multimodal dataset (image-text or video-text conversations) and need to adapt a pre-trained InternVL model to your task. This path updates all LLM and MLP parameters, making it suitable when you have sufficient GPU resources (8+ GPUs with 80GB VRAM each) and enough training data to justify full-parameter updates. Use the LoRA variant instead if GPU resources are limited.

Execution Steps

Step 1: Prepare Training Data

Format your custom dataset into JSONL files following the InternVL conversation schema. Each line must be a JSON object containing an optional image or video path and a conversations array with alternating human/gpt turns. Multi-image inputs are supported by providing an array of image paths and placing multiple <image> tokens in the text. Create a JSON meta-file that references each JSONL dataset shard with sampling weights.

Key considerations:

  • Each conversation entry needs a from field (human or gpt) and a value field
  • Image paths should be relative to a configurable root directory
  • Use <image> placeholder tokens in the human turn text to indicate where visual input is inserted
  • The meta JSON file controls dataset mixing ratios via sampling weights and repeat factors
  • Video data is supported through frame extraction with configurable sampling strategies
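The schema above can be sketched in a few lines of Python. The sample content, file names, and image path below are hypothetical; the meta-file keys (root, annotation, repeat_time, etc.) follow the pattern described above, but check the repository's example meta files for the exact fields it expects.

```python
import json

# Hypothetical example: one training sample in the InternVL conversation
# schema. The "image" path is relative to the root directory configured in
# the meta file; "<image>" marks where visual features are inserted.
sample = {
    "id": 0,
    "image": "images/example_0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat defect is visible on this part?"},
        {"from": "gpt", "value": "There is a hairline crack near the left mounting hole."},
    ],
}

# Write one JSON object per line (JSONL).
with open("my_dataset.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")

# Meta file referencing each JSONL shard; sampling is controlled per
# dataset (repeat_time as a repeat factor, length for weighting).
meta = {
    "my_dataset": {
        "root": "data/my_dataset/",
        "annotation": "my_dataset.jsonl",
        "data_augment": False,
        "repeat_time": 1,
        "length": 1,
    }
}
with open("meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```

For multi-image samples, replace the single image path with a list of paths and include one <image> token per image in the human turn.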

Step 2: Configure Training Environment

Set up the distributed training configuration including DeepSpeed ZeRO stage, number of GPUs, batch size, and gradient accumulation. Select the appropriate DeepSpeed config based on model size: ZeRO Stage 1 for models up to 8B parameters, ZeRO Stage 3 for larger models (26B+), and ZeRO Stage 3 with CPU offloading for 78B+ models.

Key considerations:

  • Match the DeepSpeed ZeRO stage to your model size and available GPU memory
  • Total effective batch size = num_gpus × per_device_batch_size × gradient_accumulation_steps
  • Enable bf16 mixed precision for training stability on modern GPUs
  • Set gradient checkpointing to reduce memory at the cost of compute
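The batch-size arithmetic and the shape of a ZeRO config can be sketched as follows. The config dict is an illustration of the structure only, with assumed values; the repository ships its own zero_stage*.json files, which should be used as the starting point.

```python
import json

def effective_batch_size(num_gpus, per_device_batch_size, grad_accum_steps):
    """Total samples contributing to one optimizer step across all workers."""
    return num_gpus * per_device_batch_size * grad_accum_steps

# Example: 8 GPUs, 4 samples per device, accumulate over 4 micro-steps.
assert effective_batch_size(8, 4, 4) == 128

# Minimal ZeRO Stage 1 config sketch (bf16, gradient clipping) -- an
# illustration of the shape, not a drop-in replacement for the repo's files.
zero1_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {"stage": 1},
}
print(json.dumps(zero1_config, indent=2))
```

For a 26B+ model, the zero_optimization stage would be 3; for 78B+ models, an offload_optimizer / offload_param section moves states to CPU memory at a throughput cost.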

Step 3: Load Base Model

Load the pre-trained InternVL model from HuggingFace Hub. The model consists of three components: the InternViT vision encoder, an MLP projector, and a language model backbone. Configure which components to freeze: in the standard 2nd finetune path, the vision backbone is frozen while the LLM and MLP are trainable.

Key considerations:

  • The vision backbone (InternViT) is frozen by default to preserve learned visual representations
  • Both the LLM and MLP projector are unfrozen for full adaptation
  • A drop path rate of 0.1 is applied during full finetuning for regularization
  • Dynamic image resolution with configurable patch counts (1-12 patches of 448x448) is enabled
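The freezing policy can be expressed as a simple rule over parameter names. The module prefixes below mirror the InternVL component naming (vision encoder, MLP projector, language model) but are illustrative; in the actual training script, freezing is controlled by dedicated flags, and the rule would be applied to a real torch model via p.requires_grad_(...) over model.named_parameters().

```python
# Sketch of the component-freezing policy for the 2nd finetune path:
# parameters under the vision encoder stay frozen; everything under the
# LLM and MLP projector trains. Prefixes are assumptions for illustration.
FROZEN_PREFIXES = ("vision_model.",)

def is_trainable(param_name: str) -> bool:
    return not param_name.startswith(FROZEN_PREFIXES)

param_names = [
    "vision_model.encoder.layers.0.attn.qkv.weight",   # frozen
    "mlp1.1.weight",                                   # trainable (projector)
    "language_model.model.layers.0.self_attn.q_proj.weight",  # trainable (LLM)
]
trainable = [n for n in param_names if is_trainable(n)]
print(trainable)
```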

Step 4: Train Model

Launch distributed training using the HuggingFace Trainer with DeepSpeed integration. The trainer handles data loading with dynamic batching, loss computation on assistant tokens only, gradient accumulation, and checkpoint saving. Image tokens in the input are replaced with visual features extracted by the frozen vision encoder.

Key considerations:

  • Loss is computed only on assistant (gpt) response tokens; human prompts and image tokens are masked
  • Packed sequence training can be enabled to pack multiple samples into fixed-length sequences for GPU efficiency
  • The conversation template must match the LLM backbone (InternLM2, Phi-3, Qwen2, etc.)
  • Checkpoints are saved at configurable intervals with optional evaluation
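The loss-masking rule above can be illustrated with a toy label-building function. The token IDs are made up, and the flat per-token role tags are a simplification: real preprocessing derives assistant-token spans from the conversation template. The -100 ignore index is the standard value skipped by HuggingFace's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(token_ids, roles):
    """Keep labels only for assistant-turn tokens; mask everything else.

    `roles` holds one tag per token: 'human', 'image', or 'gpt'. This flat
    tagging is a simplification for illustration.
    """
    return [tok if role == "gpt" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

# Toy sequence: 1 human token, 2 image tokens, 5 human tokens, 4 gpt tokens.
token_ids = [101, 9000, 9000, 2054, 2003, 1996, 3895, 102, 464, 318, 257, 2415]
roles = ["human"] + ["image"] * 2 + ["human"] * 5 + ["gpt"] * 4
labels = build_labels(token_ids, roles)
print(labels)
```

Only the last four positions (the gpt response) retain their token IDs as labels; prompt and image positions contribute nothing to the gradient.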

Step 5: Validate and Export Model

After training completes, the final checkpoint is saved in the HuggingFace model format. Optionally convert between custom and HuggingFace weight formats using the provided conversion tools. The fine-tuned model can be loaded directly for inference using the standard InternVLChatModel API.

Key considerations:

  • Use the custom-to-HF conversion tool if deploying to HuggingFace Hub
  • Verify the model loads correctly and produces reasonable outputs on sample inputs
  • The tokenizer is saved alongside the model weights automatically
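A lightweight way to verify the export before attempting inference is to check that the checkpoint directory contains the expected files. The file names below are the usual HuggingFace conventions, not InternVL-specific guarantees; large models save sharded weights (model-00001-of-000NN.safetensors), which the suffix check covers.

```python
import os

REQUIRED = ["config.json", "tokenizer_config.json"]
WEIGHT_CANDIDATES = ["model.safetensors", "pytorch_model.bin"]

def checkpoint_looks_complete(ckpt_dir: str) -> bool:
    """Sanity-check an exported HF-format checkpoint directory."""
    files = set(os.listdir(ckpt_dir))
    has_required = all(name in files for name in REQUIRED)
    has_weights = any(name in files for name in WEIGHT_CANDIDATES) or any(
        f.endswith(".safetensors") for f in files
    )
    return has_required and has_weights
```

If the check passes, load the model for inference as usual (with trust_remote_code enabled for the custom InternVLChatModel code) and confirm it produces reasonable outputs on a few sample image-question pairs.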

Execution Diagram

GitHub URL

Workflow Repository