Principle: OpenGVLab InternVL Supervised Training Loop
| Knowledge Sources | |
|---|---|
| Domains | Training, Deep_Learning, Distributed_Computing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A managed training loop that handles gradient computation, optimization, distributed training, checkpointing, and logging for supervised fine-tuning of vision-language models.
Description
The supervised training loop abstracts away the boilerplate of training large models in distributed settings. Rather than requiring a custom PyTorch training loop, the framework delegates to HuggingFace's Trainer class, which provides:
- Gradient accumulation: Simulates larger batch sizes across multiple forward passes
- Distributed training: Integration with DeepSpeed ZeRO for memory-efficient multi-GPU training
- Mixed precision: BF16/FP16 training for reduced memory and faster computation
- Checkpointing: Periodic model saving with configurable strategies
- Logging: Training metrics tracked via TensorBoard or Weights & Biases
- Resume from checkpoint: Seamless training continuation after interruptions
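The gradient-accumulation behaviour listed above can be illustrated in plain PyTorch: accumulating losses scaled by 1/k over k micro-batches produces the same gradient as a single pass over the combined batch. This is a minimal sketch of the idea, not the Trainer's internal implementation:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
data = torch.randn(8, 4)
target = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()

# Single pass over the full batch of 8
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Two micro-batches of 4; each loss is scaled by 1/2 before backward,
# so the accumulated gradient matches the full-batch gradient
model.zero_grad()
for chunk_x, chunk_y in zip(data.split(4), target.split(4)):
    (loss_fn(model(chunk_x), chunk_y) / 2).backward()
accum_grad = model.weight.grad.clone()
```

The two gradients agree to floating-point tolerance, which is why gradient accumulation lets small-memory devices simulate large effective batch sizes.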
The training loop operates on data produced by the data collator, which batches and pads variable-length multimodal sequences.
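A collator for variable-length multimodal sequences typically right-pads input ids with the tokenizer's pad id and pads labels with -100 so padding never contributes to the loss. The sketch below is illustrative only, not InternVL's actual collator:

```python
import torch

def collate(batch, pad_token_id=0):
    """Right-pad variable-length examples; labels are padded with -100."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, labels, attention_mask = [], [], []
    for ex in batch:
        pad = max_len - len(ex["input_ids"])
        input_ids.append(ex["input_ids"] + [pad_token_id] * pad)
        labels.append(ex["labels"] + [-100] * pad)
        attention_mask.append([1] * len(ex["input_ids"]) + [0] * pad)
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.tensor(attention_mask),
    }

batch = collate([
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
    {"input_ids": [8, 9], "labels": [8, 9]},
])
```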
Usage
Use this principle when performing supervised fine-tuning (full parameter or LoRA) on InternVL models. The Trainer handles all aspects of the training loop; the user only needs to configure the model, dataset, and training arguments.
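A typical configuration might look like the following sketch. The argument values are assumptions chosen for illustration, and `model`, `train_dataset`, and `collate_fn` are placeholders for objects built earlier in the pipeline:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,     # effective batch = 4 * 8 * num_gpus
    bf16=True,                         # mixed-precision training
    learning_rate=2e-5,
    num_train_epochs=1,
    save_steps=500,                    # periodic checkpointing
    logging_steps=10,
    report_to="tensorboard",
    deepspeed="ds_zero2_config.json",  # DeepSpeed ZeRO integration
)

trainer = Trainer(
    model=model,                  # placeholder: the InternVL model
    args=args,
    train_dataset=train_dataset,  # placeholder: the SFT dataset
    data_collator=collate_fn,     # placeholder: the multimodal collator
)
trainer.train()  # pass resume_from_checkpoint=True to continue a run
```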
Theoretical Basis
The supervised training objective minimizes the cross-entropy loss over the assistant's response tokens: L(θ) = -Σ_{t ∈ assistant} log p_θ(y_t | y_<t, images). Human-turn tokens and image tokens are masked (label = -100) and excluded from the loss computation.
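PyTorch implements this masking convention directly: cross-entropy with `ignore_index=-100` skips every masked position, which is equivalent to computing the loss only over the unmasked tokens. A minimal illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 6, 32)                        # (batch, seq_len, vocab)
labels = torch.tensor([[-100, -100, 4, 7, -100, 2]])  # human/image tokens masked

# Loss over the whole sequence; -100 positions contribute nothing
loss_all = F.cross_entropy(
    logits.view(-1, 32), labels.view(-1), ignore_index=-100
)

# Equivalent loss computed only on the unmasked positions
keep = labels.view(-1) != -100
loss_kept = F.cross_entropy(logits.view(-1, 32)[keep], labels.view(-1)[keep])
```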
The training loop with DeepSpeed ZeRO:
# Pseudo-code: Managed training loop
for step, batch in enumerate(dataloader):
    # Forward pass under bf16 mixed precision
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(input_ids, labels, pixel_values, image_flags).loss

    # Scale the loss so accumulated gradients match the large-batch gradient
    loss = loss / gradient_accumulation_steps
    loss.backward()

    # Optimizer step once every gradient_accumulation_steps micro-batches
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    # Periodic checkpointing
    if (step + 1) % save_steps == 0:
        save_checkpoint()