Environment: OpenGVLab InternVL DeepSpeed
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Distributed_Training, Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
DeepSpeed 0.13.5 is the distributed training framework that provides ZeRO optimizer stages for memory-efficient training of InternVL models from 1B to 108B parameters.
Description
DeepSpeed is used as the primary distributed training backend for all InternVL training workflows. It provides ZeRO (Zero Redundancy Optimizer) memory optimization at three stages: Stage 1 (optimizer state partitioning), Stage 2 (+ gradient partitioning), and Stage 3 (+ parameter partitioning). Different ZeRO stages are selected based on model size: smaller models (1B-8B) use Stage 1, while larger models (26B-108B) use Stage 3. DeepSpeed also handles distributed initialization via `deepspeed.init_distributed()`.
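As a rough intuition for why stage selection tracks model size, each ZeRO stage partitions one more piece of training state across the data-parallel ranks. The sketch below is illustrative only: the byte counts assume fp16 parameters/gradients with fp32 Adam optimizer states (master weights + momentum + variance), which is an assumption of this example, not a value taken from the InternVL configs.

```python
def per_gpu_training_state_gb(num_params: float, world_size: int, zero_stage: int) -> float:
    """Rough per-GPU memory (GB) for params + grads + Adam optimizer states.

    Assumes fp16 params/grads (2 bytes each) and fp32 Adam states
    (12 bytes per parameter). Illustrative sketch, not a precise model.
    """
    bytes_params, bytes_grads, bytes_optim = 2, 2, 12
    # ZeRO partitions training state across ranks, one piece per stage:
    optim = bytes_optim / world_size if zero_stage >= 1 else bytes_optim   # Stage 1
    grads = bytes_grads / world_size if zero_stage >= 2 else bytes_grads   # Stage 2
    params = bytes_params / world_size if zero_stage >= 3 else bytes_params  # Stage 3
    return num_params * (params + grads + optim) / 1e9

# An 8B model on 8 GPUs under Stage 1 vs. a 70B model on 64 GPUs under Stage 3
stage1 = per_gpu_training_state_gb(8e9, world_size=8, zero_stage=1)
stage3 = per_gpu_training_state_gb(70e9, world_size=64, zero_stage=3)
print(round(stage1, 1), round(stage3, 1))  # → 44.0 17.5
```

Even this crude estimate shows the pattern the config files encode: an 8B model's state fits on 80 GB GPUs with only optimizer-state partitioning, while 26B+ models need Stage 3 to partition parameters and gradients as well.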
Usage
Use this environment for all training workflows (SFT, LoRA, pretraining, MPO/DPO). DeepSpeed is initialized at the start of every training script and manages distributed communication, memory optimization, and mixed-precision training. The DPO trainer includes special DeepSpeed preparation methods for handling reference models.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | Multi-GPU setup (1-512 GPUs) | Single-node or multi-node via SLURM/OpenMPI |
| Network | High-bandwidth GPU interconnect | NVLink/NVSwitch for intra-node; InfiniBand for inter-node |
| OS | Linux | DeepSpeed not supported on Windows/macOS |
Dependencies
Python Packages
- `deepspeed` == 0.13.5
- `torch` >= 2.0 (prerequisite)
- `accelerate` (for HuggingFace Trainer integration)
Configuration Files
Training shell scripts reference DeepSpeed JSON configs:
- `zero_stage1_config.json` (for models 1B-8B)
- `zero_stage3_config.json` (for models 26B-78B)
- `zero_stage3_config_34b.json` (for 34B-40B models)
- `zero_stage3_config_100b.json` (for 78B-108B models)
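For orientation, a minimal ZeRO Stage 1 config has roughly the following shape. This is a generic DeepSpeed config sketch, not the contents of the repository's `zero_stage1_config.json`; the `"auto"` values are resolved by the HuggingFace Trainer integration at runtime.

```json
{
  "zero_optimization": {
    "stage": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto"
}
```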
Credentials
No credentials required. Distributed environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) must be set by the launcher.
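For example, the `deepspeed` launcher exports these variables for each local process automatically. A hypothetical single-node invocation might look like the following; the training-script path and GPU count are illustrative, and the config path follows the shell scripts referenced in this document.

```shell
# Hypothetical single-node launch; script path and GPU count are illustrative.
NUM_GPUS=8
DS_CONFIG=internvl_chat/shell/zero_stage1_config.json

CMD="deepspeed --num_gpus=${NUM_GPUS} internvl/train/internvl_chat_finetune.py --deepspeed ${DS_CONFIG}"
echo "${CMD}"
```

On multi-node SLURM jobs, `srun` (or OpenMPI's `mpirun`) sets the rank variables instead, and `MASTER_ADDR`/`MASTER_PORT` must point at the rank-0 node.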
Quick Install
```shell
pip install deepspeed==0.13.5
```
Code Evidence
Distributed initialization from `dist_utils.py:6,48-51`:
```python
import deepspeed
import torch

# `rank` and `backend` are supplied by the surrounding launcher setup
num_gpus = torch.cuda.device_count()
torch.cuda.set_device(rank % num_gpus)
deepspeed.init_distributed(dist_backend=backend)
```
DeepSpeed flag usage in training scripts (e.g., `internvl2_5_8b_dynamic_res_2nd_finetune_full.sh`):
```shell
--deepspeed internvl_chat/shell/zero_stage1_config.json
```
DPO Trainer DeepSpeed preparation from `trainer_dpo.py:10,178-198`:
```python
import deepspeed

def _prepare_deepspeed(self, model):
    # Custom DeepSpeed preparation for the DPO reference model
    deepspeed_plugin = self.accelerator.state.deepspeed_plugin
    config_kwargs = {**deepspeed_plugin.deepspeed_config}
    # ... configures ZeRO stage for reference model
```
Launcher selection from `internvl_chat_finetune.py:807-808`:
```python
launcher = os.environ.get('LAUNCHER', 'slurm')
init_dist(launcher=launcher, backend='nccl')
```
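The body of `init_dist` is not shown in the excerpt. As a minimal sketch of what a SLURM-aware dispatcher typically does, the function below translates scheduler-provided variables into the names torch.distributed and DeepSpeed expect; the function and variable handling here are hypothetical, not the repository's implementation.

```python
import os

def resolve_dist_env(launcher: str = 'slurm') -> dict:
    """Map scheduler variables to RANK/WORLD_SIZE/LOCAL_RANK.

    Hypothetical sketch: the real init_dist in the InternVL codebase
    may differ in names and behavior.
    """
    if launcher == 'slurm':
        # SLURM exposes per-task identity via SLURM_* variables
        env = {
            'RANK': os.environ.get('SLURM_PROCID', '0'),
            'WORLD_SIZE': os.environ.get('SLURM_NTASKS', '1'),
            'LOCAL_RANK': os.environ.get('SLURM_LOCALID', '0'),
        }
    else:
        # 'pytorch'-style launchers (torchrun, deepspeed) export these directly
        env = {key: os.environ.get(key, default) for key, default in
               [('RANK', '0'), ('WORLD_SIZE', '1'), ('LOCAL_RANK', '0')]}
    os.environ.update(env)
    return env

# Example: simulate one SLURM task before initializing NCCL/DeepSpeed
os.environ.update({'SLURM_PROCID': '3', 'SLURM_NTASKS': '8', 'SLURM_LOCALID': '3'})
print(resolve_dist_env('slurm'))  # → {'RANK': '3', 'WORLD_SIZE': '8', 'LOCAL_RANK': '3'}
```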
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `DeepSpeed not found` | DeepSpeed package not installed | `pip install deepspeed==0.13.5` |
| `NCCL timeout` | Network issues in multi-node setup | Check `MASTER_ADDR`/`MASTER_PORT` and firewall rules |
| `CUDA OOM with ZeRO Stage 1` | Model too large for Stage 1 | Switch to ZeRO Stage 3 config |
| `RuntimeError: DeepSpeed ZeRO-3 is not compatible` | HuggingFace Trainer version conflict | Ensure `deepspeed==0.13.5` and `transformers==4.37.2` |
Compatibility Notes
- ZeRO Stage Selection: Shell scripts use Stage 1 for models up to 8B parameters and Stage 3 for 26B+ models. Using the wrong stage can cause OOM or excessive communication overhead.
- DeepSpeed must init before HfArgumentParser: A comment in `internvl_chat_finetune.py:806` states: "If use DeepSpeed zero3, init_dist must before HfArgumentParser." This ordering is critical.
- DPO Reference Model: The `MultimodalDPOTrainer` includes custom DeepSpeed preparation (`_prepare_deepspeed`) to handle the reference model separately from the policy model.
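The stage-to-size mapping above can be captured in a small helper. The thresholds below mirror the config list in this document (the 78B boundary appears in two config ranges; this sketch resolves it toward the 100B config); the function itself is hypothetical, not part of the InternVL codebase.

```python
def select_deepspeed_config(model_params_b: float) -> str:
    """Pick a ZeRO config file by model size in billions of parameters.

    Hypothetical helper; thresholds follow the config files listed in
    this document (Stage 1 up to 8B, Stage 3 variants for larger models).
    """
    if model_params_b <= 8:
        return 'zero_stage1_config.json'
    if 34 <= model_params_b <= 40:
        return 'zero_stage3_config_34b.json'
    if model_params_b >= 78:
        return 'zero_stage3_config_100b.json'
    return 'zero_stage3_config.json'

print(select_deepspeed_config(8))    # → zero_stage1_config.json
print(select_deepspeed_config(26))   # → zero_stage3_config.json
print(select_deepspeed_config(108))  # → zero_stage3_config_100b.json
```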