Environment: OpenGVLab InternVL DeepSpeed
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Distributed_Training, Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
DeepSpeed 0.13.5 is the distributed training framework that provides ZeRO optimizer stages for memory-efficient training of InternVL models from 1B to 108B parameters.
Description
DeepSpeed is used as the primary distributed training backend for all InternVL training workflows. It provides ZeRO (Zero Redundancy Optimizer) memory optimization at three stages: Stage 1 (optimizer state partitioning), Stage 2 (+ gradient partitioning), and Stage 3 (+ parameter partitioning). Different ZeRO stages are selected based on model size: smaller models (1B-8B) use Stage 1, while larger models (26B-108B) use Stage 3. DeepSpeed also handles distributed initialization via `deepspeed.init_distributed()`.
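As a rough intuition for why stage selection tracks model size, each ZeRO stage partitions one more piece of training state across the data-parallel ranks. The sketch below is illustrative only: the byte counts assume fp16 parameters/gradients with fp32 Adam optimizer states (master weights + momentum + variance), which is an assumption of this example, not a value taken from the InternVL configs.

```python
def per_gpu_training_state_gb(num_params: float, world_size: int, zero_stage: int) -> float:
    """Rough per-GPU memory (GB) for params + grads + Adam optimizer states.

    Assumes fp16 params/grads (2 bytes each) and fp32 Adam states
    (12 bytes per parameter). Illustrative sketch, not a precise model.
    """
    bytes_params, bytes_grads, bytes_optim = 2, 2, 12
    # ZeRO partitions training state across ranks, one piece per stage:
    optim = bytes_optim / world_size if zero_stage >= 1 else bytes_optim   # Stage 1
    grads = bytes_grads / world_size if zero_stage >= 2 else bytes_grads   # Stage 2
    params = bytes_params / world_size if zero_stage >= 3 else bytes_params  # Stage 3
    return num_params * (params + grads + optim) / 1e9

# An 8B model on 8 GPUs under Stage 1 vs. a 70B model on 64 GPUs under Stage 3
stage1 = per_gpu_training_state_gb(8e9, world_size=8, zero_stage=1)
stage3 = per_gpu_training_state_gb(70e9, world_size=64, zero_stage=3)
print(round(stage1, 1), round(stage3, 1))  # → 44.0 17.5
```

Even this crude estimate shows the pattern the config files encode: an 8B model's state fits on 80 GB GPUs with only optimizer-state partitioning, while 26B+ models need Stage 3 to partition parameters and gradients as well.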
Usage
Use this environment for all training workflows (SFT, LoRA, pretraining, MPO/DPO). DeepSpeed is initialized at the start of every training script and manages distributed communication, memory optimization, and mixed-precision training. The DPO trainer includes special DeepSpeed preparation methods for handling reference models.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | Multi-GPU setup (1-512 GPUs) | Single-node or multi-node via SLURM/OpenMPI |
| Network | High-bandwidth GPU interconnect | NVLink/NVSwitch for intra-node; InfiniBand for inter-node |
| OS | Linux | DeepSpeed not supported on Windows/macOS |
Dependencies
Python Packages
- `deepspeed` == 0.13.5
- `torch` >= 2.0 (prerequisite)
- `accelerate` (for HuggingFace Trainer integration)
Configuration Files
Training shell scripts reference DeepSpeed JSON configs:
- `zero_stage1_config.json` (for models 1B-8B)
- `zero_stage3_config.json` (for models 26B-78B)
- `zero_stage3_config_34b.json` (for 34B-40B models)
- `zero_stage3_config_100b.json` (for 78B-108B models)
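For orientation, a minimal ZeRO Stage 1 config has roughly the following shape. This is a generic DeepSpeed config sketch, not the contents of the repository's `zero_stage1_config.json`; the `"auto"` values are resolved by the HuggingFace Trainer integration at runtime.

```json
{
  "zero_optimization": {
    "stage": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_clipping": "auto"
}
```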
Credentials
No credentials required. Distributed environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) must be set by the launcher.
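For example, the `deepspeed` launcher exports these variables for each local process automatically. A hypothetical single-node invocation might look like the following; the training-script path and GPU count are illustrative, and the config path follows the shell scripts referenced in this document.

```shell
# Hypothetical single-node launch; script path and GPU count are illustrative.
NUM_GPUS=8
DS_CONFIG=internvl_chat/shell/zero_stage1_config.json

CMD="deepspeed --num_gpus=${NUM_GPUS} internvl/train/internvl_chat_finetune.py --deepspeed ${DS_CONFIG}"
echo "${CMD}"
```

On multi-node SLURM jobs, `srun` (or OpenMPI's `mpirun`) sets the rank variables instead, and `MASTER_ADDR`/`MASTER_PORT` must point at the rank-0 node.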
Quick Install
```shell
pip install deepspeed==0.13.5
```
Code Evidence
Distributed initialization from `dist_utils.py:6,48-51`:
```python
import deepspeed
import torch

# `rank` and `backend` are supplied by the surrounding launcher setup
num_gpus = torch.cuda.device_count()
torch.cuda.set_device(rank % num_gpus)
deepspeed.init_distributed(dist_backend=backend)
```
DeepSpeed flag usage in training scripts (e.g., `internvl2_5_8b_dynamic_res_2nd_finetune_full.sh`):
```shell
--deepspeed internvl_chat/shell/zero_stage1_config.json
```
DPO Trainer DeepSpeed preparation from `trainer_dpo.py:10,178-198`:
```python
import deepspeed

def _prepare_deepspeed(self, model):
    # Custom DeepSpeed preparation for the DPO reference model
    deepspeed_plugin = self.accelerator.state.deepspeed_plugin
    config_kwargs = {**deepspeed_plugin.deepspeed_config}
    # ... configures ZeRO stage for reference model
```
Launcher selection from `internvl_chat_finetune.py:807-808`:
```python
launcher = os.environ.get('LAUNCHER', 'slurm')
init_dist(launcher=launcher, backend='nccl')
```
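The body of `init_dist` is not shown in the excerpt. As a minimal sketch of what a SLURM-aware dispatcher typically does, the function below translates scheduler-provided variables into the names torch.distributed and DeepSpeed expect; the function and variable handling here are hypothetical, not the repository's implementation.

```python
import os

def resolve_dist_env(launcher: str = 'slurm') -> dict:
    """Map scheduler variables to RANK/WORLD_SIZE/LOCAL_RANK.

    Hypothetical sketch: the real init_dist in the InternVL codebase
    may differ in names and behavior.
    """
    if launcher == 'slurm':
        # SLURM exposes per-task identity via SLURM_* variables
        env = {
            'RANK': os.environ.get('SLURM_PROCID', '0'),
            'WORLD_SIZE': os.environ.get('SLURM_NTASKS', '1'),
            'LOCAL_RANK': os.environ.get('SLURM_LOCALID', '0'),
        }
    else:
        # 'pytorch'-style launchers (torchrun, deepspeed) export these directly
        env = {key: os.environ.get(key, default) for key, default in
               [('RANK', '0'), ('WORLD_SIZE', '1'), ('LOCAL_RANK', '0')]}
    os.environ.update(env)
    return env

# Example: simulate one SLURM task before initializing NCCL/DeepSpeed
os.environ.update({'SLURM_PROCID': '3', 'SLURM_NTASKS': '8', 'SLURM_LOCALID': '3'})
print(resolve_dist_env('slurm'))  # → {'RANK': '3', 'WORLD_SIZE': '8', 'LOCAL_RANK': '3'}
```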
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `DeepSpeed not found` | DeepSpeed package not installed | `pip install deepspeed==0.13.5` |
| `NCCL timeout` | Network issues in multi-node setup | Check `MASTER_ADDR`/`MASTER_PORT` and firewall rules |
| `CUDA OOM with ZeRO Stage 1` | Model too large for Stage 1 | Switch to ZeRO Stage 3 config |
| `RuntimeError: DeepSpeed ZeRO-3 is not compatible` | HuggingFace Trainer version conflict | Ensure `deepspeed==0.13.5` and `transformers==4.37.2` |
Compatibility Notes
- ZeRO Stage Selection: Shell scripts use Stage 1 for models up to 8B parameters and Stage 3 for 26B+ models. Using the wrong stage can cause OOM or excessive communication overhead.
- DeepSpeed must init before HfArgumentParser: A comment in `internvl_chat_finetune.py:806` states: "If use DeepSpeed zero3, init_dist must before HfArgumentParser." This ordering is critical.
- DPO Reference Model: The `MultimodalDPOTrainer` includes custom DeepSpeed preparation (`_prepare_deepspeed`) to handle the reference model separately from the policy model.
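The stage-to-size mapping above can be captured in a small helper. The thresholds below mirror the config list in this document (the 78B boundary appears in two config ranges; this sketch resolves it toward the 100B config); the function itself is hypothetical, not part of the InternVL codebase.

```python
def select_deepspeed_config(model_params_b: float) -> str:
    """Pick a ZeRO config file by model size in billions of parameters.

    Hypothetical helper; thresholds follow the config files listed in
    this document (Stage 1 up to 8B, Stage 3 variants for larger models).
    """
    if model_params_b <= 8:
        return 'zero_stage1_config.json'
    if 34 <= model_params_b <= 40:
        return 'zero_stage3_config_34b.json'
    if model_params_b >= 78:
        return 'zero_stage3_config_100b.json'
    return 'zero_stage3_config.json'

print(select_deepspeed_config(8))    # → zero_stage1_config.json
print(select_deepspeed_config(26))   # → zero_stage3_config.json
print(select_deepspeed_config(108))  # → zero_stage3_config_100b.json
```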