Heuristic: OpenGVLab InternVL Gradient Checkpointing Memory
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning, Training |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Enable gradient checkpointing on the vision encoder by default and optionally on the language model to reduce VRAM usage during training, trading compute for memory.
Description
InternVL enables gradient checkpointing on the InternViT vision encoder unconditionally in all training scripts. For the language model, gradient checkpointing is controlled by the `--grad_checkpoint` flag (default: True). Additionally, `use_cache` is always disabled during training, since KV caching only benefits autoregressive inference. Together these settings significantly reduce peak VRAM: intermediate activations are recomputed during the backward pass rather than stored, and no unused cache tensors are allocated.
Usage
This heuristic is applied automatically in all InternVL training scripts. The vision encoder always has gradient checkpointing enabled. Use `--grad_checkpoint True` (default) to also enable it for the language model. Only disable gradient checkpointing if you have excess VRAM and need faster training.
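A launch might look like the following sketch. The script name and the `--grad_checkpoint` flag come from the InternVL repository; the GPU count, model path, output directory, and any other flags are placeholders you would replace with your own configuration:

```shell
# Illustrative fine-tuning launch. Only internvl_chat_finetune.py and
# --grad_checkpoint are taken from the repo; everything else is a
# placeholder for your setup.
torchrun --nproc_per_node=8 internvl_chat_finetune.py \
  --model_name_or_path ./pretrained/your-internvl-checkpoint \
  --grad_checkpoint True \
  --output_dir ./work_dirs/finetune
```

Passing `--grad_checkpoint False` speeds up training at the cost of higher peak VRAM; the vision encoder keeps checkpointing enabled regardless.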
The Insight (Rule of Thumb)
- Action: Always set `model.vision_model.gradient_checkpointing = True` and `model.language_model.config.use_cache = False` during training.
- Value: Reduces VRAM usage by 40-60% depending on model size.
- Trade-off: ~20-30% slower training due to activation recomputation during the backward pass.
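The memory/compute trade-off in the bullets above can be sketched with a toy counter. This is pure Python with no real tensors; the layer count and segment size are illustrative, not InternVL's actual values:

```python
# Toy model of activation storage with and without gradient checkpointing.
# We count "stored activations", not bytes, to show the shape of the trade-off.

def peak_stored(num_layers, checkpoint_every=None):
    """Peak number of layer activations held for the backward pass.

    Without checkpointing, every layer's output is kept (num_layers).
    With checkpointing every k layers, only segment boundaries are kept
    during the forward pass; the backward pass recomputes one segment at
    a time, so the peak is boundaries + one segment's worth of activations.
    """
    if checkpoint_every is None:
        return num_layers
    boundaries = num_layers // checkpoint_every
    return boundaries + checkpoint_every

full = peak_stored(48)                      # 48 activations kept
ckpt = peak_stored(48, checkpoint_every=7)  # ~sqrt(48) segments: 6 + 7 = 13 kept
print(full, ckpt)
```

The recomputation of each segment during backward is exactly where the ~20-30% training slowdown comes from: roughly one extra forward pass over the checkpointed layers.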
Reasoning
The InternViT-6B vision encoder processes high-resolution images with many tiles, generating large activation tensors. Without gradient checkpointing, storing these activations for backpropagation can consume the majority of GPU memory. By recomputing activations during the backward pass instead of storing them, the peak memory usage is dramatically reduced. The `use_cache = False` setting is critical because KV caching is designed for autoregressive generation, not training; leaving it enabled wastes memory on cache tensors that are never used.
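A back-of-envelope calculation makes the scale concrete. The hidden size and depth below match the published InternViT-6B configuration; the tile count and the "one fp16 tensor per layer" accounting are simplifying assumptions (real usage is higher, since attention and MLP intermediates are ignored):

```python
# Rough peak activation memory for InternViT-6B on one tiled 448px image.
hidden, depth, bytes_fp16 = 3200, 45, 2
tokens_per_tile = (448 // 14) ** 2 + 1      # 1024 patches + 1 CLS token
tiles = 13                                  # assumption: 12 tiles + 1 thumbnail
tokens = tiles * tokens_per_tile            # 13325 tokens

per_layer = tokens * hidden * bytes_fp16    # bytes for one layer's output
stored_full = depth * per_layer             # no checkpointing: keep every layer

segment = 7                                 # checkpoint every ~sqrt(depth) layers
stored_ckpt = (depth // segment + segment) * per_layer  # boundaries + 1 segment

print(f"full: {stored_full / 2**30:.2f} GiB, "
      f"checkpointed: {stored_ckpt / 2**30:.2f} GiB")
```

Even this underestimate shows several GiB of layer outputs per image without checkpointing versus roughly a third of that with it, before batching multiplies everything.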
Code Evidence
From `internvl_chat_finetune.py:975-979`:
```python
model.language_model.config.use_cache = False
model.vision_model.gradient_checkpointing = True
model.vision_model.encoder.gradient_checkpointing = True
if model_args.grad_checkpoint:
    model.language_model._set_gradient_checkpointing()
```
Default value from ModelArguments dataclass:
```python
grad_checkpoint: Optional[bool] = field(
    default=True,
    metadata={'help': 'Set to True to use gradient checkpointing.'}
)
```