
Heuristic:OpenGVLab InternVL Gradient Checkpointing Memory

From Leeroopedia




Knowledge Sources
Domains Optimization, Deep_Learning, Training
Last Updated 2026-02-07 14:00 GMT

Overview

Enable gradient checkpointing on the vision encoder by default and optionally on the language model to reduce VRAM usage during training, trading compute for memory.

Description

InternVL enables gradient checkpointing on the InternViT vision encoder unconditionally in all training scripts. For the language model, gradient checkpointing is controlled by the `--grad_checkpoint` flag (default: True). Additionally, `use_cache` is always disabled during training, since KV caching only benefits autoregressive inference. Together these settings significantly reduce peak VRAM: intermediate activations are recomputed rather than stored, and no memory is spent on cache tensors that training never reads.

Usage

This heuristic is applied automatically in all InternVL training scripts. The vision encoder always has gradient checkpointing enabled. Use `--grad_checkpoint True` (default) to also enable it for the language model. Only disable gradient checkpointing if you have excess VRAM and need faster training.
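The flag logic described above can be sketched in plain Python. The `apply_memory_heuristic` function and the mock model objects below are illustrative stand-ins that mirror the training-script pattern; they are not the InternVL API itself:

```python
from types import SimpleNamespace

def apply_memory_heuristic(model, grad_checkpoint=True):
    """Mirror the InternVL training-script setup: vision-encoder
    checkpointing is unconditional, language-model checkpointing is
    gated by the flag, and the KV cache is always disabled."""
    model.language_model.config.use_cache = False           # KV cache is inference-only
    model.vision_model.gradient_checkpointing = True        # always on for InternViT
    model.vision_model.encoder.gradient_checkpointing = True
    if grad_checkpoint:                                     # --grad_checkpoint (default True)
        model.language_model.gradient_checkpointing = True
    return model

def make_mock_model():
    """Minimal stand-in for the real composite model object."""
    return SimpleNamespace(
        language_model=SimpleNamespace(config=SimpleNamespace(use_cache=True),
                                       gradient_checkpointing=False),
        vision_model=SimpleNamespace(gradient_checkpointing=False,
                                     encoder=SimpleNamespace(gradient_checkpointing=False)),
    )

m = apply_memory_heuristic(make_mock_model())
print(m.language_model.config.use_cache)       # False
print(m.vision_model.gradient_checkpointing)   # True
```

Note that even with `grad_checkpoint=False`, the vision encoder and `use_cache` settings still apply; only the language-model half of the heuristic is optional.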

The Insight (Rule of Thumb)

  • Action: Always set `model.vision_model.gradient_checkpointing = True` and `model.language_model.config.use_cache = False` during training.
  • Value: Reduces VRAM usage by 40-60% depending on model size.
  • Trade-off: ~20-30% slower training due to activation recomputation during the backward pass.

Reasoning

The InternViT-6B vision encoder processes high-resolution images with many tiles, generating large activation tensors. Without gradient checkpointing, storing these activations for backpropagation can consume the majority of GPU memory. By recomputing activations during the backward pass instead of storing them, the peak memory usage is dramatically reduced. The `use_cache = False` setting is critical because KV caching is designed for autoregressive generation, not training; leaving it enabled wastes memory on cache tensors that are never used.
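The store-versus-recompute trade can be illustrated with a toy pipeline. The `checkpointed_backward` helper below is a hand-rolled sketch of the idea (store only segment-boundary activations, recompute the rest during backward), not `torch.utils.checkpoint`; layer counts and segment size are arbitrary:

```python
# Toy model: n layers, each layer i computes x -> x + i.
def layer(i, x):
    return x + i

def plain_backward(n, x):
    """Store every intermediate activation for the backward pass."""
    acts = [x]
    for i in range(n):
        acts.append(layer(i, acts[-1]))
    # A real backward pass would read `acts`; peak storage is n + 1 tensors.
    return acts[-1], len(acts)             # (output, activations stored)

def checkpointed_backward(n, x, segment=4):
    """Store only one activation per segment; everything inside a
    segment is recomputed from its checkpoint when backward needs it."""
    ckpts, cur = [x], x
    for i in range(n):
        cur = layer(i, cur)
        if (i + 1) % segment == 0 and i + 1 < n:
            ckpts.append(cur)              # keep segment boundaries only
    # Backward re-runs the forward inside each segment: roughly one
    # extra forward pass of compute, in exchange for far less storage.
    return cur, len(ckpts)

out, stored = plain_backward(16, 0)
out_c, stored_c = checkpointed_backward(16, 0)
print(out == out_c, stored, stored_c)      # True 17 4
```

Storage drops from O(n) activations to O(n / segment) checkpoints, at the cost of roughly one additional forward pass, which matches the ~20-30% slowdown figure above.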

Code Evidence

From `internvl_chat_finetune.py:975-979`:

model.language_model.config.use_cache = False
model.vision_model.gradient_checkpointing = True
model.vision_model.encoder.gradient_checkpointing = True
if model_args.grad_checkpoint:
    model.language_model._set_gradient_checkpointing()

Default value from ModelArguments dataclass:

grad_checkpoint: Optional[bool] = field(
    default=True,
    metadata={'help': 'Set to True to use gradient checkpointing.'}
)
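The dataclass default above can be exercised with the standard library alone. The real script parses arguments through transformers' HfArgumentParser, so the `argparse` wiring here is a simplified assumption used only to show the flag's default and override behavior:

```python
import argparse
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:
    grad_checkpoint: Optional[bool] = field(
        default=True,
        metadata={'help': 'Set to True to use gradient checkpointing.'}
    )

def parse(argv):
    p = argparse.ArgumentParser()
    # Booleans need explicit string parsing: with type=bool,
    # `--grad_checkpoint False` would parse as truthy.
    p.add_argument('--grad_checkpoint', type=lambda s: s.lower() == 'true',
                   default=ModelArguments.grad_checkpoint)
    ns = p.parse_args(argv)
    return ModelArguments(grad_checkpoint=ns.grad_checkpoint)

print(parse([]).grad_checkpoint)                               # True
print(parse(['--grad_checkpoint', 'False']).grad_checkpoint)   # False
```

With no flag on the command line, checkpointing for the language model stays enabled, which is why the heuristic applies by default.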
