
Heuristic:OpenGVLab InternVL Gradient Checkpointing Memory

From Leeroopedia




Knowledge Sources
Domains Optimization, Deep_Learning, Training
Last Updated 2026-02-07 14:00 GMT

Overview

Enable gradient checkpointing on the vision encoder by default and optionally on the language model to reduce VRAM usage during training, trading compute for memory.

Description

InternVL enables gradient checkpointing on the InternViT vision encoder unconditionally in all training scripts. For the language model, gradient checkpointing is controlled by the `--grad_checkpoint` flag (default: True). Additionally, `use_cache` is always disabled during training, since KV caching only benefits autoregressive inference. Together these settings significantly reduce peak VRAM: intermediate activations are recomputed rather than stored, and no memory is spent on cache tensors that training never reads.

Usage

This heuristic is applied automatically in all InternVL training scripts. The vision encoder always has gradient checkpointing enabled. Use `--grad_checkpoint True` (default) to also enable it for the language model. Only disable gradient checkpointing if you have excess VRAM and need faster training.
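The flag logic described above can be sketched in plain Python. The `apply_memory_heuristic` function and the mock model objects below are illustrative stand-ins that mirror the training-script pattern; they are not the InternVL API itself:

```python
from types import SimpleNamespace

def apply_memory_heuristic(model, grad_checkpoint=True):
    """Mirror the InternVL training-script setup: vision-encoder
    checkpointing is unconditional, language-model checkpointing is
    gated by the flag, and the KV cache is always disabled."""
    model.language_model.config.use_cache = False           # KV cache is inference-only
    model.vision_model.gradient_checkpointing = True        # always on for InternViT
    model.vision_model.encoder.gradient_checkpointing = True
    if grad_checkpoint:                                     # --grad_checkpoint (default True)
        model.language_model.gradient_checkpointing = True
    return model

def make_mock_model():
    """Minimal stand-in for the real composite model object."""
    return SimpleNamespace(
        language_model=SimpleNamespace(config=SimpleNamespace(use_cache=True),
                                       gradient_checkpointing=False),
        vision_model=SimpleNamespace(gradient_checkpointing=False,
                                     encoder=SimpleNamespace(gradient_checkpointing=False)),
    )

m = apply_memory_heuristic(make_mock_model())
print(m.language_model.config.use_cache)       # False
print(m.vision_model.gradient_checkpointing)   # True
```

Note that even with `grad_checkpoint=False`, the vision encoder and `use_cache` settings still apply; only the language-model half of the heuristic is optional.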

The Insight (Rule of Thumb)

  • Action: Always set `model.vision_model.gradient_checkpointing = True` and `model.language_model.config.use_cache = False` during training.
  • Value: Reduces VRAM usage by 40-60% depending on model size.
  • Trade-off: ~20-30% slower training due to activation recomputation during the backward pass.

Reasoning

The InternViT-6B vision encoder processes high-resolution images with many tiles, generating large activation tensors. Without gradient checkpointing, storing these activations for backpropagation can consume the majority of GPU memory. By recomputing activations during the backward pass instead of storing them, the peak memory usage is dramatically reduced. The `use_cache = False` setting is critical because KV caching is designed for autoregressive generation, not training; leaving it enabled wastes memory on cache tensors that are never used.
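The store-versus-recompute trade can be illustrated with a toy pipeline. The `checkpointed_backward` helper below is a hand-rolled sketch of the idea (store only segment-boundary activations, recompute the rest during backward), not `torch.utils.checkpoint`; layer counts and segment size are arbitrary:

```python
# Toy model: n layers, each layer i computes x -> x + i.
def layer(i, x):
    return x + i

def plain_backward(n, x):
    """Store every intermediate activation for the backward pass."""
    acts = [x]
    for i in range(n):
        acts.append(layer(i, acts[-1]))
    # A real backward pass would read `acts`; peak storage is n + 1 tensors.
    return acts[-1], len(acts)             # (output, activations stored)

def checkpointed_backward(n, x, segment=4):
    """Store only one activation per segment; everything inside a
    segment is recomputed from its checkpoint when backward needs it."""
    ckpts, cur = [x], x
    for i in range(n):
        cur = layer(i, cur)
        if (i + 1) % segment == 0 and i + 1 < n:
            ckpts.append(cur)              # keep segment boundaries only
    # Backward re-runs the forward inside each segment: roughly one
    # extra forward pass of compute, in exchange for far less storage.
    return cur, len(ckpts)

out, stored = plain_backward(16, 0)
out_c, stored_c = checkpointed_backward(16, 0)
print(out == out_c, stored, stored_c)      # True 17 4
```

Storage drops from O(n) activations to O(n / segment) checkpoints, at the cost of roughly one additional forward pass, which matches the ~20-30% slowdown figure above.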

Code Evidence

From `internvl_chat_finetune.py:975-979`:

model.language_model.config.use_cache = False
model.vision_model.gradient_checkpointing = True
model.vision_model.encoder.gradient_checkpointing = True
if model_args.grad_checkpoint:
    model.language_model._set_gradient_checkpointing()

Default value from ModelArguments dataclass:

grad_checkpoint: Optional[bool] = field(
    default=True,
    metadata={'help': 'Set to True to use gradient checkpointing.'}
)
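The dataclass default above can be exercised with the standard library alone. The real script parses arguments through transformers' HfArgumentParser, so the `argparse` wiring here is a simplified assumption used only to show the flag's default and override behavior:

```python
import argparse
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:
    grad_checkpoint: Optional[bool] = field(
        default=True,
        metadata={'help': 'Set to True to use gradient checkpointing.'}
    )

def parse(argv):
    p = argparse.ArgumentParser()
    # Booleans need explicit string parsing: with type=bool,
    # `--grad_checkpoint False` would parse as truthy.
    p.add_argument('--grad_checkpoint', type=lambda s: s.lower() == 'true',
                   default=ModelArguments.grad_checkpoint)
    ns = p.parse_args(argv)
    return ModelArguments(grad_checkpoint=ns.grad_checkpoint)

print(parse([]).grad_checkpoint)                               # True
print(parse(['--grad_checkpoint', 'False']).grad_checkpoint)   # False
```

With no flag on the command line, checkpointing for the language model stays enabled, which is why the heuristic applies by default.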
