Heuristic: OpenGVLab InternVL Packed Training Buffer Management
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training, Deep_Learning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Packed training uses greedy bin-packing with configurable buffer management (`max_buffer_size=20`, overflow at 50% buffer capacity, `min_active_tokens_ratio=1/256`) to fill GPU batches efficiently.
Description
InternVL's packed training system concatenates multiple short sequences into a single long sequence (up to `max_packed_tokens=8192` tokens) to reduce padding waste. The packing algorithm maintains a buffer of partially-filled sequences and uses a greedy approach to find the best-fit buffer for each new sample. Key parameters control buffer behavior: `max_buffer_size` limits memory usage, `allow_overflow` permits using half-full buffers when resources are tight, and `min_active_tokens_ratio` (1/256) filters out samples that are almost entirely padding/ignored tokens.
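The best-fit step can be sketched as follows. This is a minimal illustration of greedy best-fit bin-packing, not InternVL's actual implementation; the function name and list-of-lengths representation are assumptions.

```python
MAX_PACKED_TOKENS = 8192  # mirrors max_packed_tokens above

def find_best_fit_buffer(buffer_lengths, sample_len, max_tokens=MAX_PACKED_TOKENS):
    """Return the index of the fullest buffer that still fits `sample_len`, or None.

    Picking the fullest buffer that fits (best-fit) closes out nearly-full
    sequences quickly, which minimizes padding waste per emitted pack.
    """
    best_idx, best_len = None, -1
    for idx, used in enumerate(buffer_lengths):
        if used + sample_len <= max_tokens and used > best_len:
            best_idx, best_len = idx, used
    return best_idx

# A 1000-token sample fits buffers at 5000 and 2000 tokens, but not 7900;
# best-fit picks the fullest candidate (index 0).
print(find_best_fit_buffer([5000, 2000, 7900], 1000))  # 0
```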
Usage
Apply this heuristic when using packed sequence training (via the `PackedDataset` class). Key parameters to tune:
- `--max_packed_tokens 8192` (max tokens per packed sequence)
- `--max_buffer_size 20` (max partially-filled sequences in memory)
- `--num_images_expected 40` (max images per packed sequence)
- `--strict_mode True` (pad to exact image count)
The Insight (Rule of Thumb)
- Action: Use `max_buffer_size=20` and `max_packed_tokens=8192` for efficient packing.
- Value: Minimum active token ratio = 1/256 (~0.39%) for sample validity. Overflow threshold at 50% buffer capacity.
- Trade-off: Larger buffers improve packing efficiency but increase memory. Overflow allows faster throughput at the cost of slightly underfilled batches.
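In concrete numbers, the defaults above imply the following thresholds (simple arithmetic on the documented parameters; variable names are illustrative):

```python
# Derived thresholds from the default configuration.
max_packed_tokens = 8192
max_buffer_size = 20
min_active_tokens_ratio = 1 / 256

# Overflow: half-full buffer lists become eligible at 20 // 2 = 10 buffers.
overflow_threshold = max_buffer_size // 2

# Validity: a full 8192-token pack must have strictly more than 32 active
# (non-ignored) tokens to pass the 1/256 check.
min_active_tokens = int(max_packed_tokens * min_active_tokens_ratio)

print(overflow_threshold, min_active_tokens)  # 10 32
```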
Reasoning
Without packing, short sequences waste GPU cycles on padding tokens. The greedy bin-packing algorithm finds the buffer closest to full for each new sample, minimizing wasted tokens. The 1/256 validity threshold prevents degenerate samples where almost all tokens are masked (IGNORE_TOKEN_ID=-100), which would contribute no useful gradient. The 50% overflow threshold is a practical trade-off: when buffers start filling up, accepting slightly underfilled sequences prevents deadlocks and improves throughput.
The buffer also handles image splitting carefully: sequences cannot be cut in the middle of an image token span (between IMG_START_TOKEN and IMG_END_TOKEN), ensuring visual integrity.
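One way to respect image-span integrity is to slide a proposed cut point back until it falls outside any image span. This is a hedged sketch of that idea, not InternVL's code; the integer marker values and function name are stand-ins for the real IMG_START_TOKEN / IMG_END_TOKEN handling.

```python
# Illustrative token ids standing in for IMG_START_TOKEN / IMG_END_TOKEN.
IMG_START, IMG_END = 101, 102

def safe_cut_point(token_ids, desired_cut):
    """Return the largest cut position <= desired_cut that does not split an image span."""
    depth = 0      # >0 while inside an IMG_START ... IMG_END span
    last_safe = 0  # last position where a cut would not split an image
    for i, tok in enumerate(token_ids[:desired_cut]):
        if tok == IMG_START:
            depth += 1
        elif tok == IMG_END:
            depth -= 1
        if depth == 0:
            last_safe = i + 1
    return last_safe

tokens = [1, IMG_START, 5, 5, IMG_END, 2, 3]
print(safe_cut_point(tokens, 3))  # 1: cutting at 3 would split the image, so back up
print(safe_cut_point(tokens, 6))  # 6: cut lies outside any image span, kept as-is
```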
Code Evidence
Validity check from `dataset_packed.py:247-250`:
```python
@staticmethod
def check_valid(sample_to_check, min_active_tokens_ratio=1/256):
    num_ignore_tokens = (sample_to_check['labels'] == IGNORE_TOKEN_ID).sum()
    num_tokens = sample_to_check['labels'].numel()
    return (1 - num_ignore_tokens / num_tokens) > min_active_tokens_ratio
Overflow threshold from `dataset_packed.py:226-228`:
```python
if self.allow_overflow and len(buffer_list) >= self.max_buffer_size // 2:
    find = True
    find_idx = buffer_idx
```
Dataset limit auto-correction from `dataset_packed.py:115-120`:
```python
if ds.max_num_images > self.num_images_expected:
    logger.warning(f'{ds.max_num_images=} of {ds.ds_name} is larger '
                   f'than {self.num_images_expected=}')
    ds.max_num_images = self.num_images_expected
if ds.max_tokens > self.max_packed_tokens:
    logger.warning(f'{ds.max_tokens=} of {ds.ds_name} is larger '
                   f'than {self.max_packed_tokens=}')
    ds.max_tokens = self.max_packed_tokens
```