Heuristic: OpenGVLab InternVL Packed Training Buffer Management
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training, Deep_Learning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Packed training uses greedy bin-packing with configurable buffer management (`max_buffer_size=20`, overflow at 50% buffer capacity, `min_active_tokens_ratio=1/256`) to fill GPU batches efficiently.
Description
InternVL's packed training system concatenates multiple short sequences into a single long sequence (up to `max_packed_tokens=8192` tokens) to reduce padding waste. The packing algorithm maintains a buffer of partially-filled sequences and uses a greedy approach to find the best-fit buffer for each new sample. Key parameters control buffer behavior: `max_buffer_size` limits memory usage, `allow_overflow` permits using half-full buffers when resources are tight, and `min_active_tokens_ratio` (1/256) filters out samples that are almost entirely padding/ignored tokens.
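The best-fit step can be sketched as follows. This is a minimal illustration of greedy best-fit bin-packing, not InternVL's actual implementation; the function name and list-of-lengths representation are assumptions.

```python
MAX_PACKED_TOKENS = 8192  # mirrors max_packed_tokens above

def find_best_fit_buffer(buffer_lengths, sample_len, max_tokens=MAX_PACKED_TOKENS):
    """Return the index of the fullest buffer that still fits `sample_len`, or None.

    Picking the fullest buffer that fits (best-fit) closes out nearly-full
    sequences quickly, which minimizes padding waste per emitted pack.
    """
    best_idx, best_len = None, -1
    for idx, used in enumerate(buffer_lengths):
        if used + sample_len <= max_tokens and used > best_len:
            best_idx, best_len = idx, used
    return best_idx

# A 1000-token sample fits buffers at 5000 and 2000 tokens, but not 7900;
# best-fit picks the fullest candidate (index 0).
print(find_best_fit_buffer([5000, 2000, 7900], 1000))  # 0
```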
Usage
Apply this heuristic when using packed sequence training (via the `PackedDataset` class). Key parameters to tune:
- `--max_packed_tokens 8192` (max tokens per packed sequence)
- `--max_buffer_size 20` (max partially-filled sequences in memory)
- `--num_images_expected 40` (max images per packed sequence)
- `--strict_mode True` (pad to exact image count)
The Insight (Rule of Thumb)
- Action: Use `max_buffer_size=20` and `max_packed_tokens=8192` for efficient packing.
- Value: Minimum active token ratio = 1/256 (~0.39%) for sample validity. Overflow threshold at 50% buffer capacity.
- Trade-off: Larger buffers improve packing efficiency but increase memory. Overflow allows faster throughput at the cost of slightly underfilled batches.
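In concrete numbers, the defaults above imply the following thresholds (simple arithmetic on the documented parameters; variable names are illustrative):

```python
# Derived thresholds from the default configuration.
max_packed_tokens = 8192
max_buffer_size = 20
min_active_tokens_ratio = 1 / 256

# Overflow: half-full buffer lists become eligible at 20 // 2 = 10 buffers.
overflow_threshold = max_buffer_size // 2

# Validity: a full 8192-token pack must have strictly more than 32 active
# (non-ignored) tokens to pass the 1/256 check.
min_active_tokens = int(max_packed_tokens * min_active_tokens_ratio)

print(overflow_threshold, min_active_tokens)  # 10 32
```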
Reasoning
Without packing, short sequences waste GPU cycles on padding tokens. The greedy bin-packing algorithm finds the buffer closest to full for each new sample, minimizing wasted tokens. The 1/256 validity threshold prevents degenerate samples where almost all tokens are masked (IGNORE_TOKEN_ID=-100), which would contribute no useful gradient. The 50% overflow threshold is a practical trade-off: when buffers start filling up, accepting slightly underfilled sequences prevents deadlocks and improves throughput.
The buffer also handles image splitting carefully: sequences cannot be cut in the middle of an image token span (between IMG_START_TOKEN and IMG_END_TOKEN), ensuring visual integrity.
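One way to respect image-span integrity is to slide a proposed cut point back until it falls outside any image span. This is a hedged sketch of that idea, not InternVL's code; the integer marker values and function name are stand-ins for the real IMG_START_TOKEN / IMG_END_TOKEN handling.

```python
# Illustrative token ids standing in for IMG_START_TOKEN / IMG_END_TOKEN.
IMG_START, IMG_END = 101, 102

def safe_cut_point(token_ids, desired_cut):
    """Return the largest cut position <= desired_cut that does not split an image span."""
    depth = 0      # >0 while inside an IMG_START ... IMG_END span
    last_safe = 0  # last position where a cut would not split an image
    for i, tok in enumerate(token_ids[:desired_cut]):
        if tok == IMG_START:
            depth += 1
        elif tok == IMG_END:
            depth -= 1
        if depth == 0:
            last_safe = i + 1
    return last_safe

tokens = [1, IMG_START, 5, 5, IMG_END, 2, 3]
print(safe_cut_point(tokens, 3))  # 1: cutting at 3 would split the image, so back up
print(safe_cut_point(tokens, 6))  # 6: cut lies outside any image span, kept as-is
```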
Code Evidence
Validity check from `dataset_packed.py:247-250`:
```python
@staticmethod
def check_valid(sample_to_check, min_active_tokens_ratio=1/256):
    num_ignore_tokens = (sample_to_check['labels'] == IGNORE_TOKEN_ID).sum()
    num_tokens = sample_to_check['labels'].numel()
    return (1 - num_ignore_tokens / num_tokens) > min_active_tokens_ratio
Overflow threshold from `dataset_packed.py:226-228`:
```python
if self.allow_overflow and len(buffer_list) >= self.max_buffer_size // 2:
    find = True
    find_idx = buffer_idx
```
Dataset limit auto-correction from `dataset_packed.py:115-120`:
```python
if ds.max_num_images > self.num_images_expected:
    logger.warning(f'{ds.max_num_images=} of {ds.ds_name} is larger '
                   f'than {self.num_images_expected=}')
    ds.max_num_images = self.num_images_expected
if ds.max_tokens > self.max_packed_tokens:
    logger.warning(f'{ds.max_tokens=} of {ds.ds_name} is larger '
                   f'than {self.max_packed_tokens=}')
    ds.max_tokens = self.max_packed_tokens
```