Heuristic: NVIDIA DALI Batch Size Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Batch size selection and learning rate scaling heuristics for maximizing GPU utilization while avoiding OOM errors during DALI-accelerated training.
Description
Batch size is the primary knob controlling GPU memory usage and training throughput. DALI pipelines process data in fixed-size batches specified at pipeline creation time. The batch size interacts with prefetch queue depth (each buffered batch consumes memory), model size, and gradient accumulation. The ResNet50 example uses 256 as the default per-GPU batch size, with linear learning rate scaling based on the global batch size across all GPUs.
Usage
Use this heuristic when configuring DALI pipeline batch size for training or when encountering CUDA OOM errors. Also apply the learning rate scaling rule when changing batch size or GPU count in distributed training.
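The "largest batch that fits" rule can be automated with a simple halving probe. This is a framework-agnostic sketch, not DALI API: `try_batch` is a hypothetical user-supplied callable that runs one training step at a given batch size and raises `RuntimeError` on CUDA OOM (PyTorch surfaces OOM as a `RuntimeError`).

```python
def find_max_batch_size(try_batch, start=256, floor=1):
    """Halve the batch size until one training step succeeds.

    try_batch: callable(batch_size) that raises RuntimeError on CUDA OOM.
    Returns the largest power-of-two fraction of `start` that fits,
    or None if even `floor` fails.
    """
    bs = start
    while bs >= floor:
        try:
            try_batch(bs)          # run one step at this batch size
            return bs              # success: this size fits in memory
        except RuntimeError:       # assume OOM surfaces as RuntimeError
            bs //= 2               # halve and retry
    return None
```

Because DALI pipelines fix `batch_size` at creation time, each probe step must rebuild the pipeline; the probe is therefore a one-time tuning pass, not something to run inside the training loop.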
The Insight (Rule of Thumb)
- Action: Set `batch_size` to the largest value that fits in GPU memory without OOM. Start with 256 for ResNet50-class models on V100/A100 GPUs.
- Value: ResNet50 default: `batch_size=256` per GPU. Video pipelines: `batch_size=4` (much larger per-sample memory). Scale learning rate: `lr_new = lr_base * (batch_size * world_size) / 256`.
- Trade-off: Larger batches = higher GPU utilization and throughput, but also higher memory usage; very large global batches may require learning rate warmup to converge. Smaller batches = lower memory use but lower utilization.
- Prefetch Impact: Each batch buffered in the prefetch queue consumes full batch memory. With `prefetch_queue_depth=2`, you need memory for 2x batch_size worth of decoded data.
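The prefetch multiplier can be estimated with back-of-envelope arithmetic. This is a rough sketch, not a DALI measurement: the 224x224x3 uint8 shape is an assumption matching typical ImageNet pipelines, and it counts only the decoded-data buffers, not model weights, activations, or decoder workspace.

```python
def decoded_batch_bytes(batch_size, height, width, channels=3,
                        dtype_bytes=1, prefetch_queue_depth=2):
    """Approximate GPU memory held by the pipeline's decoded output:
    every batch buffered in the prefetch queue keeps a full copy resident."""
    per_sample = height * width * channels * dtype_bytes
    return batch_size * per_sample * prefetch_queue_depth

# ResNet50-style pipeline: 256 uint8 images at 224x224x3, 2 queued batches
total = decoded_batch_bytes(256, 224, 224)          # bytes
total_mb = total / 2**20                            # roughly 73.5 MiB
```

Doubling `prefetch_queue_depth` doubles this figure, which is why deep queues and large batches compete for the same memory budget.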
Reasoning
GPU compute units are most efficient when operating on large, contiguous data. With DALI, the decoded image data sits in GPU memory, so batch_size directly determines the GPU memory footprint of the data loading pipeline. The standard reference batch size of 256 comes from the original ResNet paper and ImageNet training conventions. The linear learning rate scaling rule (`lr * batch_size * world_size / 256`) maintains convergence quality when scaling to multiple GPUs.
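The scaling rule from the ResNet50 example can be written as a small helper. This is a direct transcription of the formula above; the function name and the example numbers (8 GPUs, base lr 0.1) are illustrative assumptions.

```python
def scale_lr(base_lr, batch_size, world_size, reference_batch=256):
    """Linear learning-rate scaling: lr grows in proportion to the
    global batch size relative to the reference batch of 256."""
    global_batch = batch_size * world_size
    return base_lr * global_batch / reference_batch

# 8 GPUs at 256 per GPU -> global batch 2048 -> lr scaled 8x
lr = scale_lr(0.1, batch_size=256, world_size=8)   # 0.8
```

The same call covers the single-GPU case: with `world_size=1` and `batch_size=256`, the base learning rate is returned unchanged.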
For video processing, the per-sample memory is much larger (multiple frames per sequence), requiring smaller batch sizes (e.g., 4) to fit in GPU memory.
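The per-sample gap between image and video pipelines is easy to quantify. A hedged sketch with assumed numbers: 16 frames per sequence at 224x224x3 uint8 (the actual example's resolution and sequence length may differ).

```python
def sample_bytes(frames, height, width, channels=3, dtype_bytes=1):
    """Decoded memory for one sample: a single image is frames=1,
    a video clip is frames=N of the same resolution."""
    return frames * height * width * channels * dtype_bytes

image = sample_bytes(1, 224, 224)    # one decoded frame
clip = sample_bytes(16, 224, 224)    # 16-frame sequence: 16x the memory
```

At 16x the per-sample footprint, a video batch of 4 occupies roughly as much decoded memory as an image batch of 64, which is why the video example drops `BATCH_SIZE` to 4.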
Code Evidence
Default batch size from `docs/examples/use_cases/pytorch/resnet50/main.py:71`:
```python
parser.add_argument('-b', '--batch-size', default=256, type=int,
                    metavar='N', help='mini-batch size per device (default: 256)')
```
Learning rate scaling from `docs/examples/use_cases/pytorch/resnet50/main.py:258`:
```python
args.lr = args.lr * float(args.batch_size * args.world_size) / 256.
```
Video batch size from `docs/examples/sequence_processing/video/video_label_example.py:24`:
```python
BATCH_SIZE = 4  # Much smaller due to per-sample memory (multiple frames)
```
Prefetch queue depth from `dali/python/nvidia/dali/pipeline.py:126-137`:
```python
# prefetch_queue_depth=2 (default) means 2 batches buffered ahead
# Each batch consumes full batch_size worth of decoded data in GPU memory
```