# Heuristic: NVIDIA DALI Last Batch Policy Selection
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Optimization |
| Last Updated | 2026-02-08 16:00 GMT |
## Overview
Selection guide for DALI's three last-batch policies (FILL, DROP, PARTIAL) that control how incomplete final batches are handled at epoch boundaries.
## Description
When a dataset size is not evenly divisible by the batch size, the final batch in an epoch is incomplete. DALI provides three policies to handle this: FILL (pad the batch by repeating samples), DROP (discard the incomplete batch), and PARTIAL (return the batch with fewer samples). The choice affects training accuracy, distributed training correctness, and evaluation metrics. For distributed training with sharding, the policy interacts with `pad_last_batch` in the reader.
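As a concrete sketch of the three behaviors, consider a dataset of 10 samples with batch size 3. The helper below is a pure-Python model of the semantics described above, not DALI's implementation, and the function name `batches_per_epoch` is ours:

```python
from enum import Enum

class LastBatchPolicy(Enum):
    FILL = 0
    DROP = 1
    PARTIAL = 2

def batches_per_epoch(dataset_size: int, batch_size: int,
                      policy: LastBatchPolicy) -> tuple:
    """Return (number_of_batches, size_of_last_batch) for one epoch."""
    full, remainder = divmod(dataset_size, batch_size)
    if remainder == 0:
        return full, batch_size
    if policy is LastBatchPolicy.DROP:
        return full, batch_size           # incomplete batch is discarded
    if policy is LastBatchPolicy.PARTIAL:
        return full + 1, remainder        # last batch is smaller
    return full + 1, batch_size           # FILL: padded up to batch_size
```

With 10 samples and batch size 3, FILL yields 4 batches of 3 (two padded samples), DROP yields 3 batches (one sample lost), and PARTIAL yields 4 batches with the last holding a single sample.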
## Usage
Use this heuristic when configuring DALI iterators (`DALIClassificationIterator`, `DALIGenericIterator`) for training or validation. Particularly critical for distributed training where all ranks must process the same number of batches.
## The Insight (Rule of Thumb)
- Action: Choose `last_batch_policy` based on your use case:
  - Training (single GPU): `LastBatchPolicy.FILL` (the default) or `LastBatchPolicy.DROP` to maintain consistent batch sizes.
  - Training (distributed): `LastBatchPolicy.PARTIAL` plus `pad_last_batch=True` in the reader, so all ranks process the same number of batches.
  - Validation/Evaluation: `LastBatchPolicy.PARTIAL` to avoid counting duplicated samples in accuracy metrics.
- Value: The ResNet50 example uses `LastBatchPolicy.PARTIAL` with `pad_last_batch=True` and `reader_name="Reader"` so shard sizes are calculated automatically.
- Trade-off: FILL inflates the effective dataset size; DROP loses samples; PARTIAL requires handling variable-size batches.
## Reasoning
In distributed training with N GPUs, the dataset is sharded into N non-overlapping partitions. If the shard sizes differ (dataset size not evenly divisible by N), some ranks run out of data earlier than others, and collective operations such as NCCL `AllReduce` hang because the ranks still training wait on peers that have already finished their epoch. Setting `pad_last_batch=True` equalizes the shards by repeating each shard's last sample. Combined with `LastBatchPolicy.PARTIAL`, every rank produces the same number of batches while the incomplete final batch is clearly marked as partial rather than silently padded with duplicates.
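The shard-size arithmetic can be sketched in plain Python. The helper `shard_sizes` is ours, modeling the behavior described above rather than DALI's actual reader code:

```python
import math

def shard_sizes(dataset_size: int, num_shards: int,
                pad_last_batch: bool) -> list:
    """Samples each rank sees per epoch under contiguous sharding."""
    if pad_last_batch:
        # Every shard is padded up to the largest shard size by
        # repeating its last sample, so all ranks stay in lockstep.
        return [math.ceil(dataset_size / num_shards)] * num_shards
    base, extra = divmod(dataset_size, num_shards)
    return [base + 1 if i < extra else base for i in range(num_shards)]
```

For example, 10 samples over 4 shards gives unpadded shard sizes of 3, 3, 2, 2, so ranks 2 and 3 exhaust their data one sample early; with padding, every shard holds 3 samples.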
For evaluation, FILL would count padded samples in accuracy metrics, giving inflated numbers. PARTIAL returns the exact remaining samples, ensuring correct metric computation.
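A toy illustration (plain Python, no DALI involved) of how FILL can skew a validation metric: duplicated padding samples are counted again, and when the repeated samples happen to be classified correctly, the reported accuracy is inflated.

```python
def accuracy(correct_flags) -> float:
    """Fraction of predictions marked correct (1) vs incorrect (0)."""
    return sum(correct_flags) / len(correct_flags)

# 10 real validation samples, 7 classified correctly: true accuracy is 0.7.
real = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# FILL pads the final batch by repeating samples; if the repeats were
# correctly classified, the metric rises to 10/13, about 0.769.
padded = real + [1, 1, 1]
```

PARTIAL sidesteps this entirely by returning only the real remaining samples.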
## Code Evidence
From `docs/examples/use_cases/pytorch/resnet50/main.py:312-331`:
```python
pipe = create_dali_pipeline(
    batch_size=batch_size,
    shard_id=args.local_rank,
    num_shards=args.world_size,
    pad_last_batch=True,   # Ensure all shards have the same number of samples
    is_training=True,
)
pipe.build()
train_loader = DALIClassificationIterator(
    pipe,
    reader_name="Reader",  # Auto-detect shard size for correct epoch length
    last_batch_policy=LastBatchPolicy.PARTIAL,
    auto_reset=True,
)
```
Last batch policy definitions from `dali/python/nvidia/dali/plugin/base_iterator.py:37-51`:
```python
class LastBatchPolicy(Enum):
    FILL = 0     # Pad incomplete batch by repeating/wrapping samples
    DROP = 1     # Discard the incomplete last batch
    PARTIAL = 2  # Return incomplete batch as-is (fewer samples)