
Heuristic:NVIDIA DALI Last Batch Policy Selection

From Leeroopedia





Knowledge Sources
Domains Deep_Learning, Optimization
Last Updated 2026-02-08 16:00 GMT

Overview

Selection guide for DALI's three last-batch policies (FILL, DROP, PARTIAL) that control how incomplete final batches are handled at epoch boundaries.

Description

When a dataset size is not evenly divisible by the batch size, the final batch in an epoch is incomplete. DALI provides three policies to handle this: FILL (pad the batch by repeating samples), DROP (discard the incomplete batch), and PARTIAL (return the batch with fewer samples). The choice affects training accuracy, distributed training correctness, and evaluation metrics. For distributed training with sharding, the policy interacts with `pad_last_batch` in the reader.
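As a concrete illustration, the batch sequence each policy yields can be sketched in a few lines of pure Python (a toy model, not DALI code; the function name `batch_sizes` is invented for this example):

```python
import math

def batch_sizes(num_samples, batch_size, policy):
    """Simulate the sequence of batch sizes one epoch yields under each policy."""
    full, rem = divmod(num_samples, batch_size)
    if policy == "FILL":      # pad the incomplete batch by repeating samples
        return [batch_size] * math.ceil(num_samples / batch_size)
    if policy == "DROP":      # discard the incomplete batch entirely
        return [batch_size] * full
    if policy == "PARTIAL":   # return the incomplete batch with fewer samples
        return [batch_size] * full + ([rem] if rem else [])
    raise ValueError(policy)

print(batch_sizes(10, 3, "FILL"))     # [3, 3, 3, 3] -> 12 samples seen, 2 duplicated
print(batch_sizes(10, 3, "DROP"))     # [3, 3, 3]    -> 1 real sample never seen
print(batch_sizes(10, 3, "PARTIAL"))  # [3, 3, 3, 1] -> exactly 10 samples
```

With 10 samples and batch size 3, FILL inflates the epoch to 12 samples, DROP silently skips one, and only PARTIAL processes exactly the dataset.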

Usage

Use this heuristic when configuring DALI iterators (`DALIClassificationIterator`, `DALIGenericIterator`) for training or validation. It is particularly critical for distributed training, where all ranks must process the same number of batches.

The Insight (Rule of Thumb)

  • Action: Choose `last_batch_policy` based on your use case:
    • Training (single GPU): `LastBatchPolicy.FILL` (default) or `LastBatchPolicy.DROP` to maintain consistent batch sizes.
    • Training (distributed): `LastBatchPolicy.PARTIAL` + `pad_last_batch=True` in the reader to ensure all ranks process the same number of batches.
    • Validation/Evaluation: `LastBatchPolicy.PARTIAL` to avoid counting duplicate samples in accuracy metrics.
  • Value: The ResNet50 example uses `LastBatchPolicy.PARTIAL` with `pad_last_batch=True` and `reader_name="Reader"` for auto-calculated shard sizes.
  • Trade-off: FILL inflates the effective dataset size; DROP loses samples; PARTIAL requires handling variable-size batches.

Reasoning

In distributed training with N GPUs, the dataset is sharded into N non-overlapping partitions. If the shard sizes differ (dataset not evenly divisible by N), some ranks finish their epoch earlier than others and stop participating in collectives, so the NCCL `AllReduce` calls on the remaining ranks hang waiting for the ranks that already finished. Setting `pad_last_batch=True` ensures all shards have the same number of samples by repeating the last sample. Combined with `LastBatchPolicy.PARTIAL`, this ensures all ranks produce the same number of batches while the incomplete final batch is clearly marked as partial (not padded with duplicates).
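The shard-size arithmetic can be modeled in a short sketch (a simplified model of DALI's sharding, with the invented helper `shard_sizes`; actual shard boundaries are computed inside the reader):

```python
import math

def shard_sizes(num_samples, num_shards, pad_last_batch):
    """Per-shard sample counts under a simplified model of DALI sharding."""
    if pad_last_batch:
        # every shard is padded up to the size of the largest shard,
        # so all ranks iterate the same number of samples (and batches)
        return [math.ceil(num_samples / num_shards)] * num_shards
    # without padding, the remainder samples land on the first few shards
    base, rem = divmod(num_samples, num_shards)
    return [base + (1 if i < rem else 0) for i in range(num_shards)]

print(shard_sizes(1000, 3, pad_last_batch=False))  # [334, 333, 333] -> ranks disagree
print(shard_sizes(1000, 3, pad_last_batch=True))   # [334, 334, 334] -> equal batch counts
```

With 1000 samples across 3 GPUs, rank 0 would otherwise see one extra sample; padding equalizes the shards so no rank's collectives hang.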

For evaluation, FILL would count padded samples in accuracy metrics, giving inflated numbers. PARTIAL returns the exact remaining samples, ensuring correct metric computation.
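The metric distortion can be demonstrated with a toy simulation (the function `accuracy` and the wrap-around padding scheme are illustrative assumptions, not DALI's exact padding behavior):

```python
def accuracy(correct_flags, batch_size, policy):
    """Toy epoch accuracy under each last-batch policy."""
    n = len(correct_flags)
    full, rem = divmod(n, batch_size)
    if policy == "PARTIAL":
        seen = correct_flags                       # every real sample, nothing else
    elif policy == "DROP":
        seen = correct_flags[: full * batch_size]  # tail samples never evaluated
    elif policy == "FILL":
        # last batch padded with duplicates (here: wrapping to the start)
        pad = batch_size - rem if rem else 0
        seen = correct_flags + correct_flags[:pad]
    return sum(seen) / len(seen)

flags = [True] * 9 + [False]          # 10 samples, true accuracy 0.9
print(accuracy(flags, 3, "PARTIAL"))  # 0.9   (exact)
print(accuracy(flags, 3, "DROP"))     # 1.0   (misclassified tail sample dropped)
print(accuracy(flags, 3, "FILL"))     # ~0.917 (duplicates dilute the error)
```

Only PARTIAL reproduces the true 0.9; DROP and FILL both report inflated numbers because they discard or double-count samples at the epoch boundary.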

Code Evidence

From `docs/examples/use_cases/pytorch/resnet50/main.py:312-331`:

pipe = create_dali_pipeline(
    batch_size=batch_size,
    shard_id=args.local_rank,
    num_shards=args.world_size,
    pad_last_batch=True,       # Ensure all shards have same number of samples
    is_training=True,
)
pipe.build()

train_loader = DALIClassificationIterator(
    pipe,
    reader_name="Reader",       # Auto-detect shard size for correct epoch length
    last_batch_policy=LastBatchPolicy.PARTIAL,
    auto_reset=True,
)

Last batch policy definitions from `dali/python/nvidia/dali/plugin/base_iterator.py:37-51`:

class LastBatchPolicy(Enum):
    FILL = 0     # Pad incomplete batch by repeating/wrapping samples
    DROP = 1     # Discard the incomplete last batch
    PARTIAL = 2  # Return incomplete batch as-is (fewer samples)
