

Principle:NVIDIA DALI File Reading

From Leeroopedia


Knowledge Sources

Domains: Data_Pipeline, File_IO, Distributed_Computing
Last Updated: 2026-02-08 00:00 GMT

Overview

Efficient, shard-aware file reading from disk as the entry point of a DALI data pipeline, providing parallel I/O with built-in support for distributed training, shuffling, and epoch management.

Description

File reading in DALI serves as the data source operator that feeds raw encoded data (such as JPEG byte buffers) into the rest of the preprocessing pipeline. Unlike traditional data loaders that rely on Python-level iteration and the GIL-constrained multiprocessing model, DALI's file reader operates at the C++ level with native threading, bypassing Python overhead entirely.

The file reader is designed for datasets organized in an ImageNet-style directory structure, where each subdirectory under the root represents a class label. It automatically discovers files, assigns integer labels based on sorted directory names, and provides both the raw file contents and the corresponding labels as output DataNodes.
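The labeling rule can be illustrated with a short, self-contained sketch. Note this is not DALI code; it is a hypothetical pure-Python reproduction of the convention described above, in which subdirectories of the root are sorted by name and assigned consecutive integer labels starting at 0.

```python
import os
import tempfile

def infer_labels(file_root):
    """Reproduce the sorted-directory labeling convention (illustrative only)."""
    classes = sorted(
        d for d in os.listdir(file_root)
        if os.path.isdir(os.path.join(file_root, d))
    )
    return {cls: idx for idx, cls in enumerate(classes)}

# Build a tiny ImageNet-style tree: one folder per class.
root = tempfile.mkdtemp()
for cls in ["dog", "cat", "bird"]:
    os.makedirs(os.path.join(root, cls))

print(infer_labels(root))  # {'bird': 0, 'cat': 1, 'dog': 2}
```

Because labels follow sorted directory names, adding or renaming a class directory shifts the label assignment, so label mappings should be recorded alongside trained checkpoints.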

A critical aspect of the file reader is its native support for data sharding in distributed training. Each reader instance is configured with a shard_id and num_shards, ensuring that each GPU process reads a disjoint subset of the data without requiring external coordination. Combined with random_shuffle for training randomization and pad_last_batch for uniform batch sizes across shards, the reader handles the full complexity of distributed data partitioning within a single operator.

The reader also exposes a name parameter that serves as a handle for querying epoch progress and dataset size from the pipeline, which is essential for integration with iterator wrappers like the DALIClassificationIterator.
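The parameters above come together in a single reader call. The following is a minimal sketch, assuming the `nvidia-dali` package is installed and an ImageNet-style dataset exists at `./train` (the path, batch size, and thread count are placeholder values):

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def reader_pipeline(shard_id, num_shards):
    # Raw encoded file bytes and integer labels are produced as DataNodes.
    jpegs, labels = fn.readers.file(
        file_root="./train",     # one subdirectory per class
        shard_id=shard_id,       # this process's shard index
        num_shards=num_shards,   # total number of distributed processes
        random_shuffle=True,     # reshuffle between epochs
        pad_last_batch=True,     # uniform batch count on every shard
        name="Reader",           # handle for epoch-size queries
    )
    return jpegs, labels
```

In a distributed launch, `shard_id` and `num_shards` would typically be filled from the process rank and world size so that each GPU reads a disjoint subset.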

Usage

Use this principle when:

  • Loading image datasets organized in ImageNet-style class subdirectories (one folder per class)
  • Setting up data loading for multi-GPU or multi-node distributed training where each process must read a unique data shard
  • Needing per-epoch shuffling of training data without external shuffle logic
  • Requiring consistent batch sizes across all distributed workers via last-batch padding
  • Building a DALI pipeline that needs to report epoch size and progress to an external training loop
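An end-to-end sketch covering these cases might look like the following. It assumes `nvidia-dali` with the PyTorch plugin is installed and a dataset exists at `./train`; the decode and resize steps, sizes, and names are illustrative rather than prescriptive:

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.pytorch import DALIClassificationIterator, LastBatchPolicy

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def train_pipeline(shard_id, num_shards):
    jpegs, labels = fn.readers.file(
        file_root="./train", shard_id=shard_id, num_shards=num_shards,
        random_shuffle=True, pad_last_batch=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")   # CPU parse, GPU decode
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = train_pipeline(shard_id=0, num_shards=1)
pipe.build()

# reader_name ties the iterator to the reader so it can query epoch size
# and progress; FILL matches pad_last_batch on the reader side.
loader = DALIClassificationIterator(
    pipe, reader_name="Reader", last_batch_policy=LastBatchPolicy.FILL)

for batch in loader:
    images, labels = batch[0]["data"], batch[0]["label"]
    # ... training step ...
    break
```

Passing `reader_name` lets the iterator size itself from the pipeline rather than requiring the epoch size to be supplied manually.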

Theoretical Basis

Sharded reading partitions the dataset across N workers such that each worker processes approximately 1/N of the data. This is the standard approach for data-parallel distributed training, where each GPU trains on different data but synchronizes gradients. The file reader implements this partitioning at the I/O level, which is more efficient than reading all data and discarding non-local samples.
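The partitioning arithmetic can be sketched in a few lines. This is an illustrative even contiguous split, not DALI's internal implementation:

```python
def shard_range(dataset_size, num_shards, shard_id):
    """Shard i covers [floor(i*N/S), floor((i+1)*N/S)) -- disjoint and exhaustive."""
    start = dataset_size * shard_id // num_shards
    end = dataset_size * (shard_id + 1) // num_shards
    return start, end

# 10 samples over 3 shards -> shard sizes 3, 3, 4, covering all data exactly once.
ranges = [shard_range(10, 3, i) for i in range(3)]
print(ranges)  # [(0, 3), (3, 6), (6, 10)]
```

The floor-division form guarantees that consecutive shards tile the dataset with no gaps or overlaps, even when the size is not divisible by the shard count.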

Shuffling at the reader level ensures that each epoch presents samples in a different random order, which is important for stochastic gradient descent convergence. By performing shuffling within the native reader rather than in Python, the randomization does not add Python-level overhead.

Last-batch padding addresses the common problem of uneven data partitioning: when the dataset size is not evenly divisible by the number of shards times the batch size, some shards would produce fewer samples in the final batch. Padding ensures all shards produce identically-sized batches, which is required for synchronized distributed training where all workers must execute the same number of iterations per epoch.
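The resulting iteration count can be worked through numerically. This is illustrative arithmetic for the padding behavior described above, not DALI internals:

```python
import math

def iterations_per_epoch(dataset_size, num_shards, batch_size):
    """With last-batch padding, every shard runs as many iterations as the
    largest shard needs, repeating trailing samples to fill the final batch."""
    largest_shard = math.ceil(dataset_size / num_shards)
    return math.ceil(largest_shard / batch_size)

# 1000 samples, 3 shards, batch size 32:
# largest shard holds ceil(1000/3) = 334 samples -> ceil(334/32) = 11 iterations.
print(iterations_per_epoch(1000, 3, 32))  # 11
```

Because every shard executes the same 11 iterations, gradient synchronization points line up across workers and no process stalls waiting for a shorter peer.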
