Implementation:NVIDIA DALI Wds2idx
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Indexing |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Command-line tool that creates an index file from a WebDataset tar archive, enabling random access for DALI's fn.readers.webdataset reader.
Description
This script implements the IndexCreator class, which reads a WebDataset tar archive and produces a structured index file recording, for every sample, the extensions, byte offsets, sizes, and names of its component files within the archive. Unlike the RecordIO indexer, which maps simple key-to-offset pairs, this tool must handle WebDataset's multi-file-per-sample convention, where a single sample consists of multiple files sharing a common basename but differing in extension (e.g., sample001.jpg, sample001.cls, sample001.json).
The class provides two data extraction backends. The primary backend (_get_data_tar) uses the GNU tar utility, running two tar subprocesses in parallel: one with --list --block-num to obtain each entry's block number and name, and one with --verbose --list to obtain each entry's type and size. The fallback backend (_get_data_tarfile) uses Python's built-in tarfile module when the tar utility is not available, though it is substantially slower. Both backends filter out non-regular-file entries (directories, symlinks, etc.) and yield the offset of each file's data rather than of its header. For the tar utility backend this means converting the reported block number to a byte offset (block number times the 512-byte tar block size) and then adding the 512-byte header.
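The fallback idea can be sketched with the standard library alone. The snippet below is a minimal illustration, not the tool's actual code: iter_regular_files and build_demo_archive are hypothetical helper names, and the demo archive is built in memory so the example runs without any files on disk. It relies on tarfile's TarInfo.offset_data attribute, which reports where a member's payload begins.

```python
import io
import tarfile


def iter_regular_files(fileobj):
    """Yield (data_offset, name, size) for each regular file in a tar stream."""
    with tarfile.open(fileobj=fileobj, mode="r:") as archive:
        for member in archive:
            if not member.isfile():  # skip directories, symlinks, devices, ...
                continue
            # offset_data points at the payload, i.e. just past the header block.
            yield member.offset_data, member.name, member.size


def build_demo_archive():
    """Hypothetical helper: an in-memory tar with one directory and two files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        directory = tarfile.TarInfo("data")
        directory.type = tarfile.DIRTYPE
        archive.addfile(directory)
        for name, payload in [("data/s0.jpg", b"x" * 700), ("data/s0.cls", b"3")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


entries = list(iter_regular_files(build_demo_archive()))
for offset, name, size in entries:
    print(offset, name, size)
```

The directory entry is dropped, and every reported offset is a multiple of 512 because tar pads headers and payloads to whole blocks.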
The create_index method aggregates consecutive file entries that share a basename into one sample, using the split_name static method, which splits a file path at the first dot after the last directory separator. The resulting index file begins with a header line containing the format version (v1.2) and the total sample count, followed by one line per sample containing space-separated (extension, offset, size, name) fields for each component file. Hidden files (those with empty basenames or basenames ending in /) are excluded. The command-line interface accepts the archive path and an optional index output path (defaulting to the archive path with its extension replaced by .idx).
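The grouping and line layout described above can be sketched as follows. This is an illustrative reimplementation under the stated rules, not a copy of the tool: build_index_lines is a hypothetical name, the no-dot handling in split_name is an assumption, and the entry tuples are made-up sample data in (offset, name, size) form.

```python
def split_name(filepath):
    """Split at the first dot after the last '/' into (basename, extension)."""
    dot_pos = filepath.find(".", filepath.rfind("/") + 1)
    if dot_pos == -1:  # assumption: a name without a dot has no extension
        return filepath, ""
    return filepath[:dot_pos], filepath[dot_pos + 1:]


def build_index_lines(entries, version="v1.2"):
    """Group consecutive entries by basename and render the index lines."""
    samples = []
    last_basename = None
    for offset, name, size in entries:
        basename, extension = split_name(name)
        # Hidden files yield an empty basename or one ending in '/'.
        if not basename or basename.endswith("/"):
            continue
        if basename != last_basename:
            samples.append([])
            last_basename = basename
        samples[-1].append((extension, offset, size, name))

    lines = [f"{version} {len(samples)}"]
    for components in samples:
        lines.append(
            " ".join(" ".join(str(field) for field in component) for component in components)
        )
    return lines


demo_entries = [
    (1024, "s0.jpg", 700),
    (2560, "s0.cls", 1),
    (3584, ".hidden", 4),
    (4608, "s1.seg.png", 10),
]
index_lines = build_index_lines(demo_entries)
print("\n".join(index_lines))
```

Because the split happens at the first dot, a file such as s1.seg.png belongs to sample s1 with the compound extension seg.png.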
Usage
Run this tool before using DALI's WebDataset reader with random access. The generated index file should be provided to fn.readers.webdataset via the index_paths parameter (one index file per archive listed in paths) to enable efficient random shuffling and sharding.
Code Reference
Source Location
- Repository: NVIDIA_DALI
- File: tools/wds2idx.py
- Lines: 1-225
Signature
class IndexCreator:
    """Reads Webdataset data format, and creates index file
    that enables random access.

    Parameters
    ----------
    uri : str
        Path to the archive file.
    idx_path : str
        Path to the index file, that will be created/overwritten.
    """

    tar_block_size = 512
    index_file_version = "v1.2"

    def __init__(self, uri, idx_path, verbose=True): ...
    def __enter__(self): ...
    def __exit__(self, exc_type, exc_value, exc_traceback): ...
    def open(self): ...
    def close(self): ...
    def reset(self): ...

    @staticmethod
    def split_name(filepath) -> tuple:
        """Splits the webdataset filepath into basename and extension."""
        ...

    def _get_data_tar(self) -> generator:
        """Extract offsets/names/sizes using GNU tar utility."""
        ...

    def _get_data_tarfile(self) -> generator:
        """Fallback: extract offsets/names/sizes using Python tarfile module."""
        ...

    def create_index(self): ...

def parse_args(): ...
def main(): ...
Import
from tools.wds2idx import IndexCreator
# or run directly:
# python tools/wds2idx.py data/train.tar data/train.idx
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| archive | str (CLI arg) | Yes | Path to the WebDataset tar archive (.tar) to index |
| index | str (CLI arg) | No | Path to the output index (.idx) file (defaults to archive basename with .idx extension) |
| verbose | bool | No | Enable progress output (default True) |
Outputs
| Name | Type | Description |
|---|---|---|
| .idx file | text file | Structured index file with version header, sample count, and per-sample component entries containing extension, byte offset, size, and filename |
| Console output | text | Progress messages at regular intervals showing elapsed time, count, and processing stage (collect/index/done) |
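To make the output layout concrete, here is a small reader for an index in the layout described above. It is a sketch under assumptions: parse_index is a hypothetical helper, it only handles the header-plus-sample-lines layout with four fields per component, and it assumes component names contain no whitespace.

```python
def parse_index(text):
    """Parse index text into (version, samples); each sample is a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    version, count = lines[0].split()

    samples = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) % 4 != 0:
            raise ValueError(f"expected groups of 4 fields, got: {line!r}")
        components = [
            {
                "ext": fields[i],
                "offset": int(fields[i + 1]),
                "size": int(fields[i + 2]),
                "name": fields[i + 3],
            }
            for i in range(0, len(fields), 4)
        ]
        samples.append(components)

    if len(samples) != int(count):
        raise ValueError(f"header declares {count} samples, found {len(samples)}")
    return version, samples


demo_text = "v1.2 2\njpg 512 700 s0.jpg cls 2048 1 s0.cls\njpg 3072 10 s1.jpg\n"
version, samples = parse_index(demo_text)
print(version, len(samples), samples[0][1]["name"])
```

Checking the declared sample count against the parsed lines is a cheap way to detect a truncated index file.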
Usage Examples
Command-line Usage
# Create an index file for a WebDataset tar archive
# $ python tools/wds2idx.py data/train.tar data/train.idx
# With auto-generated index path (data/train.idx)
# $ python tools/wds2idx.py data/train.tar
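The two invocations above differ only in whether the index path is given. A command-line front end with that behavior could look like the sketch below. This is not the tool's actual parse_args: parse_cli and default_index_path are hypothetical names, and the rule for deriving the default path (cut at the first dot of the file name) is an assumption that happens to match the data/train.tar example.

```python
import argparse


def default_index_path(archive_path):
    """Assumed rule: drop everything from the first dot of the file name, add '.idx'."""
    dot_pos = archive_path.find(".", archive_path.rfind("/") + 1)
    stem = archive_path if dot_pos == -1 else archive_path[:dot_pos]
    return stem + ".idx"


def parse_cli(argv):
    """Parse a required archive path and an optional index path."""
    parser = argparse.ArgumentParser(
        description="Create an index file for a WebDataset tar archive"
    )
    parser.add_argument("archive", help="path to the .tar archive")
    parser.add_argument("index", nargs="?", default=None, help="optional output .idx path")
    args = parser.parse_args(argv)
    if args.index is None:
        args.index = default_index_path(args.archive)
    return args


explicit = parse_cli(["data/train.tar", "out/custom.idx"])
implicit = parse_cli(["data/train.tar"])
print(explicit.index, implicit.index)
```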
Programmatic Usage
from tools.wds2idx import IndexCreator

with IndexCreator("data/train.tar", "data/train.idx") as ic:
    ic.create_index()

# The resulting index file looks like this (illustrative values; data offsets
# are always multiples of the 512-byte tar block size):
# v1.2 50000
# jpg 1024 45321 sample000.jpg cls 47104 1 sample000.cls
# jpg 48128 38912 sample001.jpg cls 87552 1 sample001.cls
# ...
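The reason the index enables random access is that each component can be fetched with a single seek and read, with no need to walk the tar headers. The sketch below illustrates this on a tiny in-memory archive. read_component is a hypothetical helper, and index_line is written by hand in the per-sample layout shown above rather than produced by the tool.

```python
import io
import tarfile


def read_component(fileobj, offset, size):
    """Read one component's bytes directly, without parsing any tar headers."""
    fileobj.seek(offset)
    return fileobj.read(size)


# Build a two-file archive in memory (stand-in for a real shard on disk).
archive_bytes = io.BytesIO()
with tarfile.open(fileobj=archive_bytes, mode="w", format=tarfile.USTAR_FORMAT) as archive:
    for name, payload in [("s0.jpg", b"fake-jpeg-bytes"), ("s0.cls", b"7")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

# Hand-written index line for this archive: (extension, offset, size, name) groups.
index_line = "jpg 512 15 s0.jpg cls 1536 1 s0.cls"
fields = index_line.split()

recovered = {}
for i in range(0, len(fields), 4):
    extension, offset, size = fields[i], int(fields[i + 1]), int(fields[i + 2])
    recovered[extension] = read_component(archive_bytes, offset, size)

print(recovered)
```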
Using the Index with DALI
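import nvidia.dali as dali

@dali.pipeline_def(batch_size=64, num_threads=8, device_id=0)
def pipeline():
    # The reader returns one output per entry in `ext`, in the same order.
    jpegs, labels = dali.fn.readers.webdataset(
        paths=["data/train.tar"],
        index_paths=["data/train.idx"],
        ext=["jpg", "cls"],
        random_shuffle=True,
    )
    images = dali.fn.decoders.image(jpegs, device="mixed")
    # Components arrive as raw bytes, so this cast only widens the byte values
    # of the .cls file; it does not parse the label text into a number.
    labels = dali.fn.cast(labels, dtype=dali.types.INT64)
    return images, labels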