

Implementation:NVIDIA DALI Wds2idx

From Leeroopedia


Knowledge Sources
Domains: Data_Loading, Indexing
Last Updated: 2026-02-08 16:00 GMT

Overview

Command-line tool that creates an index file from a WebDataset tar archive, enabling random access for DALI's fn.readers.webdataset reader.

Description

This script implements the IndexCreator class, which reads a WebDataset tar archive and writes a structured index file mapping each sample's basename to the byte offsets, extensions, and sizes of its component files within the archive. Unlike the RecordIO indexer, which maps simple key-to-offset pairs, this tool must handle WebDataset's multi-file-per-sample convention, in which a single sample consists of several files that share a basename but differ in extension (e.g., sample001.jpg, sample001.cls, sample001.json).

The class provides two data extraction backends. The primary backend (_get_data_tar) runs two GNU tar subprocesses in parallel, one with --list --block-num and one with --verbose --list, and combines their output to obtain each entry's block number, name, type, and size. The fallback backend (_get_data_tarfile) uses Python's built-in tarfile module when the tar utility is not available; it is substantially slower. Both backends filter out non-regular-file entries (directories, symlinks, etc.) and report the offset of each file's data rather than of its header: the header's block number is converted to bytes (512 bytes per block) and the 512-byte header is skipped, i.e. offset = (block number + 1) × 512.
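The fallback approach can be illustrated with a minimal sketch built on the standard tarfile module. This is not the tool's actual source: the helper names iter_regular_files, data_offset_from_block, and build_demo_archive, and the demo file contents, are hypothetical and exist only to show how data offsets relate to 512-byte tar blocks.

```python
import io
import os
import tarfile
import tempfile

TAR_BLOCK_SIZE = 512  # tar headers and file data are aligned to 512-byte blocks


def data_offset_from_block(header_block):
    """Byte offset of a file's data, given the block number of its header."""
    return (header_block + 1) * TAR_BLOCK_SIZE


def iter_regular_files(tar_path):
    """Yield (data_offset, name, size) for every regular file in the archive."""
    with tarfile.open(tar_path, "r:") as archive:
        for member in archive:
            if not member.isfile():  # skip directories, symlinks, etc.
                continue
            # TarInfo.offset is the header position, offset_data the payload position.
            yield member.offset_data, member.name, member.size


def build_demo_archive(tar_path):
    """Write a tiny two-component sample so the sketch can run end to end."""
    demo_files = [("sample000.jpg", b"\xff\xd8fake"), ("sample000.cls", b"7")]
    with tarfile.open(tar_path, "w", format=tarfile.USTAR_FORMAT) as archive:
        for name, payload in demo_files:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


demo_tar = os.path.join(tempfile.mkdtemp(), "demo.tar")
build_demo_archive(demo_tar)
entries = list(iter_regular_files(demo_tar))
for offset, name, size in entries:
    print(offset, name, size)
```

Because the offsets point at the payload itself, a consumer can seek straight to the data and read size bytes without parsing any tar headers.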

The create_index method aggregates file entries by basename using the split_name static method, which splits a file path at the first dot after the last directory separator, so everything after that dot (for example seg.png) counts as the extension. The resulting index file begins with a header line holding the version (v1.2) and the total sample count, followed by one line per sample. Each sample line lists four space-separated fields (extension, offset, size, name) for every component file. Hidden files (those whose basename is empty or ends in /) are excluded. The command-line interface accepts the archive path and an optional index output path, which defaults to the archive basename with an .idx extension.
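The splitting and aggregation rules described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the tool's code: build_index_lines is a hypothetical helper, the entry list is made-up demo data, and consecutive entries with the same basename are assumed to belong to one sample.

```python
def split_name(filepath):
    """Split at the first dot after the last '/' into (basename, extension)."""
    dot_pos = filepath.find(".", filepath.rfind("/") + 1)
    if dot_pos == -1:  # no extension at all (handled explicitly in this sketch)
        return filepath, ""
    return filepath[:dot_pos], filepath[dot_pos + 1:]


def build_index_lines(entries, version="v1.2"):
    """Group (offset, name, size) entries by basename and format index lines."""
    samples = []
    last_basename = None
    for offset, name, size in entries:
        basename, extension = split_name(name)
        # Hidden files have an empty basename or one ending in '/'.
        if not basename or basename.endswith("/"):
            continue
        if basename != last_basename:
            samples.append([])  # a new basename starts a new sample
            last_basename = basename
        samples[-1].append(f"{extension} {offset} {size} {name}")
    header = f"{version} {len(samples)}"
    return [header] + [" ".join(components) for components in samples]


# Made-up entries: data offsets are multiples of the 512-byte tar block size.
demo_entries = [
    (512, "train/000.jpg", 700),
    (2048, "train/000.cls", 1),
    (3072, "train/001.seg.png", 10),
    (4096, "train/001.cls", 1),
    (5120, "train/.hidden", 3),  # skipped: basename is "train/"
]
index_lines = build_index_lines(demo_entries)
print("\n".join(index_lines))
```

Note that a multi-dot name such as train/001.seg.png keeps train/001 as its basename, which is what lets it share a sample with train/001.cls.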

Usage

Run this tool before using DALI's WebDataset reader with random access. Pass the generated index file to fn.readers.webdataset via the index_paths parameter (one index path per archive listed in paths) to enable efficient random shuffling and sharding.

Code Reference

Source Location

Signature

class IndexCreator:
    """Reads Webdataset data format, and creates index file
    that enables random access.

    Parameters
    ----------
    uri : str
        Path to the archive file.
    idx_path : str
        Path to the index file, that will be created/overwritten.
    """

    tar_block_size = 512
    index_file_version = "v1.2"

    def __init__(self, uri, idx_path, verbose=True): ...
    def __enter__(self): ...
    def __exit__(self, exc_type, exc_value, exc_traceback): ...
    def open(self): ...
    def close(self): ...
    def reset(self): ...

    @staticmethod
    def split_name(filepath) -> tuple:
        """Splits the webdataset filepath into basename and extension."""
        ...

    def _get_data_tar(self) -> generator:
        """Extract offsets/names/sizes using GNU tar utility."""
        ...

    def _get_data_tarfile(self) -> generator:
        """Fallback: extract offsets/names/sizes using Python tarfile module."""
        ...

    def create_index(self): ...

def parse_args(): ...
def main(): ...

Import

from tools.wds2idx import IndexCreator
# or run directly:
# python tools/wds2idx.py data/train.tar data/train.idx

I/O Contract

Inputs

Name | Type | Required | Description
archive | str (CLI arg) | Yes | Path to the WebDataset tar archive (.tar) to index
index | str (CLI arg) | No | Path to the output index (.idx) file (defaults to archive basename with .idx extension)
verbose | bool | No | Enable progress output (default True)

Outputs

Name | Type | Description
.idx file | text file | Structured index file with version header, sample count, and per-sample component entries containing extension, byte offset, size, and filename
Console output | text | Progress messages at regular intervals showing elapsed time, count, and processing stage (collect/index/done)

Usage Examples

Command-line Usage

# Create an index file for a WebDataset tar archive
# $ python tools/wds2idx.py data/train.tar data/train.idx

# With auto-generated index path (data/train.idx)
# $ python tools/wds2idx.py data/train.tar
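
The optional second argument can be modelled with a small argparse sketch. It is an assumption about how such a CLI might be wired, not a copy of the script's parse_args: the name parse_args_sketch is hypothetical, and os.path.splitext is used here as one simple way to derive the default .idx path.

```python
import argparse
import os


def parse_args_sketch(argv=None):
    """Parse 'archive [index]' and default the index path to <archive stem>.idx."""
    parser = argparse.ArgumentParser(
        description="Create an index file for a WebDataset tar archive (sketch)."
    )
    parser.add_argument("archive", help="path to the .tar archive")
    parser.add_argument("index", nargs="?", help="path to the output .idx file")
    args = parser.parse_args(argv)
    if args.index is None:
        # Assumption: replace the final extension of the archive path with .idx.
        args.index = os.path.splitext(args.archive)[0] + ".idx"
    return args


print(parse_args_sketch(["data/train.tar"]).index)
```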

Programmatic Usage

from tools.wds2idx import IndexCreator

with IndexCreator("data/train.tar", "data/train.idx") as ic:
    ic.create_index()

# The resulting index file looks like this (illustrative values; data offsets
# are always multiples of the 512-byte tar block size):
# v1.2 50000
# jpg 512 45321 sample000.jpg cls 46592 1 sample000.cls
# jpg 47616 38912 sample001.jpg cls 87040 1 sample001.cls
# ...
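
To show why such an index enables random access, here is a minimal sketch that parses index text and reads one component directly from the archive by seeking to its offset. The helpers parse_index and read_component are hypothetical, the demo archive is generated on the fly, and the parser assumes file names contain no spaces.

```python
import io
import os
import tarfile
import tempfile


def parse_index(text):
    """Return (version, samples); each sample is a list of component dicts."""
    lines = text.strip().splitlines()
    version, count = lines[0].split()
    samples = []
    for line in lines[1:]:
        fields = line.split()
        # Every component contributes four fields: extension, offset, size, name.
        components = [
            {
                "ext": fields[i],
                "offset": int(fields[i + 1]),
                "size": int(fields[i + 2]),
                "name": fields[i + 3],
            }
            for i in range(0, len(fields), 4)
        ]
        samples.append(components)
    if len(samples) != int(count):
        raise ValueError("sample count in header does not match index body")
    return version, samples


def read_component(tar_path, component):
    """Read one component's bytes without scanning the tar headers."""
    with open(tar_path, "rb") as raw:
        raw.seek(component["offset"])
        return raw.read(component["size"])


# Build a tiny demo archive: each header takes one 512-byte block.
payloads = {"sample000.jpg": b"not-a-real-jpeg", "sample000.cls": b"3"}
tar_path = os.path.join(tempfile.mkdtemp(), "demo.tar")
with tarfile.open(tar_path, "w", format=tarfile.USTAR_FORMAT) as archive:
    for name, data in payloads.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))

# Index text matching the demo archive (data at 512 and 1536).
index_text = "v1.2 1\njpg 512 15 sample000.jpg cls 1536 1 sample000.cls\n"
version, samples = parse_index(index_text)
label_bytes = read_component(tar_path, samples[0][1])
print(version, label_bytes)
```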

Using the Index with DALI

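import nvidia.dali as dali

@dali.pipeline_def(batch_size=64, num_threads=8, device_id=0)
def pipeline():
    # The reader returns one output per requested extension, in the order given in ext.
    jpegs, cls_bytes = dali.fn.readers.webdataset(
        paths=["data/train.tar"],
        index_paths=["data/train.idx"],
        ext=["jpg", "cls"],
        random_shuffle=True,
    )
    images = dali.fn.decoders.image(jpegs, device="mixed")
    # cls_bytes holds the raw bytes of the .cls file (typically ASCII text), so this
    # cast only widens the byte values; parse them into a class id downstream.
    labels = dali.fn.cast(cls_bytes, dtype=dali.types.INT64)
    return images, labels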
