Implementation:Huggingface Datasets WebDataset Builder

From Leeroopedia
Source: src/datasets/packaged_modules/webdataset/webdataset.py (lines 20-130)
Domain(s): Data_Loading, Web_Data
Last Updated: 2026-02-14

Overview

Description

WebDataset is a packaged dataset builder (subclass of GeneratorBasedBuilder) in the HuggingFace Datasets library for loading datasets stored in the WebDataset TAR archive format. The WebDataset format is a convention where related files (e.g., an image and its label) are grouped in TAR archives with a shared basename prefix and different file extensions.
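As a rough illustration of that naming convention, the sketch below writes a two-example shard with Python's standard tarfile module and lists its members. The keys, payloads, and the build_shard helper are hypothetical and are not part of the Datasets library.

```python
import io
import tarfile


def build_shard(samples: dict[str, dict[str, bytes]]) -> bytes:
    """Write {key: {extension: payload}} into an in-memory TAR archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, fields in samples.items():
            for extension, payload in fields.items():
                # Files of one example share the key and differ only in extension.
                info = tarfile.TarInfo(name=f"{key}.{extension}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Hypothetical payloads; a real shard would hold actual image bytes.
shard = build_shard({
    "00000001": {"jpg": b"<jpeg bytes>", "txt": b"A photo of a cat"},
    "00000002": {"jpg": b"<jpeg bytes>", "txt": b"A photo of a dog"},
})

with tarfile.open(fileobj=io.BytesIO(shard)) as tar:
    member_names = tar.getnames()
print(member_names)
```

Writing the files of one example next to each other, as done here, keeps every example contiguous inside the archive.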

The builder parses TAR archives by splitting filenames into a base key and extension, grouping files with the same key into a single example. It performs automatic feature type detection by:

  • Inferring the Arrow schema from the first NUM_EXAMPLES_FOR_FEATURES_INFERENCE examples (a class attribute set to 5).
  • Detecting image files by extension and assigning datasets.Image() features.
  • Detecting audio files by extension and assigning datasets.Audio() features.
  • Detecting video files by extension and assigning datasets.Video() features.
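The grouping and extension-based detection described above can be sketched in plain Python. This is a simplified illustration, not the builder's code: the regular expression (which splits at the first dot of the base name) is an assumption modeled on the WebDataset convention, the three extension sets are short hypothetical stand-ins for IMAGE_EXTENSIONS, AUDIO_EXTENSIONS, and VIDEO_EXTENSIONS, and the sketch assumes that files of one example are adjacent in the archive.

```python
import re
from itertools import groupby
from typing import Optional

# Hypothetical, deliberately short extension sets for illustration only.
IMAGE_EXTS = {"jpg", "jpeg", "png", "webp"}
AUDIO_EXTS = {"wav", "mp3", "flac"}
VIDEO_EXTS = {"mp4", "mkv", "avi"}


def split_key_and_field(path: str) -> tuple[Optional[str], Optional[str]]:
    """Split 'dir/0001.seg.png' into ('dir/0001', 'seg.png'); (None, None) if no dot."""
    match = re.match(r"^((?:.*/|)[^.]+)[.]([^/]*)$", path)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


def group_examples(names: list[str]) -> list[dict]:
    """Group consecutive file names that share a key into one example."""
    parsed = [split_key_and_field(name) for name in names]
    parsed = [(key, field) for key, field in parsed if key is not None]
    examples = []
    for key, items in groupby(parsed, key=lambda pair: pair[0]):
        examples.append({"__key__": key, "fields": [field.lower() for _, field in items]})
    return examples


def detect_feature(field: str) -> str:
    """Pick a feature kind from the last extension of a field name."""
    extension = field.rsplit(".", 1)[-1]
    if extension in IMAGE_EXTS:
        return "Image"
    if extension in AUDIO_EXTS:
        return "Audio"
    if extension in VIDEO_EXTS:
        return "Video"
    return "other"  # left to Arrow schema inference


names = ["0001.jpg", "0001.txt", "0002.jpg", "0002.txt", "README"]
examples = group_examples(names)
features = {field: detect_feature(field) for field in examples[0]["fields"]}
print(examples)
print(features)
```

Names without a dot, such as README here, produce no key and are skipped in this sketch.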

The builder also includes a set of built-in decoders for common file types: text, JSON, NumPy arrays (.npy, .npz), MessagePack, CBOR, PyTorch tensors, and integer class labels. Pickle-based decoders are intentionally excluded for security. Compressed files within the TAR archive are automatically decompressed.
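A decoder table of this kind can be pictured as a mapping from file extension to a callable. The sketch below is a minimal stand-in under stated assumptions: the three entries are a hypothetical subset, only gzip compression is handled, and decode_fields is an illustrative helper rather than a library function.

```python
import gzip
import json

# Hypothetical subset of an extension-to-decoder table (no pickle-based entries).
DECODERS = {
    "txt": lambda data: data.decode("utf-8"),
    "json": lambda data: json.loads(data),
    "cls": lambda data: int(data),
}


def decode_fields(example: dict[str, bytes]) -> dict[str, object]:
    """Decode each field by its extension; unknown extensions stay as raw bytes."""
    decoded = {}
    for field, raw in example.items():
        parts = field.split(".")
        # Assumption: only gzip is unwrapped here, then the inner extension is used.
        if parts[-1] == "gz" and len(parts) > 1:
            raw = gzip.decompress(raw)
            parts = parts[:-1]
        decoder = DECODERS.get(parts[-1])
        decoded[field] = decoder(raw) if decoder else raw
    return decoded


result = decode_fields({
    "txt": b"A photo of a cat",
    "json": b'{"width": 640}',
    "cls": b"3",
    "jpg": b"\xff\xd8",
    "txt.gz": gzip.compress(b"compressed caption"),
})
print(result)
```

Fields with no matching decoder, such as the jpg entry, are passed through untouched so that a media feature can decode them later.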

Each example includes two metadata fields: __key__ (the shared basename) and __url__ (the source TAR archive path).

Usage

Use the WebDataset builder when loading datasets distributed as TAR archives following the WebDataset naming convention. Common scenarios include:

  • Loading large-scale image-text datasets (e.g., LAION, CC12M) distributed as TAR shards.
  • Streaming multimodal datasets from the HuggingFace Hub or remote storage.
  • Working with datasets originally created for the webdataset Python library.

Code Reference

Source Location

Repository: huggingface/datasets

File: src/datasets/packaged_modules/webdataset/webdataset.py (lines 20-130)

Signature

class WebDataset(datasets.GeneratorBasedBuilder):
    DEFAULT_WRITER_BATCH_SIZE = 100
    IMAGE_EXTENSIONS: list[str]
    AUDIO_EXTENSIONS: list[str]
    VIDEO_EXTENSIONS: list[str]
    DECODERS: dict[str, Callable[[Any], Any]]
    NUM_EXAMPLES_FOR_FEATURES_INFERENCE = 5

Key Methods:

  • _get_pipeline_from_tar(cls, tar_path, tar_iterator) (classmethod) -- Parses a TAR archive iterator, groups files by shared basename prefix into examples, applies decoders based on file extension, handles compressed entries, and yields example dictionaries with __key__ and __url__ metadata.
  • _info(self) -> DatasetInfo -- Returns an empty DatasetInfo (features are inferred during split generation).
  • _split_generators(self, dl_manager) -> list[SplitGenerator] -- Downloads TAR files, infers features from the first few examples by detecting image/audio/video extensions and Arrow schema promotion, and returns split generators with TAR paths and iterators.
  • _generate_examples(self, tar_paths, tar_iterators) -> Iterator[tuple[Key, dict]] -- Iterates over all TAR archives, yields decoded examples keyed by (tar_idx, example_idx), wraps image/audio fields in {"path": ..., "bytes": ...} dicts, and fills missing fields with None.
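The post-processing attributed to _generate_examples above (filling missing fields and wrapping media bytes) can be sketched without the library. The helper names, the sample data, and the exact shape of the "path" value are assumptions made for illustration.

```python
from typing import Any, Iterable, Iterator


def finalize_example(
    example: dict[str, Any], all_fields: list[str], media_fields: list[str]
) -> dict[str, Any]:
    """Fill absent fields with None and wrap media bytes in a path/bytes dict."""
    out = {name: example.get(name) for name in all_fields}
    for name in media_fields:
        if out[name] is not None:
            # Assumed path shape: "<key>.<field name>".
            out[name] = {"path": f"{example['__key__']}.{name}", "bytes": out[name]}
    return out


def generate_examples(
    shards: Iterable[Iterable[dict[str, Any]]],
    all_fields: list[str],
    media_fields: list[str],
) -> Iterator[tuple[tuple[int, int], dict[str, Any]]]:
    """Yield ((tar_idx, example_idx), example) pairs across all shards."""
    for tar_idx, examples in enumerate(shards):
        for example_idx, example in enumerate(examples):
            yield (tar_idx, example_idx), finalize_example(example, all_fields, media_fields)


all_fields = ["jpg", "txt", "__key__", "__url__"]
shards = [
    [{"__key__": "0001", "__url__": "a.tar", "jpg": b"\xff\xd8", "txt": "cat"}],
    [{"__key__": "0002", "__url__": "b.tar", "jpg": b"\xff\xd8"}],  # caption missing
]
rows = list(generate_examples(shards, all_fields, media_fields=["jpg"]))
for key, row in rows:
    print(key, row)
```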

Import

# Not imported directly; used via load_dataset
from datasets import load_dataset

ds = load_dataset("webdataset", data_files="path/to/archive.tar")

I/O Contract

Inputs

Parameter | Type | Description
data_files | str or list[str] or dict[str, str/list] | Path(s) to TAR archive files. Supports glob patterns and split mappings.

TAR archive format requirements:

Requirement | Description
File naming | Files within the TAR must follow the pattern {key}.{extension}; files sharing the same {key} are grouped into one example.
Consistent structure | The first 5 examples (the ones used for feature inference) must all have the same set of extensions. In later examples, a missing field is filled with None.
Supported extensions | Images (PNG, JPG, WEBP, etc.), audio (WAV, MP3, FLAC, etc.), video (MP4, MKV, AVI, etc.), text, JSON, NumPy, MessagePack, CBOR, PyTorch tensors.
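The consistency requirement can be sketched as a check over the leading examples only. The function name and the sample data below are hypothetical; the sketch assumes the check compares the field names of each leading example against those of the first one.

```python
from itertools import islice
from typing import Any, Iterable

NUM_EXAMPLES = 5  # mirrors NUM_EXAMPLES_FOR_FEATURES_INFERENCE


def check_leading_examples(
    examples: Iterable[dict[str, Any]], num_examples: int = NUM_EXAMPLES
) -> list[str]:
    """Return the shared field names of the first few examples, or raise ValueError."""
    first = list(islice(examples, num_examples))
    if not first:
        raise ValueError("no examples found")
    if any(example.keys() != first[0].keys() for example in first):
        raise ValueError("leading examples do not share the same fields")
    return list(first[0].keys())


good = [{"__key__": str(i), "jpg": b"", "txt": b""} for i in range(6)]
bad = good[:2] + [{"__key__": "x", "jpg": b""}]    # mismatch within the first 5
late = good[:5] + [{"__key__": "x", "jpg": b""}]   # mismatch after the first 5

fields = check_leading_examples(iter(good))
late_fields = check_leading_examples(iter(late))   # passes: only 5 are inspected
try:
    check_leading_examples(iter(bad))
    bad_rejected = False
except ValueError:
    bad_rejected = True
print(fields, late_fields, bad_rejected)
```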

Outputs

Output | Type | Description
Examples | dict | Each example is a dictionary with one key per file extension, plus __key__ (basename) and __url__ (TAR path).
Image fields | datasets.Image | Automatically detected image files are wrapped as {"path": ..., "bytes": ...} and decoded via the Image feature.
Audio fields | datasets.Audio | Automatically detected audio files are wrapped as {"path": ..., "bytes": ...} and decoded via the Audio feature.
Video fields | datasets.Video | Automatically detected video files are wrapped and decoded via the Video feature.
Text/JSON/numeric fields | Decoded Python objects | Text is decoded to str, JSON to Python objects, NumPy to arrays, etc.

Usage Examples

Loading a WebDataset from TAR archives

from datasets import load_dataset

# Load a WebDataset with image-text pairs
ds = load_dataset("webdataset", data_files="shards/shard-*.tar")  # glob pattern; brace ranges like {0000..0099} are not expanded
print(ds["train"].features)
# e.g., {'jpg': Image(), 'txt': Value(dtype='string'), '__key__': Value(dtype='string'), '__url__': Value(dtype='string')}

example = ds["train"][0]
print(example["__key__"])   # e.g., "00000001"
print(example["txt"])       # e.g., "A photo of a cat"
example["jpg"].show()       # Display the decoded image

Loading with explicit train/test splits

from datasets import load_dataset

ds = load_dataset("webdataset", data_files={
    "train": "data/train-*.tar",
    "test": "data/test-*.tar",
})
print(f"Train size: {len(ds['train'])}")
print(f"Test size: {len(ds['test'])}")

Streaming a large WebDataset

from datasets import load_dataset

# Stream without downloading the full dataset
ds = load_dataset(
    "webdataset",
    data_files="hf://datasets/user/repo/data/*.tar",  # hf:// paths support glob patterns
    streaming=True,
)
for example in ds["train"]:
    print(example["__key__"], type(example["jpg"]))
    break

Related Pages

Principles

  • WebDataset Building -- Principle for loading TAR-based WebDataset archives with automatic feature detection and multimodal decoding.

Environments

  • Huggingface Datasets -- The parent library providing the dataset builder infrastructure.
  • Web Data -- Domain context for web-scale data formats and distribution.

Related Implementations

  • GeneratorBasedBuilder -- The base class providing the example-by-example generation infrastructure.
  • Image, Audio, Video -- Feature types automatically assigned to detected media files.
  • StreamingDownloadManager -- Used internally for handling compressed file extraction within TAR entries.
