Implementation:Huggingface Datasets WebDataset Builder

Source	src/datasets/packaged_modules/webdataset/webdataset.py (lines 20-130)
Domain(s)	Data_Loading, Web_Data
Last Updated	2026-02-14

Overview

Description

WebDataset is a packaged dataset builder (subclass of GeneratorBasedBuilder) in the HuggingFace Datasets library for loading datasets stored in the WebDataset TAR archive format. The WebDataset format is a convention where related files (e.g., an image and its label) are grouped in TAR archives with a shared basename prefix and different file extensions.

The builder parses TAR archives by splitting filenames into a base key and extension, grouping files with the same key into a single example. It performs automatic feature type detection by:

Inferring an Arrow schema from the first examples of the dataset (5 by default, set by the class attribute NUM_EXAMPLES_FOR_FEATURES_INFERENCE).
Detecting image files by extension and assigning datasets.Image() features.
Detecting audio files by extension and assigning datasets.Audio() features.
Detecting video files by extension and assigning datasets.Video() features.

The builder also includes a set of built-in decoders for common file types: text, JSON, NumPy arrays (.npy, .npz), MessagePack, CBOR, PyTorch tensors, and integer class labels. Pickle-based decoders are intentionally excluded for security. Compressed files within the TAR archive are automatically decompressed.
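
As a rough illustration of extension-based decoding, the sketch below maps a few extensions to decoder functions and leaves unknown extensions as raw bytes. It is not the builder's actual DECODERS table: the name SKETCH_DECODERS, the helper decode_field, and the choice of the "txt", "json", and "cls" keys are assumptions made for this example.

```python
import json
from typing import Any, Callable

# Hypothetical decoder table for illustration; the real DECODERS mapping
# covers more formats (NumPy, MessagePack, CBOR, PyTorch tensors, ...).
SKETCH_DECODERS: dict[str, Callable[[bytes], Any]] = {
    "txt": lambda data: data.decode("utf-8"),
    "json": lambda data: json.loads(data.decode("utf-8")),
    "cls": lambda data: int(data.decode("utf-8").strip()),
}


def decode_field(field_name: str, data: bytes) -> Any:
    """Decode raw bytes based on the last extension of the field name."""
    extension = field_name.rsplit(".", 1)[-1].lower()
    decoder = SKETCH_DECODERS.get(extension)
    # Fields without a registered decoder (e.g. images) stay as raw bytes.
    return decoder(data) if decoder is not None else data


caption = decode_field("txt", b"A photo of a cat")
metadata = decode_field("json", b'{"width": 640}')
label = decode_field("cls", b"7\n")
raw_image = decode_field("jpg", b"\xff\xd8")
```

Leaving unrecognized extensions as bytes mirrors the idea that media files are handled later by feature types rather than by a decoder.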

Each example includes two metadata fields: __key__ (the shared basename) and __url__ (the source TAR archive path).
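
The grouping of TAR members into examples can be sketched with the standard library alone. This is a simplified approximation, not the builder's code: it reads the archive with tarfile instead of the iterator the builder receives, it assumes the key is everything before the first dot in the member name, and the helpers make_tar and group_examples are hypothetical names used only here.

```python
import io
import tarfile
from typing import Iterator


def make_tar(files: dict[str, bytes]) -> bytes:
    """Build an in-memory TAR archive from a name -> content mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def group_examples(tar_bytes: bytes, url: str) -> Iterator[dict]:
    """Yield one dict per run of consecutive members sharing a key."""
    current: dict = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile() or "." not in member.name:
                continue
            # Assumption of this sketch: split at the first dot.
            key, extension = member.name.split(".", 1)
            if current and current["__key__"] != key:
                yield current
                current = {}
            current["__key__"] = key
            current["__url__"] = url
            current[extension] = tar.extractfile(member).read()
    if current:
        yield current


tar_bytes = make_tar({
    "0001.jpg": b"\xff\xd8",
    "0001.txt": b"a cat",
    "0002.jpg": b"\xff\xd8",
    "0002.txt": b"a dog",
})
examples = list(group_examples(tar_bytes, "shard-0000.tar"))
```

Because grouping relies on consecutive members, files belonging to one example need to be stored next to each other in the archive for this sketch to work.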

Usage

Use the WebDataset builder when loading datasets distributed as TAR archives following the WebDataset naming convention. Common scenarios include:

Loading large-scale image-text datasets (e.g., LAION, CC12M) distributed as TAR shards.
Streaming multimodal datasets from the HuggingFace Hub or remote storage.
Working with datasets originally created for the webdataset Python library.

Code Reference

Source Location

Repository: huggingface/datasets

File: src/datasets/packaged_modules/webdataset/webdataset.py (lines 20-130)

Signature

class WebDataset(datasets.GeneratorBasedBuilder):
    DEFAULT_WRITER_BATCH_SIZE = 100
    IMAGE_EXTENSIONS: list[str]
    AUDIO_EXTENSIONS: list[str]
    VIDEO_EXTENSIONS: list[str]
    DECODERS: dict[str, Callable[[Any], Any]]
    NUM_EXAMPLES_FOR_FEATURES_INFERENCE = 5

Key Methods:

_get_pipeline_from_tar(cls, tar_path, tar_iterator) (classmethod) -- Parses a TAR archive iterator, groups files by shared basename prefix into examples, applies decoders based on file extension, handles compressed entries, and yields example dictionaries with __key__ and __url__ metadata.
_info(self) -> DatasetInfo -- Returns an empty DatasetInfo (features are inferred during split generation).
_split_generators(self, dl_manager) -> list[SplitGenerator] -- Downloads TAR files, infers features from the first few examples (combining their Arrow schemas through schema promotion, then mapping image/audio/video extensions to the matching feature types), and returns split generators with TAR paths and iterators.
_generate_examples(self, tar_paths, tar_iterators) -> Iterator[tuple[Key, dict]] -- Iterates over all TAR archives, yields decoded examples keyed by (tar_idx, example_idx), wraps image/audio fields in {"path": ..., "bytes": ...} dicts, and fills missing fields with None.
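
The post-processing described for _generate_examples can be approximated as below. This is a sketch under stated assumptions rather than the method itself: finalize_example is a hypothetical helper, and building the "path" value as key plus field name is an assumption made for illustration.

```python
from typing import Any


def finalize_example(
    example: dict[str, Any],
    all_fields: list[str],
    media_fields: list[str],
) -> dict[str, Any]:
    """Fill missing fields with None and wrap media bytes in a dict."""
    # Every field known from the inferred features appears in the output.
    finalized = {name: example.get(name) for name in all_fields}
    for name in media_fields:
        if finalized[name] is not None:
            finalized[name] = {
                # Assumed path layout for this sketch: "<key>.<field>".
                "path": f"{finalized['__key__']}.{name}",
                "bytes": finalized[name],
            }
    return finalized


raw = {"__key__": "0001", "__url__": "shard-0000.tar", "jpg": b"\xff\xd8"}
result = finalize_example(
    raw,
    all_fields=["__key__", "__url__", "jpg", "txt"],
    media_fields=["jpg"],
)
```

Here the example has no "txt" member, so that field comes back as None while the image bytes are wrapped for later decoding.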

Import

# Not imported directly; used via load_dataset
from datasets import load_dataset

ds = load_dataset("webdataset", data_files="path/to/archive.tar")

I/O Contract

Inputs

Parameter	Type	Description
`data_files`	`str`, `list[str]`, or `dict[str, str | list[str]]`	Path(s) to TAR archive files. Supports glob patterns and split mappings.

TAR archive format requirements:

Requirement	Description
File naming	Files within the TAR must follow the pattern `{key}.{extension}` where files sharing the same `{key}` are grouped into one example.
Consistent structure	Examples should share the same set of extensions. The schema is inferred from the first 5 examples, and any field missing from a later example is filled with `None`.
Supported extensions	Images (PNG, JPG, WEBP, etc.), audio (WAV, MP3, FLAC, etc.), video (MP4, MKV, AVI, etc.), text, JSON, NumPy, MessagePack, CBOR, PyTorch tensors.
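
The extension-based media detection summarized in this table can be sketched as a simple lookup. The extension sets below are small illustrative subsets, not the builder's full IMAGE_EXTENSIONS, AUDIO_EXTENSIONS, and VIDEO_EXTENSIONS lists, and media_kind is a hypothetical helper that returns a label where the builder would assign a feature such as datasets.Image().

```python
from typing import Optional

# Illustrative subsets only; the builder's real lists are longer.
IMAGE_EXTS = {"png", "jpg", "jpeg", "webp"}
AUDIO_EXTS = {"wav", "mp3", "flac"}
VIDEO_EXTS = {"mp4", "mkv", "avi"}


def media_kind(field_name: str) -> Optional[str]:
    """Classify a field by its last extension, or return None."""
    # Lowercasing is a choice of this sketch, not confirmed builder behavior.
    extension = field_name.rsplit(".", 1)[-1].lower()
    if extension in IMAGE_EXTS:
        return "image"
    if extension in AUDIO_EXTS:
        return "audio"
    if extension in VIDEO_EXTS:
        return "video"
    return None


kinds = {name: media_kind(name) for name in ["jpg", "seg.png", "flac", "mp4", "txt"]}
```

Fields that map to None here would be left to the inferred Arrow schema rather than a media feature type.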

Outputs

Output	Type	Description
Examples	`dict`	Each example is a dictionary with one key per file extension, plus `__key__` (basename) and `__url__` (TAR path).
Image fields	`datasets.Image`	Automatically detected image files are wrapped as `{"path": ..., "bytes": ...}` and decoded via the `Image` feature.
Audio fields	`datasets.Audio`	Automatically detected audio files are wrapped as `{"path": ..., "bytes": ...}` and decoded via the `Audio` feature.
Video fields	`datasets.Video`	Automatically detected video files are wrapped and decoded via the `Video` feature.
Text/JSON/numeric fields	Decoded Python objects	Text is decoded to `str`, JSON to Python objects, NumPy to arrays, etc.

Usage Examples

Loading a WebDataset from TAR archives

from datasets import load_dataset

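# Load a WebDataset with image-text pairs.
# Use a glob pattern here; brace ranges such as {0000..0099} belong to the
# webdataset library's shard syntax rather than to data_files patterns.
ds = load_dataset("webdataset", data_files="shards/shard-*.tar")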
print(ds["train"].features)
# e.g., {'jpg': Image(), 'txt': Value(dtype='string'), '__key__': Value(dtype='string'), '__url__': Value(dtype='string')}

example = ds["train"][0]
print(example["__key__"])   # e.g., "00000001"
print(example["txt"])       # e.g., "A photo of a cat"
example["jpg"].show()       # Display the decoded image

Loading with explicit train/test splits

from datasets import load_dataset

ds = load_dataset("webdataset", data_files={
    "train": "data/train-*.tar",
    "test": "data/test-*.tar",
})
print(f"Train size: {len(ds['train'])}")
print(f"Test size: {len(ds['test'])}")

Streaming a large WebDataset

from datasets import load_dataset

# Stream without downloading the full dataset
ds = load_dataset(
    "webdataset",
    data_files="hf://datasets/user/repo/data/*.tar",
    streaming=True,
)
for example in ds["train"]:
    print(example["__key__"], type(example["jpg"]))
    break

Related Pages

Principles

WebDataset Building -- Principle for loading TAR-based WebDataset archives with automatic feature detection and multimodal decoding.

Environments

Huggingface Datasets -- The parent library providing the dataset builder infrastructure.
Web Data -- Domain context for web-scale data formats and distribution.

Related Implementations

GeneratorBasedBuilder -- The base class providing the example-by-example generation infrastructure.
Image, Audio, Video -- Feature types automatically assigned to detected media files.
StreamingDownloadManager -- Used internally for handling compressed file extraction within TAR entries.
