Implementation:Huggingface Datasets WebDataset Builder
| Source | src/datasets/packaged_modules/webdataset/webdataset.py (lines 20-130) |
|---|---|
| Domain(s) | Data_Loading, Web_Data |
| Last Updated | 2026-02-14 |
Overview
Description
WebDataset is a packaged dataset builder (subclass of GeneratorBasedBuilder) in the HuggingFace Datasets library for loading datasets stored in the WebDataset TAR archive format. The WebDataset format is a convention where related files (e.g., an image and its label) are grouped in TAR archives with a shared basename prefix and different file extensions.
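As an illustration of the naming convention, the following minimal sketch builds a tiny in-memory shard with Python's standard tarfile module. The helper name build_shard, the sample keys, and the placeholder payloads are assumptions made for this example and are not part of the library.

```python
import io
import tarfile


def build_shard(samples: dict[str, dict[str, bytes]]) -> bytes:
    """Pack {key: {extension: payload}} into an in-memory TAR archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, files in samples.items():
            for extension, payload in files.items():
                # Files of one example share the basename and differ by extension.
                info = tarfile.TarInfo(name=f"{key}.{extension}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


# Placeholder payloads stand in for real JPEG bytes and captions.
shard = build_shard({
    "00000001": {"jpg": b"<jpeg bytes>", "txt": b"A photo of a cat"},
    "00000002": {"jpg": b"<jpeg bytes>", "txt": b"A photo of a dog"},
})

# Read the archive back to list the member names in storage order.
with tarfile.open(fileobj=io.BytesIO(shard)) as tar:
    member_names = tar.getnames()
```

Files belonging to one example sit next to each other in the archive, which lets a reader group them in a single sequential pass.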
The builder parses TAR archives by splitting filenames into a base key and extension, grouping files with the same key into a single example. It performs automatic feature type detection by:
- Inferring Arrow schema from the first few examples (configurable via NUM_EXAMPLES_FOR_FEATURES_INFERENCE = 5).
- Detecting image files by extension and assigning datasets.Image() features.
- Detecting audio files by extension and assigning datasets.Audio() features.
- Detecting video files by extension and assigning datasets.Video() features.
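The extension-based detection listed above can be sketched roughly as follows. This is a simplified approximation: the extension sets are abbreviated, hypothetical stand-ins for the builder's real lists, and feature types are returned as plain strings so the snippet runs without the datasets package installed.

```python
# Hypothetical, abbreviated extension sets; the builder's real lists are longer.
IMAGE_EXTENSIONS = {"png", "jpg", "jpeg", "webp"}
AUDIO_EXTENSIONS = {"wav", "mp3", "flac"}
VIDEO_EXTENSIONS = {"mp4", "mkv", "avi"}


def detect_feature(field_name: str) -> str:
    """Name the feature type suggested by the last extension of a field."""
    extension = field_name.rsplit(".", 1)[-1].lower()
    if extension in IMAGE_EXTENSIONS:
        return "Image"
    if extension in AUDIO_EXTENSIONS:
        return "Audio"
    if extension in VIDEO_EXTENSIONS:
        return "Video"
    # Anything else is left to schema inference from the decoded values.
    return "inferred"


fields = ["jpg", "seg.png", "wav", "mp4", "txt"]
detected = {name: detect_feature(name) for name in fields}
```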
The builder also includes a set of built-in decoders for common file types: text, JSON, NumPy arrays (.npy, .npz), MessagePack, CBOR, PyTorch tensors, and integer class labels. Pickle-based decoders are intentionally excluded for security. Compressed files within the TAR archive are automatically decompressed.
Each example includes two metadata fields: __key__ (the shared basename) and __url__ (the source TAR archive path).
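The grouping step and the two metadata fields can be approximated with the sketch below. It assumes the archive has already been read into (filename, bytes) pairs and that the key is split at the first dot of the basename. The function names split_key and group_examples are hypothetical and the logic is a simplification, not the builder's actual code.

```python
from typing import Iterable, Iterator


def split_key(filename: str) -> tuple[str, str]:
    """Split 'dir/0001.seg.png' into ('dir/0001', 'seg.png')."""
    directory, _, basename = filename.rpartition("/")
    stem, _, extension = basename.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, extension.lower()


def group_examples(tar_path: str, entries: Iterable[tuple[str, bytes]]) -> Iterator[dict]:
    """Group consecutive entries that share a key into one example dict."""
    current: dict = {}
    for filename, payload in entries:
        key, extension = split_key(filename)
        if not extension:
            continue  # Entries without an extension cannot be assigned to a field.
        if current and current["__key__"] != key:
            yield current
            current = {}
        current["__key__"] = key
        current["__url__"] = tar_path
        current[extension] = payload
    if current:
        yield current


entries = [
    ("00000001.jpg", b"\xff\xd8"),
    ("00000001.txt", b"a cat"),
    ("00000002.jpg", b"\xff\xd8"),
    ("00000002.txt", b"a dog"),
]
examples = list(group_examples("shards/shard-0000.tar", entries))
```

Because grouping only compares each entry with the previous key, files of one example must be stored contiguously in the archive.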
Usage
Use the WebDataset builder when loading datasets distributed as TAR archives following the WebDataset naming convention. Common scenarios include:
- Loading large-scale image-text datasets (e.g., LAION, CC12M) distributed as TAR shards.
- Streaming multimodal datasets from the HuggingFace Hub or remote storage.
- Working with datasets originally created for the webdataset Python library.
Code Reference
Source Location
Repository: huggingface/datasets
File: src/datasets/packaged_modules/webdataset/webdataset.py (lines 20-130)
Signature
class WebDataset(datasets.GeneratorBasedBuilder):
    DEFAULT_WRITER_BATCH_SIZE = 100
    IMAGE_EXTENSIONS: list[str]
    AUDIO_EXTENSIONS: list[str]
    VIDEO_EXTENSIONS: list[str]
    DECODERS: dict[str, Callable[[Any], Any]]
    NUM_EXAMPLES_FOR_FEATURES_INFERENCE = 5
Key Methods:
- _get_pipeline_from_tar(cls, tar_path, tar_iterator) (classmethod) -- Parses a TAR archive iterator, groups files by shared basename prefix into examples, applies decoders based on file extension, handles compressed entries, and yields example dictionaries with __key__ and __url__ metadata.
- _info(self) -> DatasetInfo -- Returns an empty DatasetInfo (features are inferred during split generation).
- _split_generators(self, dl_manager) -> list[SplitGenerator] -- Downloads TAR files, infers features from the first few examples by detecting image/audio/video extensions and Arrow schema promotion, and returns split generators with TAR paths and iterators.
- _generate_examples(self, tar_paths, tar_iterators) -> Iterator[tuple[Key, dict]] -- Iterates over all TAR archives, yields decoded examples keyed by (tar_idx, example_idx), wraps image/audio fields in {"path": ..., "bytes": ...} dicts, and fills missing fields with None.
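The post-processing attributed to _generate_examples (filling missing fields and wrapping media bytes) might look roughly like this. The helper finalize_example and its parameters are hypothetical names introduced for illustration. The sketch assumes the media field name doubles as the file extension when building the path.

```python
def finalize_example(example: dict, feature_names: list[str], media_fields: set[str]) -> dict:
    """Fill absent fields with None and wrap raw media bytes in a path/bytes dict."""
    # Every declared feature gets a value; absent ones become None.
    result = {name: example.get(name) for name in feature_names}
    for name in media_fields:
        if isinstance(result.get(name), bytes):
            result[name] = {
                "path": f"{example['__key__']}.{name}",
                "bytes": result[name],
            }
    return result


feature_names = ["__key__", "__url__", "jpg", "txt"]
# This example has no caption file, so "txt" is missing.
raw = {"__key__": "00000003", "__url__": "shards/shard-0000.tar", "jpg": b"\xff\xd8"}
final = finalize_example(raw, feature_names, media_fields={"jpg"})
```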
Import
# Not imported directly; used via load_dataset
from datasets import load_dataset
ds = load_dataset("webdataset", data_files="path/to/archive.tar")
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| data_files | str or list[str] or dict[str, str/list] | Path(s) to TAR archive files. Supports glob patterns and split mappings. |
TAR archive format requirements:
| Requirement | Description |
|---|---|
| File naming | Files within the TAR must follow the pattern {key}.{extension} where files sharing the same {key} are grouped into one example. |
| Consistent structure | The first 5 examples (the ones used for feature inference) must share the same set of extensions. Fields missing from later examples are filled with None. |
| Supported extensions | Images (PNG, JPG, WEBP, etc.), audio (WAV, MP3, FLAC, etc.), video (MP4, MKV, AVI, etc.), text, JSON, NumPy, MessagePack, CBOR, PyTorch tensors. |
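The structure check described in the table can be approximated as follows. This is a minimal sketch under the assumption that examples arrive as dictionaries keyed by field name. The function check_consistent_fields and its error messages are hypothetical and do not reproduce the library's own wording.

```python
from itertools import islice
from typing import Iterable

NUM_EXAMPLES_FOR_FEATURES_INFERENCE = 5


def check_consistent_fields(
    examples: Iterable[dict], limit: int = NUM_EXAMPLES_FOR_FEATURES_INFERENCE
) -> set[str]:
    """Verify that the first `limit` examples expose the same field names."""
    first_examples = list(islice(examples, limit))
    if not first_examples:
        raise ValueError("no examples found in the archive")
    expected = set(first_examples[0])
    for example in first_examples[1:]:
        if set(example) != expected:
            raise ValueError(
                f"inconsistent fields: {sorted(example)} vs {sorted(expected)}"
            )
    return expected


good = [{"__key__": str(i), "jpg": b"", "txt": b""} for i in range(6)]
fields = check_consistent_fields(iter(good))

# The second example lacks "txt", so the check is expected to fail.
bad = [{"__key__": "0", "jpg": b"", "txt": b""}, {"__key__": "1", "jpg": b""}]
try:
    check_consistent_fields(iter(bad))
    rejected = False
except ValueError:
    rejected = True
```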
Outputs
| Output | Type | Description |
|---|---|---|
| Examples | dict | Each example is a dictionary with one key per file extension, plus __key__ (basename) and __url__ (TAR path). |
| Image fields | datasets.Image | Automatically detected image files are wrapped as {"path": ..., "bytes": ...} and decoded via the Image feature. |
| Audio fields | datasets.Audio | Automatically detected audio files are wrapped as {"path": ..., "bytes": ...} and decoded via the Audio feature. |
| Video fields | datasets.Video | Automatically detected video files are wrapped and decoded via the Video feature. |
| Text/JSON/numeric fields | Decoded Python objects | Text is decoded to str, JSON to Python objects, NumPy to arrays, etc. |
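How an extension-to-decoder mapping can turn raw bytes into the Python objects listed in the last row is sketched below. Only three standard-library decoders are shown. The exact keys and callables of the builder's DECODERS dictionary are not reproduced here, so treat the entries and the helper decode_field as assumptions.

```python
import json
from typing import Any, Callable

# Hypothetical subset of a decoder table: extension -> callable on raw bytes.
DECODERS: dict[str, Callable[[bytes], Any]] = {
    "txt": lambda data: data.decode("utf-8"),
    "json": lambda data: json.loads(data),
    "cls": lambda data: int(data.decode("ascii")),
}


def decode_field(field_name: str, payload: bytes) -> Any:
    """Decode a payload by its last extension; unknown types stay as bytes."""
    extension = field_name.rsplit(".", 1)[-1].lower()
    decoder = DECODERS.get(extension)
    return decoder(payload) if decoder else payload


caption = decode_field("txt", b"A photo of a cat")
metadata = decode_field("json", b'{"label": 3}')
label = decode_field("cls", b"7")
image_bytes = decode_field("jpg", b"\xff\xd8")
```

Leaving unknown extensions as raw bytes keeps media payloads intact so that a feature type such as Image can decode them later.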
Usage Examples
Loading a WebDataset from TAR archives
from datasets import load_dataset
# Load a WebDataset with image-text pairs
ds = load_dataset("webdataset", data_files="shards/shard-*.tar")
print(ds["train"].features)
# e.g., {'jpg': Image(), 'txt': Value(dtype='string'), '__key__': Value(dtype='string'), '__url__': Value(dtype='string')}
example = ds["train"][0]
print(example["__key__"]) # e.g., "00000001"
print(example["txt"]) # e.g., "A photo of a cat"
example["jpg"].show() # Display the decoded image
Loading with explicit train/test splits
from datasets import load_dataset
ds = load_dataset("webdataset", data_files={
    "train": "data/train-*.tar",
    "test": "data/test-*.tar",
})
print(f"Train size: {len(ds['train'])}")
print(f"Test size: {len(ds['test'])}")
Streaming a large WebDataset
from datasets import load_dataset
# Stream without downloading the full dataset
ds = load_dataset(
    "webdataset",
    data_files="https://huggingface.co/datasets/user/repo/resolve/main/data/*.tar",
    streaming=True,
)
for example in ds["train"]:
    print(example["__key__"], type(example["jpg"]))
    break
Related Pages
Principles
- WebDataset Building -- Principle for loading TAR-based WebDataset archives with automatic feature detection and multimodal decoding.
Environments
- Huggingface Datasets -- The parent library providing the dataset builder infrastructure.
- Web Data -- Domain context for web-scale data formats and distribution.
Related Implementations
- GeneratorBasedBuilder -- The base class providing the example-by-example generation infrastructure.
- Image, Audio, Video -- Feature types automatically assigned to detected media files.
- StreamingDownloadManager -- Used internally for handling compressed file extraction within TAR entries.