Implementation:Huggingface Datasets StreamingDownloadManager

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for lazily resolving download and extraction URLs in streaming mode.

Description

StreamingDownloadManager is a drop-in replacement for the standard DownloadManager used when streaming=True. It exposes the same public API -- download(), extract(), download_and_extract(), iter_archive(), and iter_files() -- but none of these methods downloads or extracts files to local disk up front. Instead, each one rewrites URLs or returns a lazy iterable, and data is read only when it is consumed:

download(url_or_urls): Normalizes URLs by resolving relative paths against the base_path. Absolute URLs are returned unchanged; relative paths are returned joined to base_path.
extract(url_or_urls): Detects the compression protocol (zip, gzip, bz2, xz, zstd) and prepends the appropriate fsspec protocol prefix using the :: chained filesystem separator. Raises NotImplementedError for TAR archives (use iter_archive instead).
download_and_extract(url_or_urls): Composes extract(download(url_or_urls)).
iter_archive(urlpath_or_buf): Returns an ArchiveIterable that yields (path, file_object) pairs from a remote archive.
iter_files(urlpaths): Returns a FilesIterable that yields individual file URL paths.
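
The URL rewriting described in the list above can be illustrated without the library. The following is a simplified, standalone sketch and not the library's implementation: the helper names sketch_download and sketch_extract, the extension table, and the example.com base URL are all assumptions made for illustration, and the sketch looks only at the file extension to pick a protocol.

```python
import os
import posixpath
from urllib.parse import urlparse

# Hypothetical extension-to-protocol table, limited to the formats named above.
EXTENSION_TO_PROTOCOL = {"gz": "gzip", "bz2": "bz2", "xz": "xz", "zst": "zstd", "zip": "zip"}
SINGLE_FILE_PROTOCOLS = {"gzip", "bz2", "xz", "zstd"}
TAR_SUFFIXES = (".tar", ".tgz", ".tar.gz", ".tar.bz2", ".tar.xz")


def sketch_download(url_or_path: str, base_path: str) -> str:
    """Resolve a relative path against base_path; return absolute inputs unchanged."""
    is_relative = urlparse(url_or_path).scheme == "" and not os.path.isabs(url_or_path)
    if is_relative:
        return posixpath.join(base_path, url_or_path)
    return url_or_path


def sketch_extract(url: str) -> str:
    """Prepend a chained-filesystem prefix instead of decompressing anything."""
    path = url.split("::")[0]
    if path.endswith(TAR_SUFFIXES):
        raise NotImplementedError("TAR archives are not handled here; iterate over them instead")
    basename = posixpath.basename(path)
    extension = basename.rsplit(".", 1)[-1] if "." in basename else ""
    protocol = EXTENSION_TO_PROTOCOL.get(extension)
    if protocol is None:
        return url  # not compressed: nothing to rewrite
    if protocol in SINGLE_FILE_PROTOCOLS:
        # A single-file compression wraps exactly one file: name it after the archive.
        inner_file = basename.rsplit(".", 1)[0]
        return f"{protocol}://{inner_file}::{url}"
    return f"{protocol}://::{url}"  # multi-file archive such as zip


base = "https://example.com/datasets/demo"  # placeholder base path
resolved = sketch_download("data/train.jsonl.gz", base)
chained = sketch_extract(resolved)
```

In this sketch, composing the two helpers mirrors download_and_extract: the relative path is resolved first, and only then is the protocol prefix added in front of the :: separator.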

The class sets is_streaming = True as a class attribute, which dataset builders can check to adjust their behavior.
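
As a rough illustration of how a builder might branch on this attribute, the sketch below uses two hypothetical stand-in classes rather than the real download managers, so it runs without the library installed. The function name choose_read_strategy and the returned labels are assumptions for illustration only.

```python
def choose_read_strategy(dl_manager) -> str:
    """Return a label for how a builder might read its data files."""
    # getattr with a default keeps the check safe for objects lacking the attribute.
    if getattr(dl_manager, "is_streaming", False):
        return "iterate-remote"  # e.g. rely on iter_archive() / iter_files()
    return "read-local"  # files are expected to be on local disk


class FakeStreamingManager:
    """Hypothetical stand-in that mimics only the class attribute."""

    is_streaming = True


class FakeLocalManager:
    """Hypothetical stand-in for a non-streaming manager."""

    is_streaming = False


strategies = [
    choose_read_strategy(FakeStreamingManager()),
    choose_read_strategy(FakeLocalManager()),
]
```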

Usage

Use StreamingDownloadManager when implementing or debugging dataset builders that need to operate in streaming mode. It is automatically instantiated by the library when load_dataset(..., streaming=True) is called; direct instantiation is typically only needed in custom builder implementations.

Code Reference

Source Location

Repository: datasets
File: src/datasets/download/streaming_download_manager.py
Lines: L47-L219

Signature

class StreamingDownloadManager:
    is_streaming = True

    def __init__(
        self,
        dataset_name: Optional[str] = None,
        data_dir: Optional[str] = None,
        download_config: Optional[DownloadConfig] = None,
        base_path: Optional[str] = None,
    ):

Import

from datasets.download.streaming_download_manager import StreamingDownloadManager

I/O Contract

Inputs

Name	Type	Required	Description
dataset_name	`Optional[str]`	No	Name of the dataset being managed.
data_dir	`Optional[str]`	No	Path to the data directory (used as `manual_dir`).
download_config	`Optional[DownloadConfig]`	No	Configuration for download behavior (authentication, proxies, etc.).
base_path	`Optional[str]`	No	Base path for resolving relative URLs. Defaults to current working directory.

Outputs

Name	Type	Description
(instance)	`StreamingDownloadManager`	A download manager that transforms URLs lazily without downloading files.

Usage Examples

Basic Usage

from datasets.download.streaming_download_manager import StreamingDownloadManager

dl_manager = StreamingDownloadManager(
    base_path="https://huggingface.co/datasets/my_user/my_dataset/resolve/main"
)

# Resolve a download URL (no actual download occurs)
url = dl_manager.download("data/train.jsonl")
# Returns: "https://huggingface.co/datasets/my_user/my_dataset/resolve/main/data/train.jsonl"

# Resolve extraction (prepends protocol for compressed files).
# download_and_extract() resolves the relative path against base_path first.
extracted_url = dl_manager.download_and_extract("data/train.jsonl.gz")
# Returns: "gzip://train.jsonl::https://...data/train.jsonl.gz"

# Iterate over files in a remote archive
archive_url = dl_manager.download("data/train.tar.gz")
for filename, file_obj in dl_manager.iter_archive(archive_url):
    content = file_obj.read()

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Streaming_Download_Management
