Implementation:Huggingface Datasets StreamingDownloadManager

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for lazily resolving download and extraction URLs in streaming mode.

Description

StreamingDownloadManager is a drop-in replacement for the standard DownloadManager used when streaming=True. It exposes the same public API -- download(), extract(), download_and_extract(), iter_archive(), and iter_files() -- but none of these methods perform actual file downloads. Instead:

  • download(url_or_urls): Resolves relative paths against base_path and returns absolute URLs unchanged. The result is the URL string (or a structure of URL strings); nothing is fetched.
  • extract(url_or_urls): Detects the compression protocol (zip, gzip, bz2, xz, zstd) and prepends the appropriate fsspec protocol prefix using the :: chained filesystem separator. Raises NotImplementedError for TAR archives (use iter_archive instead).
  • download_and_extract(url_or_urls): Composes extract(download(url_or_urls)).
  • iter_archive(urlpath_or_buf): Returns an ArchiveIterable that yields (path, file_object) pairs from a remote archive.
  • iter_files(urlpaths): Returns a FilesIterable that yields individual file URL paths.
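
The URL rewriting described for download() and extract() can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper names resolve_url and chain_extraction, the extension table, and the example.com URLs are all assumptions made for this example.

```python
import posixpath

# Hypothetical extension-to-protocol table (assumption), limited to the
# compression formats named in the list above.
EXTENSION_TO_PROTOCOL = {"zip": "zip", "gz": "gzip", "bz2": "bz2", "xz": "xz", "zst": "zstd"}
SINGLE_FILE_PROTOCOLS = {"gzip", "bz2", "xz", "zstd"}


def resolve_url(base_path: str, url: str) -> str:
    """Join a relative path onto base_path; leave absolute URLs untouched."""
    if "://" in url or url.startswith("/"):
        return url
    return posixpath.join(base_path.rstrip("/"), url)


def chain_extraction(urlpath: str) -> str:
    """Prefix urlpath with a decompression protocol using the '::' separator."""
    name = posixpath.basename(urlpath.split("::")[0])
    if name.endswith((".tar", ".tgz", ".tar.gz", ".tar.bz2", ".tar.xz")):
        raise NotImplementedError("TAR archives need iteration, not extraction")
    extension = name.rsplit(".", 1)[-1] if "." in name else ""
    protocol = EXTENSION_TO_PROTOCOL.get(extension)
    if protocol is None:
        return urlpath  # not compressed: return the URL as-is
    if protocol in SINGLE_FILE_PROTOCOLS:
        # A single-file format wraps exactly one file: the name minus its suffix.
        inner = name[: name.rindex(".")]
        return f"{protocol}://{inner}::{urlpath}"
    return f"{protocol}://::{urlpath}"


url = resolve_url("https://example.com/data", "train.jsonl.gz")
chained = chain_extraction(url)
```

No bytes are read at any point; the chained string only tells a filesystem layer such as fsspec how to open the file later.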

The class sets is_streaming = True as a class attribute, which dataset builders can check to adjust their behavior.
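
One way a builder might branch on this flag is sketched below. EagerManager, LazyManager, and choose_reader are hypothetical stand-ins written for this illustration, not library classes; only the is_streaming attribute name comes from the description above.

```python
class EagerManager:
    """Stand-in for a manager that downloads files (hypothetical)."""

    is_streaming = False


class LazyManager:
    """Stand-in for a manager that only rewrites URLs (hypothetical)."""

    is_streaming = True


def choose_reader(dl_manager) -> str:
    """Pick a reading strategy from the manager's class attribute."""
    # getattr with a default keeps the check safe for managers lacking the flag.
    if getattr(dl_manager, "is_streaming", False):
        return "iterate remote archive members"
    return "extract archive to local disk"


streaming_choice = choose_reader(LazyManager())
eager_choice = choose_reader(EagerManager())
```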

Usage

Use StreamingDownloadManager when implementing or debugging dataset builders that need to operate in streaming mode. It is automatically instantiated by the library when load_dataset(..., streaming=True) is called; direct instantiation is typically only needed in custom builder implementations.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/download/streaming_download_manager.py
  • Lines: L47-L219

Signature

class StreamingDownloadManager:
    is_streaming = True

    def __init__(
        self,
        dataset_name: Optional[str] = None,
        data_dir: Optional[str] = None,
        download_config: Optional[DownloadConfig] = None,
        base_path: Optional[str] = None,
    ):

Import

from datasets.download.streaming_download_manager import StreamingDownloadManager

I/O Contract

Inputs

Name | Type | Required | Description
dataset_name | Optional[str] | No | Name of the dataset being managed.
data_dir | Optional[str] | No | Path to the data directory (used as manual_dir).
download_config | Optional[DownloadConfig] | No | Configuration for download behavior (authentication, proxies, etc.).
base_path | Optional[str] | No | Base path for resolving relative URLs. Defaults to the current working directory.

Outputs

Name | Type | Description
(instance) | StreamingDownloadManager | A download manager that transforms URLs lazily without downloading files.

Usage Examples

Basic Usage

from datasets.download.streaming_download_manager import StreamingDownloadManager

dl_manager = StreamingDownloadManager(
    base_path="https://huggingface.co/datasets/my_user/my_dataset/resolve/main"
)

# Resolve a download URL (no actual download occurs)
url = dl_manager.download("data/train.jsonl")
# Returns: "https://huggingface.co/datasets/my_user/my_dataset/resolve/main/data/train.jsonl"

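# Resolve extraction (prepends protocol for compressed files).
# Only download() resolves relative paths, so pass extract() the resolved URL.
extracted_url = dl_manager.extract(dl_manager.download("data/train.jsonl.gz"))
# Returns: "gzip://train.jsonl::https://...data/train.jsonl.gz"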

# Iterate over files in a remote archive
archive_url = dl_manager.download("data/train.tar.gz")
for filename, file_obj in dl_manager.iter_archive(archive_url):
    content = file_obj.read()
