Implementation:Huggingface Datasets StreamingDownloadManager
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for lazily resolving download and extraction URLs in streaming mode.
Description
StreamingDownloadManager is a drop-in replacement for the standard DownloadManager used when streaming=True. It exposes the same public API -- download(), extract(), download_and_extract(), iter_archive(), and iter_files() -- but none of these methods perform actual file downloads. Instead:
- download(url_or_urls): Normalizes URLs by resolving relative paths against the base_path. Returns the URL string(s) unchanged (or resolved).
- extract(url_or_urls): Detects the compression protocol (zip, gzip, bz2, xz, zstd) and prepends the appropriate fsspec protocol prefix using the :: chained filesystem separator. Raises NotImplementedError for TAR archives (use iter_archive instead).
- download_and_extract(url_or_urls): Composes extract(download(url_or_urls)).
- iter_archive(urlpath_or_buf): Returns an ArchiveIterable that yields (path, file_object) pairs from a remote archive.
- iter_files(urlpaths): Returns a FilesIterable that yields individual file URL paths.
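The URL transformations described above can be illustrated with a small standalone sketch. It does not use the library's code: resolve_url, chain_extraction, and the extension-to-protocol mapping are hypothetical helpers written for illustration, and the zip prefix form is an assumption based on the fsspec chaining syntax. The library's own detection logic may differ.

```python
import posixpath

# Hypothetical extension-to-protocol mapping (an assumption for this sketch;
# the library's own detection may differ).
SINGLE_FILE_PROTOCOLS = {".gz": "gzip", ".bz2": "bz2", ".xz": "xz", ".zst": "zstd"}
TAR_SUFFIXES = (".tar", ".tgz", ".tar.gz", ".tar.bz2", ".tar.xz")


def resolve_url(base_path: str, url: str) -> str:
    """Join a relative path onto base_path; leave URLs with a scheme unchanged."""
    if "://" in url:
        return url
    return f"{base_path.rstrip('/')}/{url.lstrip('/')}"


def chain_extraction(url: str) -> str:
    """Prepend a protocol prefix using the '::' chaining separator."""
    if url.endswith(TAR_SUFFIXES):
        # Mirrors the documented behavior: TAR archives must be iterated instead.
        raise NotImplementedError("TAR archives are not supported; iterate over the archive instead")
    if url.endswith(".zip"):
        return f"zip://::{url}"
    stem, ext = posixpath.splitext(url)
    protocol = SINGLE_FILE_PROTOCOLS.get(ext)
    if protocol is None:
        return url  # not compressed: nothing to chain
    inner_file = posixpath.basename(stem)  # name of the single decompressed file
    return f"{protocol}://{inner_file}::{url}"


resolved = resolve_url("https://example.com/ds", "data/train.jsonl.gz")
chained = chain_extraction(resolved)
```

No bytes are fetched at any point; both steps are pure string manipulation, which is what makes the resolution lazy.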
The class sets is_streaming = True as a class attribute, which dataset builders can check to adjust their behavior.
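As a minimal sketch of how a builder might branch on this attribute, the example below uses two hypothetical stub classes in place of real download managers. The function name and the returned strategy labels are illustrative assumptions, not part of the library.

```python
class LocalManagerStub:
    """Hypothetical stand-in for a non-streaming download manager."""
    is_streaming = False


class StreamingManagerStub:
    """Hypothetical stand-in exposing the same flag as StreamingDownloadManager."""
    is_streaming = True


def choose_read_strategy(dl_manager) -> str:
    """Pick a reading strategy based on the manager's is_streaming flag."""
    # getattr with a default keeps the check safe for objects lacking the attribute.
    if getattr(dl_manager, "is_streaming", False):
        return "iterate-remote"  # e.g. walk an archive lazily instead of extracting it
    return "read-local"  # e.g. open files that already exist on disk


strategies = [choose_read_strategy(m) for m in (LocalManagerStub(), StreamingManagerStub())]
```

Because the flag is a class attribute, the check works on an instance without any extra constructor arguments.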
Usage
Use StreamingDownloadManager when implementing or debugging dataset builders that need to operate in streaming mode. It is automatically instantiated by the library when load_dataset(..., streaming=True) is called; direct instantiation is typically only needed in custom builder implementations.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/download/streaming_download_manager.py
- Lines: L47-L219
Signature
class StreamingDownloadManager:
    is_streaming = True

    def __init__(
        self,
        dataset_name: Optional[str] = None,
        data_dir: Optional[str] = None,
        download_config: Optional[DownloadConfig] = None,
        base_path: Optional[str] = None,
    ):
Import
from datasets.download.streaming_download_manager import StreamingDownloadManager
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_name | Optional[str] | No | Name of the dataset being managed. |
| data_dir | Optional[str] | No | Path to the data directory (used as manual_dir). |
| download_config | Optional[DownloadConfig] | No | Configuration for download behavior (authentication, proxies, etc.). |
| base_path | Optional[str] | No | Base path for resolving relative URLs. Defaults to the current working directory. |
Outputs
| Name | Type | Description |
|---|---|---|
| (instance) | StreamingDownloadManager | A download manager that transforms URLs lazily without downloading files. |
Usage Examples
Basic Usage
from datasets.download.streaming_download_manager import StreamingDownloadManager
dl_manager = StreamingDownloadManager(
    base_path="https://huggingface.co/datasets/my_user/my_dataset/resolve/main"
)
# Resolve a download URL (no actual download occurs)
url = dl_manager.download("data/train.jsonl")
# Returns: "https://huggingface.co/datasets/my_user/my_dataset/resolve/main/data/train.jsonl"
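# Resolve extraction (prepends protocol for compressed files).
# download_and_extract() first resolves the relative path against base_path;
# extract() on its own only prepends the protocol prefix.
extracted_url = dl_manager.download_and_extract("data/train.jsonl.gz")
# Returns: "gzip://train.jsonl::https://...data/train.jsonl.gz"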
# Iterate over files in a remote archive
archive_url = dl_manager.download("data/train.tar.gz")
for filename, file_obj in dl_manager.iter_archive(archive_url):
    content = file_obj.read()
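The (path, file_object) contract of iter_archive can be tried out locally with the standard library alone. The sketch below is not the ArchiveIterable implementation: iter_tar_members and build_demo_archive are hypothetical helpers, and an in-memory tar.gz stands in for a remote archive.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_tar_members(fileobj: BinaryIO) -> Iterator[Tuple[str, BinaryIO]]:
    """Yield (path, file_object) for each regular file, reading sequentially."""
    # Mode "r|*" reads the archive as a forward-only stream and
    # detects the compression transparently.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            extracted = archive.extractfile(member)
            if extracted is not None:
                yield member.name, extracted


def build_demo_archive() -> io.BytesIO:
    """Create a small in-memory tar.gz archive (illustrative data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in [("data/a.txt", b"alpha"), ("data/b.txt", b"beta")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


# Each file object is read before the iterator advances to the next member.
contents = {path: f.read() for path, f in iter_tar_members(build_demo_archive())}
```

Because the stream is forward-only, each file object should be consumed inside the loop body, as in the iter_archive loop above; holding on to it after advancing is not safe in this sketch.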