Implementation:Huggingface Datasets DownloadManager

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

DownloadManager is the concrete class in the HuggingFace Datasets library that manages file downloads, including caching, extraction, and progress tracking.

Description

DownloadManager is the class responsible for downloading, extracting, and caching dataset files. Its download(), extract(), and download_and_extract() methods accept a single URL, a list of URLs, or a nested dictionary of URLs, and return local paths in the same structure. Internally, it delegates to cached_path() for the actual HTTP fetching and cache management, uses thread pools to download many small files in parallel, and records the size and, when record_checksums is enabled, the checksum of every downloaded file. It also supports iterating over archives and file trees through iter_archive() and iter_files(). DatasetBuilder.download_and_prepare() instantiates the class and passes it to the dataset-specific _split_generators() method, where the dataset script uses it to acquire raw data.
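The "same structure in, same structure out" behaviour can be illustrated with a small standalone sketch. This is not the library's implementation: map_nested_urls and fake_fetch are hypothetical names, and fake_fetch only derives a made-up cache path instead of performing a real download.

```python
from typing import Any, Callable


def map_nested_urls(fetch: Callable[[str], str], url_or_urls: Any) -> Any:
    """Apply `fetch` to every URL while preserving the container shape."""
    if isinstance(url_or_urls, dict):
        return {key: map_nested_urls(fetch, value) for key, value in url_or_urls.items()}
    if isinstance(url_or_urls, (list, tuple)):
        return [map_nested_urls(fetch, item) for item in url_or_urls]
    # Leaf case: a single URL string.
    return fetch(url_or_urls)


def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a download: build a cache path from the file name.
    return "/cache/" + url.rsplit("/", 1)[-1]


urls = {
    "train": ["https://example.com/a.csv", "https://example.com/b.csv"],
    "test": "https://example.com/test.csv",
}
local_paths = map_nested_urls(fake_fetch, urls)
```

The result keeps the keys and list layout of the input, so a dataset script can index the returned paths exactly as it indexed the URLs.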

Usage

Use DownloadManager when implementing a custom dataset loading script and you need to download or extract files within the _split_generators method. In most workflows, DownloadManager is created automatically by the builder; direct instantiation is needed only for advanced or custom pipelines.

Code Reference

Source Location

Repository: datasets
File: src/datasets/download/download_manager.py
Lines: L71-L341

Signature

class DownloadManager:
    is_streaming = False

    def __init__(
        self,
        dataset_name: Optional[str] = None,
        data_dir: Optional[str] = None,
        download_config: Optional[DownloadConfig] = None,
        base_path: Optional[str] = None,
        record_checksums=True,
    ):

Import

from datasets.download.download_manager import DownloadManager

I/O Contract

Inputs

Name	Type	Required	Description
dataset_name	`str`	No	Name of the dataset this manager is used for. Used for logging and tracking.
data_dir	`str`	No	Manual directory to get files from, used for datasets that require manual download.
download_config	`DownloadConfig`	No	Configuration for cache directory, force download/extract flags, proxy settings, number of processes, authentication token, and storage options.
base_path	`str`	No	Base path for resolving relative URLs. Can be a local directory or remote URL. Defaults to `os.path.abspath(".")`.
record_checksums	`bool`	No	Whether to record checksums of downloaded files. Defaults to `True`.
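
To make the effect of record_checksums concrete, the following sketch shows one way a size-and-checksum record for a file could be computed. The function name size_and_checksum, the dictionary keys, and the choice of SHA-256 are assumptions made for this illustration, not a statement of what DownloadManager does internally.

```python
import hashlib
import os
import tempfile


def size_and_checksum(path: str, record_checksum: bool = True) -> dict:
    """Return the file size and, optionally, a SHA-256 hex digest."""
    checksum = None
    if record_checksum:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files are not loaded into memory at once.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        checksum = digest.hexdigest()
    return {"num_bytes": os.path.getsize(path), "checksum": checksum}


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"hello")
    recorded = size_and_checksum(sample)
    size_only = size_and_checksum(sample, record_checksum=False)
```

Skipping the checksum avoids a full read of each file, which is the main reason to make it optional in a sketch like this.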

Outputs

Name	Type	Description
(instance)	`DownloadManager`	A configured download manager ready to download, extract, and iterate over files.

Key methods and their return types:

Method	Returns	Description
`download(url_or_urls)`	`str` or `list` or `dict`	Local path(s) of downloaded files, matching the structure of the input.
`extract(path_or_paths)`	`str` or `list` or `dict`	Local path(s) of extracted files, matching the structure of the input.
`download_and_extract(url_or_urls)`	`str` or `list` or `dict`	Local path(s) of downloaded and extracted files.
`iter_archive(path_or_buf)`	`Iterator[tuple[str, io.BufferedReader]]`	Yields (path_within_archive, file_object) pairs.
`iter_files(paths)`	`Iterator[str]`	Yields file paths found under the given root paths.
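
As a rough picture of what iterating over file trees involves, here is a standalone sketch built on os.walk. The name iter_files_sketch is hypothetical, and the sorting and the skipping of dot-prefixed entries are assumptions of this example rather than documented guarantees of iter_files().

```python
import os
import tempfile
from typing import Iterator, List, Union


def iter_files_sketch(paths: Union[str, List[str]]) -> Iterator[str]:
    """Yield every regular file found under the given root path(s)."""
    if isinstance(paths, str):
        paths = [paths]
    for root in paths:
        if os.path.isfile(root):
            yield root
            continue
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune hidden directories in place and sort for a stable order.
            dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
            for filename in sorted(filenames):
                if not filename.startswith("."):
                    yield os.path.join(dirpath, filename)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("b.txt", "a.txt", ".hidden", os.path.join("sub", "c.txt")):
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("x")
    found = [os.path.relpath(p, tmp) for p in iter_files_sketch(tmp)]
```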

Usage Examples

Basic Usage

import datasets

# Typically used inside a DatasetBuilder._split_generators method:
def _split_generators(self, dl_manager):
    # For an archive URL, download_and_extract returns the local path of
    # the extracted directory, not of a single file.
    extracted_dir = dl_manager.download_and_extract(
        "https://example.com/data/train.tar.gz"
    )
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"data_dir": extracted_dir},
        ),
    ]

Downloading Multiple Files

def _split_generators(self, dl_manager):
    urls = {
        "train": "https://example.com/train.csv",
        "test": "https://example.com/test.csv",
    }
    downloaded = dl_manager.download(urls)
    # downloaded["train"] and downloaded["test"] are local file paths
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"filepath": downloaded["train"]},
        ),
        datasets.SplitGenerator(
            name=datasets.Split.TEST,
            gen_kwargs={"filepath": downloaded["test"]},
        ),
    ]

Iterating Over an Archive

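def _split_generators(self, dl_manager):
    # Download only (no extraction): iter_archive reads members
    # directly from the archive file.
    archive = dl_manager.download(
        "https://example.com/data.tar.gz"
    )
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            # iter_archive yields (path_within_archive, file_object) pairs.
            gen_kwargs={"archive_iter": dl_manager.iter_archive(archive)},
        ),
    ]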
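
The iterator passed through gen_kwargs is later consumed by the builder's example generator. The sketch below shows one way such a consumer could look, written as a free function so it runs on its own. The names iter_tar_members and generate_examples are hypothetical; iter_tar_members is a stdlib stand-in that merely mimics the (path_within_archive, file_object) pairs described above, and the ".txt" filter and example fields are assumptions for illustration.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_tar_members(fileobj: BinaryIO) -> Iterator[Tuple[str, BinaryIO]]:
    """Stand-in for an archive iterator: yields (path, file object) pairs."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def generate_examples(archive_iter):
    """Turn (path, file object) pairs into (key, example) pairs for .txt members."""
    key = 0
    for path, f in archive_iter:
        if not path.endswith(".txt"):
            continue
        # Read each member while it is current; the file object is only
        # guaranteed to be usable during its own iteration step.
        yield key, {"path": path, "text": f.read().decode("utf-8")}
        key += 1


# Build a small gzip-compressed tar archive in memory to exercise the sketch.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    for name, payload in [("data/a.txt", b"first"), ("data/skip.bin", b"\x00"), ("data/b.txt", b"second")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

examples = list(generate_examples(iter_tar_members(buffer)))
```

Reading each member inside the loop, rather than collecting the file objects first, keeps the access pattern sequential, which is what makes iterating an archive without extracting it practical.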

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Download_Management
