Implementation:Huggingface Datasets DownloadManager

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for managing file downloads with caching, extraction, and progress tracking.

Description

DownloadManager is the class responsible for downloading, extracting, and caching dataset files. Its download(), extract(), and download_and_extract() methods accept a single URL, a list, or a nested dictionary of URLs and return local paths in the same structure. Internally it delegates to cached_path() for the actual HTTP fetching and cache management, uses thread pools to download many small files in parallel, and records the size and checksum of every downloaded file. It also supports iterating over archives and file trees via iter_archive() and iter_files(). The class is instantiated by DatasetBuilder.download_and_prepare() and passed to each dataset-specific _split_generators() method, where the dataset script uses it to acquire raw data.
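The uniform handling of single URLs, lists, and nested dictionaries can be pictured with a small standard-library sketch. It is not the library's implementation: `map_structure` and `fake_download` are hypothetical names introduced for illustration, and no network access takes place.

```python
from typing import Any, Callable


def map_structure(fn: Callable[[str], str], obj: Any) -> Any:
    """Apply fn to every leaf, preserving the shape of the container."""
    if isinstance(obj, dict):
        return {key: map_structure(fn, value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(map_structure(fn, item) for item in obj)
    return fn(obj)


def fake_download(url: str) -> str:
    """Hypothetical stand-in for a real fetch: maps a URL to a cache path."""
    return "/cache/" + url.rsplit("/", 1)[-1]


urls = {
    "train": ["https://example.com/a.csv", "https://example.com/b.csv"],
    "test": "https://example.com/test.csv",
}
# The result has the same keys and nesting as the input.
local_paths = map_structure(fake_download, urls)
```

The point of the sketch is the shape-preserving recursion: a caller can pass whatever structure is convenient and index the result the same way.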

Usage

Use DownloadManager when you implement a custom dataset loading script and need to download or extract files within the _split_generators method. In most workflows the builder creates the DownloadManager automatically; direct instantiation is needed only for advanced or custom pipelines.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/download/download_manager.py
  • Lines: L71-L341

Signature

class DownloadManager:
    is_streaming = False

    def __init__(
        self,
        dataset_name: Optional[str] = None,
        data_dir: Optional[str] = None,
        download_config: Optional[DownloadConfig] = None,
        base_path: Optional[str] = None,
        record_checksums=True,
    ):

Import

from datasets.download.download_manager import DownloadManager

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| dataset_name | str | No | Name of the dataset this manager is used for. Used for logging and tracking. |
| data_dir | str | No | Manual directory to get files from, used for datasets that require manual download. |
| download_config | DownloadConfig | No | Configuration for cache directory, force download/extract flags, proxy settings, number of processes, authentication token, and storage options. |
| base_path | str | No | Base path for resolving relative URLs. Can be a local directory or remote URL. Defaults to os.path.abspath("."). |
| record_checksums | bool | No | Whether to record checksums of downloaded files. Defaults to True. |
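To make the record_checksums option more concrete, here is a minimal standard-library sketch of recording a file's size and checksum. The helper name `size_and_checksum`, the returned keys, and the choice of SHA-256 are assumptions for illustration, not a statement of what the library does internally.

```python
import hashlib
import os
import tempfile


def size_and_checksum(path: str, record_checksum: bool = True) -> dict:
    """Return the byte size of a file and, optionally, its SHA-256 digest."""
    checksum = None
    if record_checksum:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files are never loaded whole.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        checksum = digest.hexdigest()
    return {"num_bytes": os.path.getsize(path), "checksum": checksum}


# Demonstrate on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
record = size_and_checksum(tmp.name)
record_without_hash = size_and_checksum(tmp.name, record_checksum=False)
os.remove(tmp.name)
```

Skipping the checksum avoids a full read of each file, which is the usual reason to turn such an option off for very large downloads.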

Outputs

| Name | Type | Description |
|---|---|---|
| (instance) | DownloadManager | A configured download manager ready to download, extract, and iterate over files. |

Key methods and their return types:

| Method | Returns | Description |
|---|---|---|
| download(url_or_urls) | str or list or dict | Local path(s) of downloaded files, matching the structure of the input. |
| extract(path_or_paths) | str or list or dict | Local path(s) of extracted files, matching the structure of the input. |
| download_and_extract(url_or_urls) | str or list or dict | Local path(s) of downloaded and extracted files. |
| iter_archive(path_or_buf) | Iterator[tuple[str, io.BufferedReader]] | Yields (path_within_archive, file_object) pairs. |
| iter_files(paths) | Iterator[str] | Yields file paths found under the given root paths. |
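The two iteration contracts can be approximated with the standard library alone. The sketch below uses hypothetical helpers, `iter_tar_members` and `iter_tree_files`, that mimic the return shapes listed above for a local tar archive and a local directory; it does not reproduce the library's own code and ignores remote paths and other archive formats.

```python
import io
import os
import tarfile
import tempfile
from typing import Iterator, Tuple


def iter_tar_members(path: str) -> Iterator[Tuple[str, io.BufferedReader]]:
    """Yield (path_within_archive, file_object) for each regular file in a tar."""
    with tarfile.open(path) as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


def iter_tree_files(root: str) -> Iterator[str]:
    """Yield every file path under root, in sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


with tempfile.TemporaryDirectory() as tmp:
    # Create a small file tree and pack it into a tar.gz archive.
    data_dir = os.path.join(tmp, "data")
    os.makedirs(data_dir)
    for name, text in [("a.txt", "alpha"), ("b.txt", "beta")]:
        with open(os.path.join(data_dir, name), "w") as f:
            f.write(text)
    archive_path = os.path.join(tmp, "data.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(data_dir, arcname="data")

    tree_files = [os.path.relpath(p, tmp) for p in iter_tree_files(data_dir)]
    # Read each member while the iterator is still positioned on it.
    members = {name: f.read().decode() for name, f in iter_tar_members(archive_path)}
```

In this sketch the yielded file objects are only valid while the archive is open, so each one is consumed before the iterator advances.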

Usage Examples

Basic Usage

import datasets

# Typically used inside a DatasetBuilder._split_generators method:
def _split_generators(self, dl_manager):
    # For an archive URL, download_and_extract returns the local path
    # of the extracted contents.
    extracted_path = dl_manager.download_and_extract(
        "https://example.com/data/train.tar.gz"
    )
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"filepath": extracted_path},
        ),
    ]

Downloading Multiple Files

def _split_generators(self, dl_manager):
    urls = {
        "train": "https://example.com/train.csv",
        "test": "https://example.com/test.csv",
    }
    downloaded = dl_manager.download(urls)
    # downloaded["train"] and downloaded["test"] are local file paths
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"filepath": downloaded["train"]},
        ),
        datasets.SplitGenerator(
            name=datasets.Split.TEST,
            gen_kwargs={"filepath": downloaded["test"]},
        ),
    ]
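The gen_kwargs built above are later passed to the builder's example generator. As a rough sketch of that consuming side, the function below reads a local CSV path with the standard library; `generate_csv_examples`, the column names, and the temporary file are hypothetical, and in a real builder this logic would live in a `_generate_examples(self, filepath)` method.

```python
import csv
import os
import tempfile
from typing import Iterator, Tuple


def generate_csv_examples(filepath: str) -> Iterator[Tuple[int, dict]]:
    """Yield (row_index, row_dict) pairs from a local CSV file."""
    with open(filepath, newline="", encoding="utf-8") as f:
        for idx, row in enumerate(csv.DictReader(f)):
            yield idx, dict(row)


# Stand-in for a local path such as downloaded["train"].
with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.csv")
    with open(train_path, "w", newline="", encoding="utf-8") as f:
        f.write("text,label\nhello,1\nworld,0\n")
    rows = list(generate_csv_examples(train_path))
```

Note that csv.DictReader returns every value as a string, so numeric columns would still need converting before use.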

Iterating Over an Archive

def _split_generators(self, dl_manager):
    # Download only: iter_archive reads members directly from the
    # archive, so no separate extract step is needed.
    archive = dl_manager.download(
        "https://example.com/data.tar.gz"
    )
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"archive_iter": dl_manager.iter_archive(archive)},
        ),
    ]
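The archive_iter passed through gen_kwargs yields (path_within_archive, file_object) pairs, so the generator that receives it only has to filter and read. A minimal sketch of that consumer follows; `generate_archive_examples`, the `.txt` filter, and the in-memory stand-in for the iterator are assumptions for illustration.

```python
import io
from typing import Iterable, Iterator, Tuple


def generate_archive_examples(
    archive_iter: Iterable[Tuple[str, io.BufferedIOBase]],
) -> Iterator[Tuple[str, dict]]:
    """Yield (key, example) pairs for every .txt member of the archive."""
    for path, file_obj in archive_iter:
        if not path.endswith(".txt"):
            continue  # skip members that are not text files
        # Read inside the loop: the file object may no longer be usable
        # once the iterator has moved on to the next member.
        yield path, {"text": file_obj.read().decode("utf-8")}


# Stand-in for dl_manager.iter_archive(...), built from in-memory buffers.
fake_archive = [
    ("data/a.txt", io.BytesIO(b"first")),
    ("data/readme.md", io.BytesIO(b"ignored")),
    ("data/b.txt", io.BytesIO(b"second")),
]
examples = list(generate_archive_examples(fake_archive))
```

Using the member path as the example key keeps keys unique without any extra bookkeeping, as long as paths inside the archive do not repeat.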

Related Pages

Implements Principle
