Implementation:Huggingface Datasets DownloadManager
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for managing file downloads with caching, extraction, and progress tracking.
Description
DownloadManager is the class responsible for downloading, extracting, and caching dataset files. Its download(), extract(), and download_and_extract() methods accept a single URL, a list, or a nested dictionary of URLs and return results in the same structure. Internally, it delegates to cached_path() for the actual HTTP fetching and cache management, uses thread pools to download many small files in parallel, and records the size and checksum of every downloaded file. It also supports iterating over archives and file trees via iter_archive() and iter_files(). DatasetBuilder.download_and_prepare() instantiates the class and passes it to the dataset-specific _split_generators() method, where the dataset script uses it to acquire raw data.
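The uniform handling of single URLs, lists, and nested dictionaries can be pictured as a recursive map that preserves the shape of its input. The following is a minimal sketch of that idea only, not the library's implementation: `map_structure` and `fake_download` are hypothetical names, and the cache path is invented for illustration.

```python
from typing import Any, Callable


def map_structure(fn: Callable[[str], str], data: Any) -> Any:
    """Apply fn to every leaf, keeping the container shape (hypothetical helper)."""
    if isinstance(data, dict):
        return {key: map_structure(fn, value) for key, value in data.items()}
    if isinstance(data, (list, tuple)):
        return [map_structure(fn, item) for item in data]
    return fn(data)


def fake_download(url: str) -> str:
    """Stand-in for a real download: maps a URL to an invented local cache path."""
    return "/tmp/cache/" + url.rsplit("/", 1)[-1]


urls = {
    "train": ["https://example.com/a.csv", "https://example.com/b.csv"],
    "test": "https://example.com/test.csv",
}
# The result has the same keys and nesting as `urls`, with paths in place of URLs.
local_paths = map_structure(fake_download, urls)
```

Because the output mirrors the input, a dataset script can index the result with the same keys it used for the URLs.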
Usage
Use DownloadManager when implementing a custom dataset loading script and you need to download or extract files within the _split_generators method. In most workflows, DownloadManager is created automatically by the builder; direct instantiation is needed only for advanced or custom pipelines.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/download/download_manager.py
- Lines: L71-L341
Signature
class DownloadManager:
    is_streaming = False

    def __init__(
        self,
        dataset_name: Optional[str] = None,
        data_dir: Optional[str] = None,
        download_config: Optional[DownloadConfig] = None,
        base_path: Optional[str] = None,
        record_checksums=True,
    ):
Import
from datasets.download.download_manager import DownloadManager
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_name | str | No | Name of the dataset this manager is used for. Used for logging and tracking. |
| data_dir | str | No | Manual directory to get files from, used for datasets that require manual download. |
| download_config | DownloadConfig | No | Configuration for cache directory, force download/extract flags, proxy settings, number of processes, authentication token, and storage options. |
| base_path | str | No | Base path for resolving relative URLs. Can be a local directory or remote URL. Defaults to os.path.abspath("."). |
| record_checksums | bool | No | Whether to record checksums of downloaded files. Defaults to True. |
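To make the record_checksums flag more concrete, here is a rough sketch of what recording a file's size and checksum could look like. It is an illustration under stated assumptions rather than the library's code: `size_and_checksum` is a hypothetical helper, and the choice of SHA-256 and the `num_bytes`/`checksum` keys are assumptions.

```python
import hashlib
import os
import tempfile


def size_and_checksum(path: str, record_checksum: bool = True) -> dict:
    """Return the file size and, optionally, its SHA-256 hex digest (hypothetical helper)."""
    checksum = None
    if record_checksum:
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large files are never fully loaded into memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        checksum = digest.hexdigest()
    return {"num_bytes": os.path.getsize(path), "checksum": checksum}


# Demo on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
info = size_and_checksum(tmp.name)
info_without_checksum = size_and_checksum(tmp.name, record_checksum=False)
os.remove(tmp.name)
```

Skipping the checksum avoids reading the whole file a second time, which is the usual reason to turn such a flag off for very large downloads.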
Outputs
| Name | Type | Description |
|---|---|---|
| (instance) | DownloadManager | A configured download manager ready to download, extract, and iterate over files. |
Key methods and their return types:
| Method | Returns | Description |
|---|---|---|
| download(url_or_urls) | str or list or dict | Local path(s) of downloaded files, matching the structure of the input. |
| extract(path_or_paths) | str or list or dict | Local path(s) of extracted files, matching the structure of the input. |
| download_and_extract(url_or_urls) | str or list or dict | Local path(s) of downloaded and extracted files. |
| iter_archive(path_or_buf) | Iterator[tuple[str, io.BufferedReader]] | Yields (path_within_archive, file_object) pairs. |
| iter_files(paths) | Iterator[str] | Yields file paths found under the given root paths. |
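The behaviour described for iter_files() can be approximated with the standard library. The sketch below is a simplified stand-in, not the library's implementation: `iter_files_sketch` is a hypothetical name, and details such as ordering or any filtering of hidden files are assumptions left out here.

```python
import os
import tempfile
from typing import Iterator, List, Union


def iter_files_sketch(paths: Union[str, List[str]]) -> Iterator[str]:
    """Yield every file path under the given root path(s) (hypothetical helper)."""
    if isinstance(paths, str):
        paths = [paths]
    for root in paths:
        if os.path.isfile(root):
            # A root that is already a file is yielded as-is.
            yield root
            continue
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # visit subdirectories in a stable order
            for filename in sorted(filenames):
                yield os.path.join(dirpath, filename)


# Demo on a small temporary directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("x")
    found = [os.path.relpath(p, root) for p in iter_files_sketch(root)]
```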
Usage Examples
Basic Usage
# Typically used inside a DatasetBuilder._split_generators method:
def _split_generators(self, dl_manager):
    downloaded_files = dl_manager.download_and_extract(
        "https://example.com/data/train.tar.gz"
    )
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"filepath": downloaded_files},
        ),
    ]
Downloading Multiple Files
def _split_generators(self, dl_manager):
    urls = {
        "train": "https://example.com/train.csv",
        "test": "https://example.com/test.csv",
    }
    downloaded = dl_manager.download(urls)
    # downloaded["train"] and downloaded["test"] are local file paths
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"filepath": downloaded["train"]},
        ),
        datasets.SplitGenerator(
            name=datasets.Split.TEST,
            gen_kwargs={"filepath": downloaded["test"]},
        ),
    ]
Iterating Over an Archive
def _split_generators(self, dl_manager):
    archive = dl_manager.download(
        "https://example.com/data.tar.gz"
    )
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={"archive_iter": dl_manager.iter_archive(archive)},
        ),
    ]
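On the consuming side, the archive_iter passed through gen_kwargs yields (path_within_archive, file_object) pairs. The sketch below shows one way a generator might turn those pairs into examples. It is written as a free function so it runs on its own; in a dataset script the same logic would typically live in the builder's _generate_examples method. `generate_examples`, `fake_iter_archive`, the `.txt` filter, and the `text` field are all hypothetical, and the stand-in iterator uses the standard tarfile module instead of a real DownloadManager.

```python
import io
import tarfile
from typing import Iterator, Tuple


def generate_examples(archive_iter) -> Iterator[Tuple[str, dict]]:
    """Turn (path_within_archive, file_object) pairs into (key, example) pairs."""
    for path, file_obj in archive_iter:
        # Keep only text files; the archive path doubles as the example key.
        if path.endswith(".txt"):
            yield path, {"text": file_obj.read().decode("utf-8")}


def fake_iter_archive(tar_bytes: bytes):
    """Stand-in for dl_manager.iter_archive, built on the standard tarfile module."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member)


# Build a tiny in-memory tar archive to drive the sketch.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, payload in [("a.txt", b"hello"), ("b.bin", b"\x00")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

examples = list(generate_examples(fake_iter_archive(buffer.getvalue())))
```

Reading each file object as it is yielded, without seeking back to earlier members, keeps the sketch compatible with sequential archive access.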