Implementation:Huggingface Datasets DownloadConfig
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool for configuring download behavior (caching, proxies, retries) before loading datasets, provided by the HuggingFace Datasets library.
Description
DownloadConfig is a Python @dataclass that groups the parameters controlling how dataset files are downloaded, cached, extracted, and authenticated. Most dataset loading and inspection functions accept it (e.g. load_dataset, load_dataset_builder, get_dataset_config_names) and pass it through to the underlying download manager. Its copy() method returns an independent copy that can be modified without affecting the original, and a custom __setattr__ propagates token changes to storage_options.
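The copy-and-propagate behavior described above can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the library's source: MiniDownloadConfig is a hypothetical name, it models only two of the fields, and the exact propagation rule (write the token under an "hf" key unless one is already stored there) is assumed for illustration.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Optional, Union


@dataclass
class MiniDownloadConfig:
    """Hypothetical stand-in, NOT the library class: models only two fields."""

    token: Optional[Union[str, bool]] = None
    storage_options: dict[str, Any] = field(default_factory=dict)

    def copy(self) -> "MiniDownloadConfig":
        # Deep-copy every field so the copy shares no mutable state with the original.
        return self.__class__(**{k: copy.deepcopy(v) for k, v in self.__dict__.items()})

    def __setattr__(self, name: str, value: Any) -> None:
        # Assumed rule: mirror a newly assigned token into storage_options["hf"]
        # unless a token is already stored there.
        if name == "token" and "storage_options" in self.__dict__:
            self.storage_options.setdefault("hf", {}).setdefault("token", value)
        super().__setattr__(name, value)


base = MiniDownloadConfig()
authed = base.copy()
authed.token = "hf_example"  # placeholder value, not a real token

print(authed.storage_options)  # the token now also appears under the "hf" key
print(base.storage_options)    # the original is untouched
```

Because the copy is deep, assigning a token on the copy leaves the original's storage_options empty, which is the property that makes copy() safe for deriving per-request variants of a shared base configuration.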
Usage
Use DownloadConfig when you need fine-grained control over download behavior beyond the defaults. Pass an instance as the download_config parameter to any dataset loading or inspection function.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/download/download_config.py
- Lines: 10-81
Signature
@dataclass
class DownloadConfig:
    cache_dir: Optional[Union[str, Path]] = None
    force_download: bool = False
    resume_download: bool = False
    local_files_only: bool = False
    proxies: Optional[dict] = None
    user_agent: Optional[str] = None
    extract_compressed_file: bool = False
    force_extract: bool = False
    delete_extracted: bool = False
    extract_on_the_fly: bool = False
    use_etag: bool = True
    num_proc: Optional[int] = None
    max_retries: int = 1
    token: Optional[Union[str, bool]] = None
    storage_options: dict[str, Any] = field(default_factory=dict)
    download_desc: Optional[str] = None
    disable_tqdm: bool = False
Import
from datasets import DownloadConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cache_dir | Optional[Union[str, Path]] | No | Custom cache directory to save files to (overrides default cache dir). |
| force_download | bool | No | If True, re-download files even if already cached. Default: False. |
| resume_download | bool | No | If True, resume incomplete downloads. Default: False. |
| local_files_only | bool | No | If True, only use locally cached files without making network requests. Default: False. |
| proxies | Optional[dict] | No | Dictionary of proxy URLs keyed by protocol. |
| user_agent | Optional[str] | No | Custom string appended to the user-agent header on remote requests. |
| extract_compressed_file | bool | No | If True, extract zip/tar files in a folder alongside the archive. Default: False. |
| force_extract | bool | No | If True, re-extract archives even if already extracted. Default: False. |
| delete_extracted | bool | No | Whether to delete extracted files after use. Default: False. |
| extract_on_the_fly | bool | No | If True, extract compressed files during reading. Default: False. |
| use_etag | bool | No | Whether to use ETag headers to validate cached files. Default: True. |
| num_proc | Optional[int] | No | Number of parallel download processes. |
| max_retries | int | No | Number of HTTP request retries on failure. Default: 1. |
| token | Optional[Union[str, bool]] | No | Bearer token for Hub authentication. If True, reads from ~/.huggingface. |
| storage_options | dict[str, Any] | No | Key/value pairs passed to the dataset file-system backend. Default: empty dict. |
| download_desc | Optional[str] | No | Description displayed alongside the download progress bar. |
| disable_tqdm | bool | No | Whether to disable the download progress bar. Default: False. |
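Since every input is optional, one way to assemble a configuration is to collect only the values that are actually set and pass them as keyword arguments. The sketch below is an illustration rather than part of the library: download_kwargs_from_env is a hypothetical helper, the mapping from environment variables to fields is an assumption, and MY_DATASETS_CACHE and MY_MAX_RETRIES are made-up variable names.

```python
import os
from typing import Any, Mapping, Optional


def download_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict[str, Any]:
    """Hypothetical helper: map environment variables to DownloadConfig keyword arguments."""
    env = os.environ if env is None else env
    kwargs: dict[str, Any] = {}

    # Proxies are keyed by protocol, matching the `proxies` row in the table.
    proxies = {
        scheme: env[var]
        for scheme, var in (("http", "HTTP_PROXY"), ("https", "HTTPS_PROXY"))
        if env.get(var)
    }
    if proxies:
        kwargs["proxies"] = proxies

    # Assumption: the token is taken from HF_TOKEN when that variable is set.
    if env.get("HF_TOKEN"):
        kwargs["token"] = env["HF_TOKEN"]

    # MY_DATASETS_CACHE and MY_MAX_RETRIES are made-up names for this sketch.
    if env.get("MY_DATASETS_CACHE"):
        kwargs["cache_dir"] = env["MY_DATASETS_CACHE"]
    if env.get("MY_MAX_RETRIES"):
        kwargs["max_retries"] = int(env["MY_MAX_RETRIES"])
    return kwargs


# Passing an explicit mapping keeps the example independent of the real environment.
example = download_kwargs_from_env(
    {"HTTPS_PROXY": "http://proxy.example:3128", "MY_MAX_RETRIES": "3"}
)
print(example)
```

The resulting dictionary could then be unpacked as DownloadConfig(**example); fields that are absent from it keep the defaults listed in the table.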
Outputs
| Name | Type | Description |
|---|---|---|
| instance | DownloadConfig | A configured DownloadConfig dataclass instance ready to be passed to dataset loading functions. |
Usage Examples
Basic Usage
from datasets import load_dataset, DownloadConfig
# Configure downloads with a custom cache directory and retries
dl_config = DownloadConfig(
    cache_dir="/tmp/my_cache",
    max_retries=3,
    token="hf_xxxxxxxxxxxxx",
)
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", download_config=dl_config)
Offline / Local-Only Mode
from datasets import load_dataset, DownloadConfig
dl_config = DownloadConfig(local_files_only=True)
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", download_config=dl_config)
Copying and Modifying Configuration
from datasets import DownloadConfig
base_config = DownloadConfig(max_retries=3, use_etag=True)
# Create a modified copy with a specific token
authenticated_config = base_config.copy()
authenticated_config.token = "hf_xxxxxxxxxxxxx"