Implementation:Huggingface Datasets DownloadConfig

From Leeroopedia
Revision as of 12:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_DownloadConfig.md)
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

DownloadConfig is the configuration object in the HuggingFace Datasets library for controlling download behavior (caching, proxies, retries) when loading datasets.

Description

DownloadConfig is a Python @dataclass that encapsulates all parameters controlling how dataset files are downloaded, cached, extracted, and authenticated. It is accepted by most dataset loading and inspection functions (e.g. load_dataset, load_dataset_builder, get_dataset_config_names) and is threaded through to the underlying download manager. It provides a copy() method for safe modification and a custom __setattr__ that automatically propagates token changes to storage_options.
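
The interplay between copy() and the custom __setattr__ is easier to see in a stripped-down model. The following is a minimal sketch, not the library's code: SketchDownloadConfig is a hypothetical stand-in with only two fields, the "hf" key inside storage_options is an assumption made for illustration, and the real class may differ in details such as whether it overwrites an entry that already exists.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Optional, Union


@dataclass
class SketchDownloadConfig:
    """Hypothetical, trimmed-down stand-in; NOT the library's class."""

    token: Optional[Union[str, bool]] = None
    storage_options: dict[str, Any] = field(default_factory=dict)

    def copy(self) -> "SketchDownloadConfig":
        # Deep-copy so nested dicts are not shared with the original.
        return copy.deepcopy(self)

    def __setattr__(self, name: str, value: Any) -> None:
        # Mirror token assignments into storage_options.
        # The "hf" key is an assumption made for this sketch.
        # During __init__, token is assigned before storage_options
        # exists, so the getattr guard skips that first assignment.
        if name == "token" and getattr(self, "storage_options", None) is not None:
            self.storage_options.setdefault("hf", {})["token"] = value
        super().__setattr__(name, value)


cfg = SketchDownloadConfig()
cfg.token = "hf_xxxxxxxxxxxxx"  # also written into cfg.storage_options

clone = cfg.copy()
clone.token = "hf_yyyyyyyyyyyyy"  # changes only the clone's storage_options
```

Because copy() deep-copies the instance, changing the token on the clone leaves the storage_options of the original untouched, which is the property that makes copy-then-modify safe.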

Usage

Use DownloadConfig when you need fine-grained control over download behavior beyond the defaults. Pass an instance as the download_config parameter to any dataset loading or inspection function.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/download/download_config.py
  • Lines: 10-81

Signature

@dataclass
class DownloadConfig:
    cache_dir: Optional[Union[str, Path]] = None
    force_download: bool = False
    resume_download: bool = False
    local_files_only: bool = False
    proxies: Optional[dict] = None
    user_agent: Optional[str] = None
    extract_compressed_file: bool = False
    force_extract: bool = False
    delete_extracted: bool = False
    extract_on_the_fly: bool = False
    use_etag: bool = True
    num_proc: Optional[int] = None
    max_retries: int = 1
    token: Optional[Union[str, bool]] = None
    storage_options: dict[str, Any] = field(default_factory=dict)
    download_desc: Optional[str] = None
    disable_tqdm: bool = False

Import

from datasets import DownloadConfig

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| cache_dir | Optional[Union[str, Path]] | No | Custom cache directory to save files to (overrides default cache dir). |
| force_download | bool | No | If True, re-download files even if already cached. Default: False. |
| resume_download | bool | No | If True, resume incomplete downloads. Default: False. |
| local_files_only | bool | No | If True, only use locally cached files without making network requests. Default: False. |
| proxies | Optional[dict] | No | Dictionary of proxy URLs keyed by protocol. |
| user_agent | Optional[str] | No | Custom string appended to the user-agent header on remote requests. |
| extract_compressed_file | bool | No | If True, extract zip/tar files in a folder alongside the archive. Default: False. |
| force_extract | bool | No | If True, re-extract archives even if already extracted. Default: False. |
| delete_extracted | bool | No | Whether to delete extracted files after use. Default: False. |
| extract_on_the_fly | bool | No | If True, extract compressed files during reading. Default: False. |
| use_etag | bool | No | Whether to use ETag headers to validate cached files. Default: True. |
| num_proc | Optional[int] | No | Number of parallel download processes. |
| max_retries | int | No | Number of HTTP request retries on failure. Default: 1. |
| token | Optional[Union[str, bool]] | No | Bearer token for Hub authentication. If True, reads from ~/.huggingface. |
| storage_options | dict[str, Any] | No | Key/value pairs passed to the dataset file-system backend. Default: empty dict. |
| download_desc | Optional[str] | No | Description displayed alongside the download progress bar. |
| disable_tqdm | bool | No | Whether to disable the download progress bar. Default: False. |
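
Since every input is an optional keyword argument, one convenient pattern is to collect the settings in a plain dict and unpack it into the constructor. The sketch below is an illustration under stated assumptions: download_kwargs_from_env is a hypothetical helper, MY_DATASETS_CACHE and MY_DATASETS_RETRIES are made-up variable names, and the proxies dict is assumed to follow the common {"http": ..., "https": ...} shape keyed by protocol.

```python
import os
from typing import Any, Mapping


def download_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Hypothetical helper: build DownloadConfig keyword arguments from env vars."""
    kwargs: dict[str, Any] = {}

    # Proxy URLs keyed by protocol, taken from the conventional variables.
    proxies: dict[str, str] = {}
    for scheme, var in (("http", "HTTP_PROXY"), ("https", "HTTPS_PROXY")):
        url = env.get(var) or env.get(var.lower())
        if url:
            proxies[scheme] = url
    if proxies:
        kwargs["proxies"] = proxies

    # MY_DATASETS_CACHE and MY_DATASETS_RETRIES are made-up names for this sketch.
    if env.get("MY_DATASETS_CACHE"):
        kwargs["cache_dir"] = env["MY_DATASETS_CACHE"]
    if env.get("MY_DATASETS_RETRIES"):
        kwargs["max_retries"] = int(env["MY_DATASETS_RETRIES"])

    return kwargs


example = download_kwargs_from_env(
    {"HTTPS_PROXY": "http://proxy.example:3128", "MY_DATASETS_RETRIES": "3"}
)
# With the library installed: dl_config = DownloadConfig(**example)
```

Keys that are absent from the environment are simply left out, so the corresponding fields keep the defaults listed in the table.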

Outputs

| Name | Type | Description |
|------|------|-------------|
| instance | DownloadConfig | A configured DownloadConfig dataclass instance ready to be passed to dataset loading functions. |

Usage Examples

Basic Usage

from datasets import load_dataset, DownloadConfig

# Configure downloads with a custom cache directory and retries
dl_config = DownloadConfig(
    cache_dir="/tmp/my_cache",
    max_retries=3,
    token="hf_xxxxxxxxxxxxx",
)

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", download_config=dl_config)

Offline / Local-Only Mode

from datasets import load_dataset, DownloadConfig

# Use only files already present in the local cache; no network requests are made
dl_config = DownloadConfig(local_files_only=True)
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", download_config=dl_config)

Copying and Modifying Configuration

from datasets import DownloadConfig

base_config = DownloadConfig(max_retries=3, use_etag=True)

# Create a modified copy with a specific token
authenticated_config = base_config.copy()
authenticated_config.token = "hf_xxxxxxxxxxxxx"
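
When several variants are derived from one shared base, the copy-then-modify steps can be wrapped in a small helper. This is a minimal sketch: with_overrides is a hypothetical function, not part of the library, and _StandInConfig exists only so the example runs without datasets installed. The helper relies on nothing beyond a copy() method and attribute assignment, so it should work with a real DownloadConfig as well.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Optional


def with_overrides(config, **overrides):
    """Hypothetical helper: return a modified copy, leaving `config` untouched."""
    new_config = config.copy()
    for name, value in overrides.items():
        # Reject misspelled option names instead of silently adding attributes.
        if not hasattr(new_config, name):
            raise AttributeError(f"unknown option: {name!r}")
        setattr(new_config, name, value)
    return new_config


@dataclass
class _StandInConfig:
    """Stand-in for DownloadConfig so this sketch is self-contained."""

    max_retries: int = 1
    token: Optional[str] = None
    storage_options: dict[str, Any] = field(default_factory=dict)

    def copy(self) -> "_StandInConfig":
        return copy.deepcopy(self)


base = _StandInConfig(max_retries=3)
authed = with_overrides(base, token="hf_xxxxxxxxxxxxx")
```

The base configuration stays token-free and can be reused, while each derived copy carries its own credentials.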

Related Pages

Implements Principle
