Implementation:Huggingface Transformers Fetch Hub Objects For CI
| Knowledge Sources | |
|---|---|
| Domains | CI_CD, Testing_Infrastructure |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Concrete tool for pre-downloading all external test data (images, audio, video, datasets, tokenizer files) needed by the CI test suite.
Description
The fetch_hub_objects_for_ci.py utility improves test reliability by pre-caching external dependencies before test execution. It maintains a hardcoded list of URLs for test data (COCO images, HuggingFace Hub datasets, audio/video samples). For HuggingFace URLs, it parses the URL pattern and uses hf_hub_download for authenticated downloads. For other external URLs, it uses httpx streaming downloads and validates the result by checking the start of the file for an HTML error page and enforcing a minimum file size. It also pre-downloads specific datasets, model files, and tokenizers based on CI flags.
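The URL-parsing step can be illustrated with a minimal sketch. It assumes the common Hub layout https://huggingface.co/[datasets/]<org>/<repo>/resolve/<revision>/<path>; the helper name parse_hub_resolve_url, the HubFile tuple, and the example URL are hypothetical and are not taken from the script itself.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class HubFile(NamedTuple):
    """Illustrative container; not part of the actual script."""
    repo_type: str
    repo_id: str
    revision: str
    filename: str


def parse_hub_resolve_url(url: str) -> HubFile:
    """Split a Hub 'resolve' URL into its components (assumed URL layout)."""
    parsed = urlparse(url)
    if parsed.netloc != "huggingface.co":
        raise ValueError(f"Not a Hub URL: {url}")
    parts = [p for p in parsed.path.split("/") if p]

    # Dataset and Space repos carry a type prefix; model repos do not.
    repo_type = "model"
    if parts and parts[0] in ("datasets", "spaces"):
        repo_type = parts[0][:-1]  # "datasets" -> "dataset"
        parts = parts[1:]

    if "resolve" not in parts:
        raise ValueError(f"No 'resolve' segment in: {url}")
    idx = parts.index("resolve")
    # Need at least one repo segment before, and revision + file after.
    if idx == 0 or len(parts) < idx + 3:
        raise ValueError(f"Incomplete Hub URL: {url}")

    return HubFile(
        repo_type=repo_type,
        repo_id="/".join(parts[:idx]),
        revision=parts[idx + 1],
        filename="/".join(parts[idx + 2:]),
    )


if __name__ == "__main__":
    # Hypothetical URL, used only to show the shape of the result.
    print(parse_hub_resolve_url(
        "https://huggingface.co/datasets/example-org/example-repo/resolve/main/images/cat.png"
    ))
```

This sketch treats the revision as a single path segment, so revisions that contain slashes would need extra handling.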
Usage
Run at the start of CI test jobs to pre-cache test data, preventing network-dependent flaky test failures.
Code Reference
Source Location
- Repository: Huggingface_Transformers
- File: utils/fetch_hub_objects_for_ci.py
- Lines: 1-303
Signature
def url_to_local_path(url: str) -> str:
    """Convert a URL to a local cache path."""

def parse_hf_url(url: str) -> Tuple[str, str, str]:
    """Parse a HuggingFace Hub URL into (repo_id, filename, revision)."""

def validate_downloaded_content(filepath: str) -> bool:
    """Check downloaded file is valid (not HTML error page, meets min size)."""

def download_test_file(url: str, target_dir: str) -> str:
    """Download a test file with validation and caching."""
Import
python utils/fetch_hub_objects_for_ci.py
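Of the functions listed under Signature, the content check is the easiest to illustrate in isolation. The following is a minimal sketch of that idea, not the script's implementation: the function name looks_like_valid_download, the HTML markers, the 512-byte read, and the 100-byte minimum are all assumptions chosen for illustration.

```python
import os

# Assumed markers and threshold; the real script may use different values.
HTML_MARKERS = (b"<!doctype html", b"<html")
MIN_SIZE_BYTES = 100


def looks_like_valid_download(filepath: str, min_size: int = MIN_SIZE_BYTES) -> bool:
    """Return False for missing files, tiny files, or HTML error pages."""
    if not os.path.isfile(filepath):
        return False
    if os.path.getsize(filepath) < min_size:
        return False
    # Only the start of the file is inspected, so large media files stay cheap to check.
    with open(filepath, "rb") as f:
        head = f.read(512).lstrip().lower()
    return not head.startswith(HTML_MARKERS)
```

A check like this catches the common failure where a server answers with an HTML error page and the response is saved under an image or audio filename.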
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| Hardcoded URL list | List[str] | Yes | URLs embedded in the script |
| HF_TOKEN | env var | No | HuggingFace token for authenticated downloads |
Outputs
| Name | Type | Description |
|---|---|---|
| Cached files | Files | Downloaded test data in local cache directory |
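Because HF_TOKEN is optional, a caller would typically pass a token only when the variable is set. The sketch below shows one way to do that with huggingface_hub.hf_hub_download; the helper names build_download_kwargs and fetch_from_hub and the default argument values are hypothetical, and the actual script may wire the token differently.

```python
import os
from typing import Any, Dict


def build_download_kwargs(
    repo_id: str,
    filename: str,
    revision: str = "main",
    repo_type: str = "dataset",
) -> Dict[str, Any]:
    """Assemble keyword arguments for hf_hub_download (illustrative helper)."""
    kwargs: Dict[str, Any] = {
        "repo_id": repo_id,
        "filename": filename,
        "revision": revision,
        "repo_type": repo_type,
    }
    # Pass the token only when HF_TOKEN is set and non-empty.
    token = os.environ.get("HF_TOKEN")
    if token:
        kwargs["token"] = token
    return kwargs


def fetch_from_hub(repo_id: str, filename: str, **overrides: Any) -> str:
    """Download one file into the local Hub cache and return its path."""
    # Imported lazily so the kwargs helper is usable without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(**build_download_kwargs(repo_id, filename, **overrides))
```

When no token is passed, hf_hub_download falls back to its own default credential lookup, so public files still download without HF_TOKEN.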
Usage Examples
Pre-caching Test Data
# Run before test execution in CI
python utils/fetch_hub_objects_for_ci.py
# Typically called in CI pipeline setup step
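Once the data is cached, test code needs a way to prefer the local copy over the network. The sketch below shows one possible lookup, assuming files are stored under their URL basename in a known directory and that the original URL is returned when no cached copy exists. The name cached_path_or_url, the cache_dir parameter, and the fallback behaviour are assumptions for illustration, not the behaviour of url_to_local_path.

```python
import os
from urllib.parse import urlparse


def cached_path_or_url(url: str, cache_dir: str = ".") -> str:
    """Return the cached file for `url` if present, otherwise the URL itself."""
    # Assumption: the cache stores each file under the last path segment of its URL.
    filename = os.path.basename(urlparse(url).path)
    candidate = os.path.join(cache_dir, filename)
    if filename and os.path.isfile(candidate):
        return candidate
    return url
```

Falling back to the URL keeps tests runnable on a developer machine where the pre-caching step was skipped, at the cost of a network request.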