Implementation:Trailofbits Fickling Hf Download Pickle Files
| Knowledge Sources | |
|---|---|
| Domains | Security, Data_Collection, Pickle_Safety |
| Last Updated | 2026-02-14 14:00 GMT |
Overview
Concrete tool for downloading pickle and PyTorch model files from HuggingFace into a local dataset directory with configurable size filters and download modes.
Description
The `hf_download_pickle_files` function reads a JSON file of candidate files (typically produced by `listfiles.py`), checks each file's size with an HTTP HEAD request against the configurable `minsize` and `maxsize` thresholds, and downloads the files that pass. Pickle files (`.pkl`, `.pickle`, `.pk`) are saved directly by `_download_pickle_file`. PyTorch archives (`.pt`, `.pth`, `.bin`) are handled by `_download_torch_file`, which can optionally extract the pickle files embedded in the zip archive. The function supports three download modes: `default` (abort if the dataset already exists), `overwrite` (delete and recreate it), and `add` (append to the existing dataset). It also builds an `index.json` manifest of all downloaded files.
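The extension dispatch and size check described above can be sketched as follows. This is a minimal illustration, not the code in `download.py`: the helper names `classify`, `remote_size`, and `size_ok` are hypothetical, and the use of `urllib`, the case-insensitive extension match, and the inclusive size bounds are assumptions.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.request import Request, urlopen

# Extensions taken from the description above.
PICKLE_EXTS = {".pkl", ".pickle", ".pk"}
TORCH_EXTS = {".pt", ".pth", ".bin"}


def classify(filename: str) -> Optional[str]:
    """Return 'pickle', 'torch', or None based on the file extension.

    Lower-casing the extension is an assumption of this sketch.
    """
    ext = PurePosixPath(filename).suffix.lower()
    if ext in PICKLE_EXTS:
        return "pickle"
    if ext in TORCH_EXTS:
        return "torch"
    return None


def remote_size(url: str, timeout: float = 10.0) -> Optional[int]:
    """Read Content-Length from a HEAD response, so no body is downloaded."""
    with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
        length = resp.headers.get("Content-Length")
    return int(length) if length is not None else None


def size_ok(size: Optional[int], minsize: int = 0, maxsize: int = 500_000_000) -> bool:
    """Accept only files whose size is known and lies in [minsize, maxsize].

    Inclusive bounds are an assumption; the real function may differ.
    """
    return size is not None and minsize <= size <= maxsize
```

Keeping the size check separate from the network call makes the filter easy to test without contacting HuggingFace.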
Usage
Use this module as the dataset construction step of the benchmark pipeline. It enables automated bulk download of real-world pickle files from HuggingFace to create the clean dataset used for testing scanner accuracy. Run as a CLI script or call `hf_download_pickle_files()` programmatically.
Code Reference
Source Location
- Repository: Trailofbits_Fickling
- File: pickle_scanning_benchmark/download.py
- Lines: 1-200
Signature
def hf_download_pickle_files(
    infile: Path,
    outdir: Optional[Path] = None,
    n: int = 10,
    mode: str = "default",
    maxsize: int = 500000000,
    minsize: int = 0,
    extract_pickles: bool = False,
) -> None:
    """
    Download pickle and PyTorch files from HuggingFace.
    Args:
        infile: JSON file with list of candidate files to download.
        outdir: Output directory for downloaded files (default: ./pickle_dataset).
        n: Number of files to download.
        mode: Download mode - 'default', 'overwrite', or 'add'.
        maxsize: Maximum file size in bytes to accept.
        minsize: Minimum file size in bytes to accept.
        extract_pickles: If True, extract pickle files from PyTorch archives.
    """
Import
from pickle_scanning_benchmark.download import hf_download_pickle_files
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| infile | Path | Yes | JSON file listing candidate files with project and filename fields |
| outdir | Path | No | Output directory (default: ./pickle_dataset) |
| n | int | No | Number of files to download (default: 10) |
| mode | str | No | 'default' (abort if exists), 'overwrite', or 'add' (default: 'default') |
| maxsize | int | No | Max file size in bytes (default: 500000000) |
| minsize | int | No | Min file size in bytes (default: 0) |
| extract_pickles | bool | No | Extract pickles from PyTorch archives (default: False) |
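The three `mode` values can be illustrated with a small directory-preparation helper. This is a sketch under stated assumptions: `prepare_outdir` is a hypothetical name, and returning `False` to signal an abort is a choice made here for illustration; the actual function may log, raise, or exit instead.

```python
import shutil
from pathlib import Path


def prepare_outdir(outdir: Path, mode: str = "default") -> bool:
    """Return True if downloading may proceed, False if the run should abort."""
    if mode not in {"default", "overwrite", "add"}:
        raise ValueError(f"unknown mode: {mode!r}")
    if outdir.exists():
        if mode == "default":
            return False  # abort: the dataset already exists
        if mode == "overwrite":
            shutil.rmtree(outdir)  # delete; recreated below
        # mode == "add": keep existing files and append to them
    outdir.mkdir(parents=True, exist_ok=True)
    return True
```

Because `overwrite` removes the whole directory tree, it is safest to point `outdir` at a directory dedicated to the dataset.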
Outputs
| Name | Type | Description |
|---|---|---|
| outdir/ | Directory | Downloaded files saved to output directory |
| outdir/index.json | File | JSON manifest of all downloaded files with metadata (url, file path, type, size) |
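As a rough picture of the manifest, the sketch below writes an `index.json` holding one record per downloaded file. The key names (`url`, `file`, `type`, `size`), the top-level list layout, and the helper `write_index` are all assumptions made for illustration; the table above only states which metadata is recorded, not the exact schema.

```python
import json
from pathlib import Path
from typing import Dict, List


def write_index(outdir: Path, entries: List[Dict[str, object]], append: bool = False) -> Path:
    """Write entries to outdir/index.json, optionally extending an existing manifest."""
    index_path = outdir / "index.json"
    existing: List[Dict[str, object]] = []
    if append and index_path.exists():
        existing = json.loads(index_path.read_text())
    index_path.write_text(json.dumps(existing + entries, indent=2))
    return index_path


# Hypothetical record; field names and values are illustrative only.
example_entry = {
    "url": "https://example.invalid/some-project/model.pkl",
    "file": "model.pkl",
    "type": "pickle",
    "size": 2048,
}
```

Appending to an existing manifest corresponds to the `add` mode, where earlier downloads must stay listed.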
Usage Examples
Command Line Usage
# Download 100 files from candidate list
python pickle_scanning_benchmark/download.py candidates.json ./dataset 100
# Add files to existing dataset, filtering by size
python pickle_scanning_benchmark/download.py candidates.json ./dataset 50 \
    --mode add --maxsize 10000000 --minsize 1000
# Download and extract pickle files from PyTorch archives
python pickle_scanning_benchmark/download.py candidates.json ./dataset 100 \
    --extract-pickles
Programmatic Usage
from pathlib import Path
from pickle_scanning_benchmark.download import hf_download_pickle_files
# Download 50 files with size limits
hf_download_pickle_files(
    infile=Path("candidates.json"),
    outdir=Path("/data/clean_pickles"),
    n=50,
    mode="default",
    maxsize=10_000_000,
    minsize=1000,
    extract_pickles=True,
)
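When `extract_pickles=True`, pickle files embedded in PyTorch zip archives are pulled out. The sketch below shows one way this could work using only the standard library; `extract_pickles_from_archive` and the output naming scheme are hypothetical, and the real `_download_torch_file` may behave differently. It relies on the fact that checkpoints saved in PyTorch's zip format store the object graph in a member ending in `data.pkl`.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_pickles_from_archive(archive: Path, outdir: Path) -> List[Path]:
    """Copy every .pkl member of a zip-format checkpoint into outdir.

    The members are copied as raw bytes and never unpickled.
    """
    extracted: List[Path] = []
    if not zipfile.is_zipfile(archive):
        return extracted  # legacy or non-zip files are skipped in this sketch
    with zipfile.ZipFile(archive) as zf:
        for i, name in enumerate(zf.namelist()):
            if name.endswith(".pkl"):
                # Build the target name ourselves instead of trusting the
                # member path, which avoids path-traversal issues.
                target = outdir / f"{archive.stem}_{i}.pkl"
                target.write_bytes(zf.read(name))
                extracted.append(target)
    return extracted
```

Reading members with `ZipFile.read` rather than unpacking the whole archive keeps the tensor data out of the dataset and means untrusted member paths are never written to disk.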