Implementation:Trailofbits Fickling Hf Download Pickle Files

Knowledge Sources	Trailofbits_Fickling
Domains	Security, Data_Collection, Pickle_Safety
Last Updated	2026-02-14 14:00 GMT

Overview

Concrete tool for downloading pickle and PyTorch model files from HuggingFace into a local dataset directory with configurable size filters and download modes.

Description

The `hf_download_pickle_files` function reads a JSON file of candidate files (typically produced by `listfiles.py`), checks each file's size with an HTTP HEAD request against the configurable `minsize` and `maxsize` thresholds, and downloads the files that pass. Pickle files (`.pkl`, `.pickle`, `.pk`) are saved directly by `_download_pickle_file`. PyTorch archives (`.pt`, `.pth`, `.bin`) are handled by `_download_torch_file`, which can optionally extract the pickle files embedded in the zip archive. Three download modes are supported: `default` (abort if the dataset already exists), `overwrite` (delete and recreate it), and `add` (append to the existing dataset). The function also builds an `index.json` manifest of all downloaded files.
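The extension-based dispatch and the size filter described above can be illustrated with two small pure functions. This is a minimal sketch, not the code in `download.py`: the helper names `classify` and `size_ok` are hypothetical, and the case-insensitive extension match, the inclusive bounds, and the decision to reject files with an unknown size are all assumptions.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extension groups taken from the description above.
PICKLE_EXTS = {".pkl", ".pickle", ".pk"}
TORCH_EXTS = {".pt", ".pth", ".bin"}


def classify(filename: str) -> Optional[str]:
    """Return 'pickle', 'torch', or None for an unsupported extension."""
    # Assumption: extensions are compared case-insensitively.
    ext = PurePosixPath(filename).suffix.lower()
    if ext in PICKLE_EXTS:
        return "pickle"
    if ext in TORCH_EXTS:
        return "torch"
    return None


def size_ok(content_length: Optional[str],
            minsize: int = 0,
            maxsize: int = 500_000_000) -> bool:
    """Check a Content-Length header value (as returned by a HEAD request)."""
    # Assumption: a missing or malformed size means the file is skipped.
    if content_length is None or not content_length.isdigit():
        return False
    # Assumption: both bounds are inclusive.
    return minsize <= int(content_length) <= maxsize
```

Keeping the checks free of network calls makes them easy to test; the caller would pass in the `Content-Length` value obtained from the HEAD response.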

Usage

Use this module as the dataset construction step of the benchmark pipeline. It enables automated bulk download of real-world pickle files from HuggingFace to create the clean dataset used for testing scanner accuracy. Run as a CLI script or call `hf_download_pickle_files()` programmatically.

Code Reference

Source Location

Repository: Trailofbits_Fickling
File: pickle_scanning_benchmark/download.py
Lines: 1-200

Signature

def hf_download_pickle_files(
    infile: Path,
    outdir: Optional[Path] = None,
    n: int = 10,
    mode: str = "default",
    maxsize: int = 500000000,
    minsize: int = 0,
    extract_pickles: bool = False,
) -> None:
    """
    Download pickle and PyTorch files from HuggingFace.

    Args:
        infile: JSON file with list of candidate files to download.
        outdir: Output directory for downloaded files (default: ./pickle_dataset).
        n: Number of files to download.
        mode: Download mode - 'default', 'overwrite', or 'add'.
        maxsize: Maximum file size in bytes to accept.
        minsize: Minimum file size in bytes to accept.
        extract_pickles: If True, extract pickle files from PyTorch archives.
    """

Import

from pickle_scanning_benchmark.download import hf_download_pickle_files

I/O Contract

Inputs

Name	Type	Required	Description
infile	Path	Yes	JSON file listing candidate files with project and filename fields
outdir	Path	No	Output directory (default: ./pickle_dataset)
n	int	No	Number of files to download (default: 10)
mode	str	No	'default' (abort if exists), 'overwrite', or 'add' (default: 'default')
maxsize	int	No	Max file size in bytes (default: 500000000)
minsize	int	No	Min file size in bytes (default: 0)
extract_pickles	bool	No	Extract pickles from PyTorch archives (default: False)

Outputs

Name	Type	Description
outdir/	Directory	Downloaded files saved to output directory
outdir/index.json	File	JSON manifest of all downloaded files with metadata (url, file path, type, size)

Usage Examples

Command Line Usage

# Download 100 files from candidate list
python pickle_scanning_benchmark/download.py candidates.json ./dataset 100

# Add files to existing dataset, filtering by size
python pickle_scanning_benchmark/download.py candidates.json ./dataset 50 \
    --mode add --maxsize 10000000 --minsize 1000

# Download and extract pickle files from PyTorch archives
python pickle_scanning_benchmark/download.py candidates.json ./dataset 100 \
    --extract-pickles

Programmatic Usage

from pathlib import Path
from pickle_scanning_benchmark.download import hf_download_pickle_files

# Download 50 files with size limits
hf_download_pickle_files(
    infile=Path("candidates.json"),
    outdir=Path("/data/clean_pickles"),
    n=50,
    mode="default",
    maxsize=10_000_000,
    minsize=1000,
    extract_pickles=True,
)

Related Pages

Implements Principle

Principle:Trailofbits_Fickling_Benchmark_Dataset_Construction

Requires Environment

Environment:Trailofbits_Fickling_Python_Runtime
