Implementation:Trailofbits Fickling Get Stats

Knowledge Sources	Trailofbits_Fickling
Domains	Security, Data_Analysis, Pickle_Safety
Last Updated	2026-02-14 14:00 GMT

Overview

Concrete tool for computing and exporting statistics about a downloaded pickle file dataset, including import frequency counts and HuggingFace download numbers.

Description

The get_stats function iterates over files listed in a dataset's `index.json` and builds a Stats dataclass that aggregates information about the dataset's composition. For each file, it uses fickling's Pickled and PyTorchModelWrapper to extract Python import statements embedded in the pickle bytecode, and queries the HuggingFace API for per-project download counts. Results are stored in dictionaries and can be exported to CSV files (`imports.csv`, `downloads.csv`).
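
As a rough illustration of the import-counting step, the sketch below tallies the globals referenced by a pickle stream using only the standard library's pickletools. It is a stand-in rather than the module's actual logic: the real code relies on fickling's Pickled and PyTorchModelWrapper, and the helper names iter_imports and count_imports are hypothetical. The STACK_GLOBAL branch is a heuristic that assumes the module and attribute names are the two most recently pushed strings.

```python
import pickle
import pickletools
from collections import Counter, OrderedDict
from typing import Iterable, Iterator

# Opcodes that push a text string onto the pickle stack.
_STRING_OPS = {"UNICODE", "BINUNICODE", "SHORT_BINUNICODE", "BINUNICODE8"}


def iter_imports(data: bytes) -> Iterator[str]:
    """Yield "module.name" for each global referenced in a pickle stream.

    The stream is only disassembled; nothing is unpickled or executed.
    Hypothetical helper, not part of fickling.
    """
    recent = []  # the two most recently pushed strings
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Protocols 0-3: arg is "module name", separated by one space.
            module, name = arg.split(" ", 1)
            yield f"{module}.{name}"
        elif opcode.name in _STRING_OPS:
            recent = (recent + [arg])[-2:]
        elif opcode.name == "STACK_GLOBAL" and len(recent) == 2:
            # Protocol 4+: heuristic, assumes module and name were just pushed.
            yield f"{recent[0]}.{recent[1]}"


def count_imports(pickles: Iterable[bytes]) -> Counter:
    """Aggregate import frequencies over several pickle byte strings."""
    counts = Counter()
    for data in pickles:
        counts.update(iter_imports(data))
    return counts


if __name__ == "__main__":
    sample = pickle.dumps(OrderedDict(a=1), protocol=2)
    print(count_imports([sample]).most_common())
```

A Counter keeps the aggregation step to a single update call per file. A full analysis such as fickling's handles memoized strings and PyTorch archive layouts that this heuristic does not.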

Usage

Use this module to analyze the composition of a benchmark pickle dataset. It provides insight into which Python modules are commonly imported in real-world pickle files and how popular the source models are on HuggingFace. Run as a CLI script with the dataset directory path.
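
The per-project download lookup can be sketched as follows. This is a minimal sketch under stated assumptions: the page only says the module queries the HuggingFace API, so the use of the huggingface_hub client here is an assumption, and hf_model_downloads and collect_downloads are hypothetical names. The fetch function is injectable so the aggregation logic can be exercised offline.

```python
from typing import Callable, Dict, Iterable


def hf_model_downloads(repo_id: str) -> int:
    """Return the download count the Hub reports for one model repo.

    Assumes the optional huggingface_hub package is installed and the
    network is reachable; imported lazily so the rest runs without it.
    """
    from huggingface_hub import HfApi

    info = HfApi().model_info(repo_id)
    return info.downloads or 0


def collect_downloads(
    repo_ids: Iterable[str],
    fetch: Callable[[str], int] = hf_model_downloads,
) -> Dict[str, int]:
    """Map each distinct repo id to its download count, one query per repo."""
    downloads: Dict[str, int] = {}
    for repo_id in repo_ids:
        if repo_id not in downloads:
            downloads[repo_id] = fetch(repo_id)
    return downloads


if __name__ == "__main__":
    # Offline demonstration: a dict lookup stands in for the real API call.
    fake_counts = {"org/model-a": 120, "org/model-b": 30}
    result = collect_downloads(
        ["org/model-a", "org/model-b", "org/model-a"],
        fetch=fake_counts.__getitem__,
    )
    print(result, sum(result.values()))
```

Deduplicating by repo id matters because a dataset typically contains several files from the same project, and each project only needs one query.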

Code Reference

Source Location

Repository: Trailofbits_Fickling
File: pickle_scanning_benchmark/dataset_stats.py
Lines: 1-101

Signature

def get_stats(dataset_dir: Path) -> Stats:
    """
    Compute statistics for all files in a pickle dataset.

    Args:
        dataset_dir: Path to dataset directory containing index.json.

    Returns:
        Stats dataclass with file counts, import frequencies, and project downloads.
    """

@dataclass
class Stats:
    nb_files: int = 0
    nb_downloads: int = 0
    projects: dict = field(default_factory=dict)
    file_types: dict = field(default_factory=dict)
    imports: dict = field(default_factory=dict)

    def add(self, file: dict) -> None: ...
    def finalise(self) -> None: ...
    def dump_imports(self) -> None: ...
    def dump_project_downloads(self) -> None: ...

Import

from pickle_scanning_benchmark.dataset_stats import get_stats, Stats

I/O Contract

Inputs

Name	Type	Required	Description
dataset_dir	Path	Yes	Directory containing index.json with file metadata

Outputs

Name	Type	Description
Stats	dataclass	Contains nb_files, nb_downloads, projects (per-project download counts), file_types, and imports (import frequency dict)
imports.csv	File	CSV of import statement frequencies (written by dump_imports)
downloads.csv	File	CSV of per-project download counts (written by dump_project_downloads)
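
The two CSV outputs are both name-to-count mappings, so an export step along the following lines would cover either one. This is a sketch only: the column headers, the sort order, and the helper name write_counts_csv are assumptions, and the actual files written by dump_imports and dump_project_downloads may be laid out differently.

```python
import csv
import tempfile
from pathlib import Path
from typing import Mapping, Sequence


def write_counts_csv(
    counts: Mapping[str, int], path: Path, header: Sequence[str]
) -> None:
    """Write a name -> count mapping as a two-column CSV, largest count first.

    Ties are broken alphabetically so the output is deterministic.
    """
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "imports.csv"
        # Illustrative counts only, not measured data.
        write_counts_csv(
            {"collections.OrderedDict": 3, "torch._utils._rebuild_tensor_v2": 2},
            out,
            ("import", "count"),
        )
        print(out.read_text(encoding="utf-8"))
```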

Usage Examples

Command Line Usage

python pickle_scanning_benchmark/dataset_stats.py /path/to/dataset

Programmatic Usage

from pathlib import Path
from pickle_scanning_benchmark.dataset_stats import get_stats

# Compute stats for a dataset
stats = get_stats(Path("/data/clean_pickles"))

# Export results
stats.dump_imports()        # Writes imports.csv
stats.dump_project_downloads()  # Writes downloads.csv

# Print summary
print(f"Total files: {stats.nb_files}")
print(f"Total projects: {len(stats.projects)}")
print(f"Total downloads: {sum(stats.projects.values())}")
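# Sort explicitly by count so the result does not depend on dict ordering
top_imports = sorted(stats.imports.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(f"Most common imports: {dict(top_imports)}")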

Related Pages

Implements Principle

Principle:Trailofbits_Fickling_Pickle_Dataset_Analysis

Requires Environment

Environment:Trailofbits_Fickling_Python_Runtime
