Implementation:Trailofbits Fickling Get Stats
| Knowledge Sources | |
|---|---|
| Domains | Security, Data_Analysis, Pickle_Safety |
| Last Updated | 2026-02-14 14:00 GMT |
Overview
Concrete tool for computing and exporting statistics about a downloaded pickle file dataset, including import frequency counts and HuggingFace download numbers.
Description
The `get_stats` function iterates over the files listed in a dataset's `index.json` and builds a `Stats` dataclass that aggregates information about the dataset's composition. For each file, it uses fickling's `Pickled` and `PyTorchModelWrapper` to extract the Python imports embedded in the pickle bytecode, and it queries the HuggingFace API for per-project download counts. Results are stored in dictionaries and can be exported to CSV files (`imports.csv`, `downloads.csv`).
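The import-counting step can be illustrated without fickling. The sketch below is a stand-in, not the method `get_stats` uses: it reads opcodes with the standard-library `pickletools.genops`, never unpickles the data, and tallies the referenced globals in a `Counter`. The helper names `list_imports` and `count_imports` are hypothetical, and the `STACK_GLOBAL` handling is a heuristic that only covers the simple case where the module and attribute strings are pushed right before the opcode.

```python
import pickle
import pickletools
from collections import Counter
from fractions import Fraction


def list_imports(data: bytes) -> list[str]:
    """Return 'module.name' for each global referenced in a pickle stream.

    Hypothetical helper: it only reads opcodes and never executes the pickle.
    """
    imports = []
    strings = []  # string arguments seen so far, used to resolve STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Protocols 0-3: the argument is "module name" joined by a space.
            module, name = arg.split(" ", 1)
            imports.append(f"{module}.{name}")
        elif opcode.name == "STACK_GLOBAL" and len(strings) >= 2:
            # Protocol 4+: heuristic, take the two most recent string pushes.
            imports.append(f"{strings[-2]}.{strings[-1]}")
        elif isinstance(arg, str):
            strings.append(arg)
    return imports


def count_imports(blobs: list[bytes]) -> Counter:
    """Aggregate import frequencies over several pickle byte strings."""
    counts = Counter()
    for blob in blobs:
        counts.update(list_imports(blob))
    return counts


# Two small demo pickles, one per opcode style (GLOBAL vs STACK_GLOBAL).
blobs = [
    pickle.dumps(Fraction(1, 2), protocol=2),
    pickle.dumps(Fraction(3, 4), protocol=4),
]
counts = count_imports(blobs)
print(counts.most_common(1))
```

A `Counter` keeps the aggregation to a single `update` call per file and offers `most_common` for ranking. A scan of real model files would need a more robust resolver than this heuristic, which is the role fickling plays in the actual module.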
Usage
Use this module to analyze the composition of a benchmark pickle dataset. It shows which Python modules are commonly imported in real-world pickle files and how popular the source models are on HuggingFace. Run it as a CLI script, passing the dataset directory path as an argument.
Code Reference
Source Location
- Repository: Trailofbits_Fickling
- File: pickle_scanning_benchmark/dataset_stats.py
- Lines: 1-101
Signature
def get_stats(dataset_dir: Path) -> Stats:
    """
    Compute statistics for all files in a pickle dataset.

    Args:
        dataset_dir: Path to dataset directory containing index.json.

    Returns:
        Stats dataclass with file counts, import frequencies, and project downloads.
    """

@dataclass
class Stats:
    nb_files: int = 0
    nb_downloads: int = 0
    projects: dict = field(default_factory=dict)
    file_types: dict = field(default_factory=dict)
    imports: dict = field(default_factory=dict)

    def add(self, file: dict) -> None: ...
    def finalise(self) -> None: ...
    def dump_imports(self) -> None: ...
    def dump_project_downloads(self) -> None: ...
Import
from pickle_scanning_benchmark.dataset_stats import get_stats, Stats
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_dir | Path | Yes | Directory containing index.json with file metadata |
Outputs
| Name | Type | Description |
|---|---|---|
| Stats | dataclass | Contains nb_files, nb_downloads, projects (with download counts), file_types, and the imports frequency dict |
| imports.csv | File | CSV of import statement frequencies (written by dump_imports) |
| downloads.csv | File | CSV of per-project download counts (written by dump_project_downloads) |
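As an illustration of the export step, the sketch below writes a frequency dictionary to a two-column CSV. The function name `write_counts_csv`, the column headers, and the descending sort are assumptions made for this example; the actual layout produced by `dump_imports` and `dump_project_downloads` is not documented here.

```python
import csv
import tempfile
from pathlib import Path


def write_counts_csv(counts: dict, path: Path, key_header: str, value_header: str) -> None:
    """Write a {name: count} mapping to CSV, highest count first (hypothetical helper)."""
    rows = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([key_header, value_header])
        writer.writerows(rows)


# Made-up example data, not taken from a real dataset.
example_imports = {"collections.OrderedDict": 3, "torch._utils._rebuild_tensor_v2": 7}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "imports.csv"
    write_counts_csv(example_imports, out_path, "import", "count")
    written = out_path.read_text(encoding="utf-8").splitlines()

print(written)
```

Sorting before writing keeps the most frequent entries at the top of the file, which is convenient when the CSV is inspected by hand.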
Usage Examples
Command Line Usage
python pickle_scanning_benchmark/dataset_stats.py /path/to/dataset
Programmatic Usage
from pathlib import Path
from pickle_scanning_benchmark.dataset_stats import get_stats
# Compute stats for a dataset
stats = get_stats(Path("/data/clean_pickles"))
# Export results
stats.dump_imports() # Writes imports.csv
stats.dump_project_downloads() # Writes downloads.csv
# Print summary
print(f"Total files: {stats.nb_files}")
print(f"Total projects: {len(stats.projects)}")
print(f"Total downloads: {sum(stats.projects.values())}")
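# Rank explicitly instead of relying on dict order (assumes the values are counts)
top_imports = sorted(stats.imports.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(f"Most common imports: {dict(top_imports)}")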