Implementation:Trailofbits Fickling Get Stats

From Leeroopedia
Revision as of 13:57, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Trailofbits_Fickling_Get_Stats.md)
Knowledge Sources
Domains: Security, Data_Analysis, Pickle_Safety
Last Updated: 2026-02-14 14:00 GMT

Overview

Concrete tool for computing and exporting statistics about a downloaded pickle file dataset, including import frequency counts and HuggingFace download numbers.

Description

The get_stats function iterates over files listed in a dataset's `index.json` and builds a Stats dataclass that aggregates information about the dataset's composition. For each file, it uses fickling's Pickled and PyTorchModelWrapper to extract Python import statements embedded in the pickle bytecode, and queries the HuggingFace API for per-project download counts. Results are stored in dictionaries and can be exported to CSV files (`imports.csv`, `downloads.csv`).
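To make the import-extraction and counting step concrete, here is a minimal sketch that uses only the standard library's `pickletools` in place of fickling's Pickled and PyTorchModelWrapper. It is a stand-in heuristic, not the code in `dataset_stats.py`: the names `extract_imports` and `count_imports` are hypothetical, and the STACK_GLOBAL handling assumes the module and attribute names are the two most recently pushed strings, which holds for output of the standard pickler but not necessarily for hand-crafted pickles.

```python
import pickle
import pickletools
from collections import Counter, OrderedDict

# Opcodes that push a string onto the pickle VM stack.
_STRING_OPS = {
    "STRING", "BINSTRING", "SHORT_BINSTRING",
    "UNICODE", "BINUNICODE", "SHORT_BINUNICODE", "BINUNICODE8",
}


def extract_imports(data: bytes) -> list[str]:
    """Return dotted names referenced by GLOBAL / STACK_GLOBAL opcodes.

    Hypothetical helper: the pickle is only disassembled, never executed.
    """
    found = []
    recent_strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in _STRING_OPS:
            recent_strings.append(arg)
        elif opcode.name == "GLOBAL":
            # pickletools reports the argument as "module name".
            module, name = arg.split(" ", 1)
            found.append(f"{module}.{name}")
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Heuristic: assume the last two pushed strings are module and name.
            module, name = recent_strings[-2:]
            found.append(f"{module}.{name}")
    return found


def count_imports(pickles: list[bytes]) -> Counter:
    """Aggregate import frequencies across several pickle payloads."""
    counts = Counter()
    for data in pickles:
        counts.update(extract_imports(data))
    return counts


samples = [
    pickle.dumps(OrderedDict(), protocol=4),  # uses STACK_GLOBAL
    b"cos\nsystem\n.",                        # hand-written GLOBAL, disassembled only
]
import_counts = count_imports(samples)
```

Disassembling rather than unpickling matters here because the dataset may contain untrusted files; a real analysis would likely prefer fickling's AST-based view, which also resolves memoized strings that this heuristic ignores.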

Usage

Use this module to analyze the composition of a benchmark pickle dataset. It provides insight into which Python modules are commonly imported in real-world pickle files and how popular the source models are on HuggingFace. Run as a CLI script with the dataset directory path.
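As an illustration of the CLI entry point described above, the following sketch parses a single positional dataset directory and checks that it contains `index.json` before any work starts. The function name `parse_dataset_dir` and the error message are assumptions made for this example; the actual argument handling in `dataset_stats.py` may differ.

```python
import argparse
import tempfile
from pathlib import Path


def parse_dataset_dir(argv: list[str]) -> Path:
    """Parse the single positional argument and check that index.json exists."""
    parser = argparse.ArgumentParser(description="Compute pickle dataset statistics.")
    parser.add_argument("dataset_dir", type=Path, help="directory containing index.json")
    args = parser.parse_args(argv)
    if not (args.dataset_dir / "index.json").is_file():
        # parser.error prints a usage message and exits with status 2.
        parser.error(f"no index.json found in {args.dataset_dir}")
    return args.dataset_dir


# Demo with a throwaway directory; the empty JSON body is only a placeholder,
# since the real index.json schema is not described on this page.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "index.json").write_text("{}")
    dataset_dir = parse_dataset_dir([tmp])
```

Failing early on a missing `index.json` avoids issuing HuggingFace API requests for a directory that cannot be processed.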

Code Reference

Source Location

Signature

def get_stats(dataset_dir: Path) -> Stats:
    """
    Compute statistics for all files in a pickle dataset.

    Args:
        dataset_dir: Path to dataset directory containing index.json.

    Returns:
        Stats dataclass with file counts, import frequencies, and project downloads.
    """

@dataclass
class Stats:
    nb_files: int = 0
    nb_downloads: int = 0
    projects: dict = field(default_factory=dict)
    file_types: dict = field(default_factory=dict)
    imports: dict = field(default_factory=dict)

    def add(self, file: dict) -> None: ...
    def finalise(self) -> None: ...
    def dump_imports(self) -> None: ...
    def dump_project_downloads(self) -> None: ...

Import

from pickle_scanning_benchmark.dataset_stats import get_stats, Stats

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| dataset_dir | Path | Yes | Directory containing index.json with file metadata |

Outputs

| Name | Type | Description |
|------|------|-------------|
| Stats | dataclass | Contains nb_files, nb_downloads, projects (with download counts), file_types, and the imports frequency dict |
| imports.csv | File | CSV of import statement frequencies (written by dump_imports) |
| downloads.csv | File | CSV of per-project download counts (written by dump_project_downloads) |
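The CSV outputs can be pictured with a short sketch of writing a name-to-count mapping to disk. This is not the body of dump_imports or dump_project_downloads: the helper `dump_counts`, the column headers, the descending sort order, and the sample data are all assumptions chosen for illustration.

```python
import csv
import tempfile
from pathlib import Path


def dump_counts(counts: dict, path: Path, header: tuple) -> None:
    """Write a name -> count mapping as a two-column CSV, most frequent first."""
    rows = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


# Made-up sample data, for illustration only.
sample_imports = {"collections.OrderedDict": 3, "torch._utils._rebuild_tensor_v2": 7}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "imports.csv"
    dump_counts(sample_imports, out_path, ("import", "count"))
    # Read the file back to show the resulting layout.
    with out_path.open(newline="") as handle:
        written_rows = list(csv.reader(handle))
```

Opening the file with `newline=""` follows the `csv` module's recommendation and prevents blank lines between rows on Windows.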

Usage Examples

Command Line Usage

python pickle_scanning_benchmark/dataset_stats.py /path/to/dataset

Programmatic Usage

from pathlib import Path
from pickle_scanning_benchmark.dataset_stats import get_stats

# Compute stats for a dataset
stats = get_stats(Path("/data/clean_pickles"))

# Export results
stats.dump_imports()        # Writes imports.csv
stats.dump_project_downloads()  # Writes downloads.csv

# Print summary
print(f"Total files: {stats.nb_files}")
print(f"Total projects: {len(stats.projects)}")
print(f"Total downloads: {sum(stats.projects.values())}")
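# Rank by count explicitly instead of relying on the dict's insertion order
top_imports = sorted(stats.imports.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(f"Most common imports: {dict(top_imports)}")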

Related Pages

Implements Principle

Requires Environment
