Implementation: Trailofbits Fickling Run Benchmark
| Knowledge Sources | |
|---|---|
| Domains | Security, Benchmarking, Pickle_Safety |
| Last Updated | 2026-02-14 14:00 GMT |
Overview
Concrete tool for running comparative accuracy benchmarks of multiple pickle scanning tools against clean and malicious file datasets.
Description
The run_benchmark function is the core evaluation harness in the fickling benchmark suite. It loads clean and malicious pickle file indexes, randomly samples files at a configurable ratio, then runs each registered scanning tool on every sampled file. Results are tracked in ToolResults and BenchmarkResults dataclasses that record true positives, true negatives, false positives, false negatives, per-payload-type miss statistics, and scan failure counts. The module also includes wrapper functions for four scanning tools: Fickling, Modelscan, Picklescan, and Model Unpickler.
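The result-tracking described above can be pictured with a minimal sketch. The field and method names below are assumptions for illustration; the real dataclasses live in `pickle_scanning_benchmark/benchmark.py`. Note the return convention the suite uses: a tool callable returns True when it judges a file safe, so a malicious file the tool passes is a false negative.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolResults:
    # Hypothetical field names -- the real dataclass is defined in benchmark.py.
    name: str
    true_positives: int = 0
    true_negatives: int = 0
    false_positives: int = 0
    false_negatives: int = 0
    scan_failures: int = 0
    # Misses broken down by malicious payload type.
    missed_by_payload_type: Counter = field(default_factory=Counter)

    def record(self, is_malicious: bool, tool_says_safe: bool,
               payload_type: Optional[str] = None) -> None:
        # Tools return True when a file is judged safe, so a malicious
        # file the tool passes as safe counts as a false negative.
        if is_malicious and not tool_says_safe:
            self.true_positives += 1
        elif is_malicious and tool_says_safe:
            self.false_negatives += 1
            if payload_type is not None:
                self.missed_by_payload_type[payload_type] += 1
        elif not is_malicious and not tool_says_safe:
            self.false_positives += 1
        else:
            self.true_negatives += 1
```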
Usage
Use this module when you need to quantitatively compare pickle security scanning tools on realistic datasets. It is invoked as a CLI script with paths to clean and malicious dataset directories, or imported to call `run_benchmark()` programmatically with custom tool registrations.
Code Reference
Source Location
- Repository: Trailofbits_Fickling
- File: pickle_scanning_benchmark/benchmark.py
- Lines: 1-323
Signature
def run_benchmark(
clean_dataset_dir: Path,
malicious_dataset_dir: Path,
tools: dict,
n: int = 10000,
clean_to_malicious_ratio: float = 2.0,
) -> None:
"""
Run benchmark comparing scanning tools on clean and malicious datasets.
Args:
clean_dataset_dir: Path to directory containing clean file index.json.
malicious_dataset_dir: Path to directory containing malicious file index.json.
tools: Dict mapping tool names to callable run functions (signature: func(filepath, filetype) -> bool).
n: Total number of files to sample for the benchmark.
clean_to_malicious_ratio: Ratio of clean to malicious files in the sample.
"""
Import
from pickle_scanning_benchmark.benchmark import run_benchmark, BenchmarkResults, ToolResults
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| clean_dataset_dir | Path | Yes | Directory with clean dataset index.json |
| malicious_dataset_dir | Path | Yes | Directory with malicious dataset index.json |
| tools | dict | Yes | Map of tool name to callable `func(filepath, filetype) -> bool`; a tool returns True when it judges the file safe |
| n | int | No | Total files to sample (default: 10000) |
| clean_to_malicious_ratio | float | No | Ratio of clean to malicious files (default: 2.0) |
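To make the interaction of `n` and `clean_to_malicious_ratio` concrete, here is a hedged sketch of how a ratio r splits a sample of n files into roughly n*r/(r+1) clean and n/(r+1) malicious files. The function name and seed parameter are illustrative, not part of the benchmark's API.

```python
import random

def split_sample(clean_files, malicious_files, n=10000,
                 clean_to_malicious_ratio=2.0, seed=None):
    # With ratio r, a sample of n files holds roughly n*r/(r+1) clean
    # files and n/(r+1) malicious files. For n=10000 and r=2.0 that is
    # about 6667 clean and 3333 malicious.
    n_malicious = round(n / (clean_to_malicious_ratio + 1))
    n_clean = n - n_malicious
    rng = random.Random(seed)
    clean = rng.sample(clean_files, min(n_clean, len(clean_files)))
    malicious = rng.sample(malicious_files, min(n_malicious, len(malicious_files)))
    return clean, malicious
```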
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | str | Formatted benchmark results printed to console |
| BenchmarkResults | dataclass | Built internally during the run (the function itself returns None); holds per-tool ToolResults with TP/TN/FP/FN counts and payload-type miss stats |
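The four confusion-matrix counts recorded per tool are enough to derive the usual accuracy metrics. This helper is an illustration of how to consume the counts, not part of the benchmark's printed output.

```python
def summarize(tp: int, tn: int, fp: int, fn: int) -> dict:
    # Derive standard metrics from the per-tool TP/TN/FP/FN counts.
    # Guard each denominator so a tool with no flagged files does not
    # raise ZeroDivisionError.
    total = tp + tn + fp + fn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```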
Usage Examples
Running from Command Line
python pickle_scanning_benchmark/benchmark.py /path/to/clean_dataset /path/to/malicious_dataset
Programmatic Usage
from pathlib import Path
from pickle_scanning_benchmark.benchmark import run_benchmark, run_fickling
# Define tools to benchmark
tools = {
"Fickling": run_fickling,
}
# Run benchmark
run_benchmark(
clean_dataset_dir=Path("/data/clean_pickles"),
malicious_dataset_dir=Path("/data/malicious_pickles"),
tools=tools,
n=1000,
clean_to_malicious_ratio=2.0,
)
Registering a Custom Scanner
def run_custom_scanner(filepath: str, filetype: str) -> bool:
"""Return True if file is considered safe, False otherwise."""
# Custom scanning logic here
return True
tools = {
"Fickling": run_fickling,
"CustomScanner": run_custom_scanner,
}
run_benchmark(Path("clean"), Path("malicious"), tools)
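A custom scanner can be more substantial than the stub above. The sketch below uses the standard library's `pickletools.genops` to flag pickles containing opcodes that can import or call objects; the opcode blocklist is an assumption chosen for illustration, not the detection logic of any of the four benchmarked tools.

```python
import pickletools

def run_opcode_scanner(filepath: str, filetype: str) -> bool:
    """Return True if the file looks safe: no import/call opcodes found."""
    # Opcodes that resolve globals or invoke callables during unpickling.
    # This blocklist is illustrative, not exhaustive.
    suspicious = {"GLOBAL", "STACK_GLOBAL", "INST", "OBJ",
                  "NEWOBJ", "NEWOBJ_EX", "REDUCE"}
    try:
        with open(filepath, "rb") as f:
            for opcode, _arg, _pos in pickletools.genops(f.read()):
                if opcode.name in suspicious:
                    return False
    except Exception:
        return False  # treat unparseable files as unsafe
    return True
```

Registered like any other tool, e.g. `tools["OpcodeScanner"] = run_opcode_scanner`, it is then sampled and scored alongside the built-in wrappers.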