Principle:Trailofbits Fickling Pickle Dataset Analysis
| Knowledge Sources | |
|---|---|
| Domains | Security, Data_Analysis, Pickle_Safety |
| Last Updated | 2026-02-14 14:00 GMT |
Overview
Analytical methodology for extracting and aggregating statistical properties from collections of pickle files to characterize real-world pickle usage patterns.
Description
Pickle Dataset Analysis inspects a corpus of pickle files to extract metadata about their contents, in particular the imports implied by each file's pickle bytecode. Decompiling a file's opcodes into an AST representation reveals which modules and names the pickle references (e.g., `torch.nn.Linear`, `numpy.array`). Combined with external metadata such as HuggingFace download counts for source models, these import frequencies show how pickle files are used in practice across the ML ecosystem. The resulting data informs allowlist construction and helps researchers distinguish benign import patterns from suspicious ones.
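One possible way to combine import data with download counts is to weight each import by the total downloads of the models that reference it. The sketch below is an assumption about how such a join could look, not a description of the actual analysis: the function name `weight_by_downloads`, the model identifiers, and the download figures are all hypothetical.

```python
from collections import defaultdict


def weight_by_downloads(imports_by_model, downloads_by_model):
    """Sum the download counts of every model that references each import.

    Both arguments are hypothetical inputs: a mapping of model id to the
    imports found in its pickle files, and a mapping of model id to downloads.
    """
    weighted = defaultdict(int)
    for model, imports in imports_by_model.items():
        downloads = downloads_by_model.get(model, 0)  # unknown models count as 0
        for imp in set(imports):  # count each import once per model
            weighted[imp] += downloads
    return dict(weighted)


# Made-up illustrative data, not real measurements.
imports_by_model = {
    "example-org/model-a": ["collections.OrderedDict", "torch._utils._rebuild_tensor_v2"],
    "example-org/model-b": ["collections.OrderedDict"],
}
downloads_by_model = {"example-org/model-a": 1000, "example-org/model-b": 50}

weighted = weight_by_downloads(imports_by_model, downloads_by_model)
print(weighted)
```

Weighting by downloads rather than by raw file count would favor imports that appear in widely used models, which may matter more when deciding what an allowlist should cover.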
Usage
Apply this principle when building or validating an ML pickle allowlist, characterizing the attack surface of pickle-based model distribution, or understanding the composition of a benchmark dataset before running scanner evaluations.
Theoretical Basis
The analysis follows an extract-aggregate-export pipeline:
    # Abstract algorithm (decompile, extract_imports, and export_csv are placeholders)
    stats = {}
    for file in dataset:
        pickled = decompile(file)           # Parse pickle bytecode to AST
        imports = extract_imports(pickled)  # Extract import statements
        for imp in imports:
            stats[imp] = stats.get(imp, 0) + 1  # Aggregate frequency
    # Sort by frequency (descending) and export
    ranked = sorted(stats.items(), key=lambda item: item[1], reverse=True)
    export_csv(ranked)
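The aggregate and export steps of the pipeline can be sketched in runnable form with the standard library alone. This is a minimal illustration that assumes the imports have already been extracted per file; the helper names, the sample file names, and the two-column CSV layout are assumptions rather than the format the analysis actually uses.

```python
import csv
import io
from collections import Counter


def aggregate_imports(per_file_imports):
    """Count how often each import occurs across all files, most frequent first."""
    counts = Counter()
    for imports in per_file_imports.values():
        counts.update(imports)
    return counts.most_common()


def export_csv(rows, stream):
    """Write (import, count) rows to a text stream with a header line."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["import", "count"])
    writer.writerows(rows)


# Hypothetical extraction results for two files.
sample = {
    "model_a.pkl": ["torch._utils._rebuild_tensor_v2", "collections.OrderedDict"],
    "model_b.pkl": ["collections.OrderedDict"],
}

rows = aggregate_imports(sample)
buffer = io.StringIO()
export_csv(rows, buffer)
print(buffer.getvalue())
```

`Counter.most_common()` handles the sort step; to count files rather than occurrences, pass `set(imports)` to `counts.update` instead.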
The import extraction leverages fickling's pickle-to-AST decompilation, which converts GLOBAL and STACK_GLOBAL opcodes into Python import statements.
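To give a rough sense of what opcode-level import extraction involves, the sketch below uses the standard library's `pickletools.genops` as a stand-in. It is not fickling's decompiler and builds no AST; it only walks the opcode stream without executing it. The `STACK_GLOBAL` handling is a heuristic that assumes the module and name were the two most recently pushed strings (directly or through the memo), which holds for pickles written by the standard pickler but not necessarily for hand-crafted ones.

```python
import pickle
import pickletools
from collections import OrderedDict

STRING_OPS = {"SHORT_BINUNICODE", "BINUNICODE", "BINUNICODE8", "UNICODE"}
GET_OPS = {"GET", "BINGET", "LONG_BINGET"}
PUT_OPS = {"PUT", "BINPUT", "LONG_BINPUT"}


def extract_imports(data: bytes) -> list[str]:
    """Return dotted names referenced by GLOBAL / STACK_GLOBAL opcodes.

    The pickle is only parsed, never loaded, so no code in it is executed.
    """
    imports = []
    memo = {}     # memo index -> string value (or None if not a tracked string)
    pushed = []   # string-like values pushed so far, in order
    top = None    # value pushed by the previous opcode, if it was a string
    for opcode, arg, _pos in pickletools.genops(data):
        name = opcode.name
        if name in STRING_OPS:
            top = arg
            pushed.append(top)
        elif name in GET_OPS:
            top = memo.get(arg)
            pushed.append(top)
        elif name == "MEMOIZE":
            memo[len(memo)] = top
        elif name in PUT_OPS:
            memo[arg] = top
        elif name == "GLOBAL":
            # genops reports the argument as "module name", separated by a space
            module, attr = arg.split(" ", 1)
            imports.append(f"{module}.{attr}")
            top = None
        elif name == "STACK_GLOBAL":
            # Heuristic: module and name are the last two pushed strings
            if len(pushed) >= 2 and all(isinstance(s, str) for s in pushed[-2:]):
                imports.append(f"{pushed[-2]}.{pushed[-1]}")
            top = None
        else:
            top = None
    return imports


payload = pickle.dumps(OrderedDict(a=1), protocol=4)
print(extract_imports(payload))  # ['collections.OrderedDict']
```

Protocol 4 and later emit `STACK_GLOBAL`, while earlier protocols emit `GLOBAL`, so a corpus that mixes protocols needs both branches.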