Principle:Protectai Modelscan Model File Abstraction
| Knowledge Sources | |
|---|---|
| Domains | ML_Security, Software_Architecture |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Model File Abstraction is a uniform interface that wraps both filesystem paths and in-memory byte streams, so that scanners can process top-level files and zip archive entries identically.
Description
Model File Abstraction solves a key challenge in model scanning: ML model files can exist as standalone files on disk or as entries within zip archives (e.g., PyTorch .pt files, .npz files, .keras files). Without this abstraction, every scanner would need separate code paths for filesystem files and zip entries.
The abstraction provides a unified interface with three capabilities:
- Source identification: A path-like identifier for the file (filesystem path or "archive:entry" notation)
- Stream access: A seekable byte stream for reading file contents
- Context metadata: A key-value store for attaching preprocessing results (e.g., detected format)
The abstraction also implements the context manager protocol, automatically opening file streams on entry and closing them on exit, preventing resource leaks during scanning.
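A minimal sketch of the three capabilities and the context manager behaviour follows. The class name `SketchModel` and its internals are illustrative assumptions, not modelscan's actual implementation. In particular, leaving caller-supplied streams open on exit is a choice made for this sketch.

```python
from pathlib import Path
from typing import IO, Any, Dict, Optional, Union


class SketchModel:
    """Hypothetical stand-in for the abstraction; not modelscan's real class."""

    def __init__(self, source: Union[str, Path], stream: Optional[IO[bytes]] = None):
        self._source = Path(source)
        self._stream = stream
        # Assumption: only streams opened by this object are closed by it.
        self._owns_stream = stream is None
        self._context: Dict[str, Any] = {}

    def __enter__(self) -> "SketchModel":
        # Open the file lazily when no pre-opened stream was supplied.
        if self._stream is None:
            self._stream = open(self._source, "rb")
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        if self._stream is not None and self._owns_stream:
            self._stream.close()
            self._stream = None

    def get_source(self) -> Path:
        return self._source

    def get_stream(self, offset: int = 0) -> IO[bytes]:
        if self._stream is None:
            raise ValueError("stream is not open; use the object in a 'with' block")
        self._stream.seek(offset)
        return self._stream

    def get_context(self, key: str) -> Any:
        return self._context.get(key)

    def set_context(self, key: str, value: Any) -> None:
        self._context[key] = value
```

Used as `with SketchModel(path) as model: data = model.get_stream().read()`, the file handle is released when the block exits, even if a scanner raises partway through.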
Usage
Apply this principle when:
- Understanding how modelscan handles both regular files and zip archive contents
- Implementing a scanner that needs to read model file bytes
- Working with the middleware pipeline that attaches format context to models
- Iterating over files in a directory that may contain zip archives
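To illustrate the scanner and middleware cases in the list above, the sketch below pairs a hypothetical middleware step that tags a model with a detected format and a hypothetical scanner that reads the tag back. `ModelLike`, `InMemoryModel`, `tag_format`, `scan_pickle_header`, and the `"format"` context key are invented for illustration. Only the `get_source`/`get_stream`/`get_context`/`set_context` shape comes from this page.

```python
import io
from pathlib import Path
from typing import IO, Any, Dict, Optional, Protocol


class ModelLike(Protocol):
    """The interface shape described on this page."""
    def get_source(self) -> Path: ...
    def get_stream(self, offset: int = 0) -> IO[bytes]: ...
    def get_context(self, key: str) -> Any: ...
    def set_context(self, key: str, value: Any) -> None: ...


class InMemoryModel:
    """Hypothetical stand-in backed by bytes, used only to exercise the sketch."""

    def __init__(self, source: str, data: bytes):
        self._source = Path(source)
        self._stream = io.BytesIO(data)
        self._context: Dict[str, Any] = {}

    def get_source(self) -> Path:
        return self._source

    def get_stream(self, offset: int = 0) -> IO[bytes]:
        self._stream.seek(offset)
        return self._stream

    def get_context(self, key: str) -> Any:
        return self._context.get(key)

    def set_context(self, key: str, value: Any) -> None:
        self._context[key] = value


def tag_format(model: ModelLike) -> None:
    """Hypothetical middleware: record a format guess from the first two bytes."""
    header = model.get_stream().read(2)
    # Pickle protocol 2 and later begin with the PROTO opcode 0x80 and a version byte.
    is_pickle = len(header) == 2 and header[0] == 0x80 and 2 <= header[1] <= 5
    model.set_context("format", "pickle" if is_pickle else "unknown")


def scan_pickle_header(model: ModelLike) -> Optional[int]:
    """Hypothetical scanner: return the pickle protocol version, or None if skipped."""
    if model.get_context("format") != "pickle":
        return None
    # Seeking back to offset 0 lets the scanner reread bytes the middleware consumed.
    return model.get_stream(0).read(2)[1]
```

Because both functions depend only on the interface, the same code would work unchanged whether the bytes come from a file on disk or from a zip entry.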
Theoretical Basis
The abstraction follows the Adapter pattern, presenting a uniform interface over two different data sources:
```python
# Pseudo-code for the Model abstraction
from pathlib import Path
from typing import IO, Any


class Model:
    def __init__(self, source, stream=None):
        """
        source: Path (for files) or str "archive:entry" (for zip contents)
        stream: None (will open file) or IO[bytes] (pre-opened zip entry)
        """

    def get_source(self) -> Path:
        """Return path identifier."""

    def get_stream(self, offset=0) -> IO[bytes]:
        """Return seekable byte stream positioned at offset."""

    def get_context(self, key) -> Any:
        """Get metadata (e.g., detected format)."""

    def set_context(self, key, value) -> None:
        """Set metadata (used by middleware)."""
```
The iteration logic in _iterate_models() produces Model objects for both cases:
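```python
# Pseudo-code for model iteration
from zipfile import ZipFile, is_zipfile

for file in files:
    with Model(file) as model:
        yield model  # Top-level file
        if is_zipfile(file):
            with ZipFile(file) as archive:
                for entry in archive.namelist():
                    yield Model(f"{file}:{entry}", archive.open(entry))  # Zip entry
```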
This ensures scanners receive a consistent interface regardless of whether the data comes from disk or a zip entry.
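As a runnable sketch of that iteration, the example below walks a list of paths and yields one object per top-level file plus one per zip entry. `Entry` and `iterate_entries` are hypothetical names. The sketch relies on the standard library's `zipfile.is_zipfile` for detection, which may differ from how modelscan decides that a file is an archive.

```python
import zipfile
from pathlib import Path
from typing import IO, Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    """Hypothetical pairing of a source identifier with an open byte stream."""
    source: str
    stream: IO[bytes]


def iterate_entries(files: Iterable[Path]) -> Iterator[Entry]:
    """Yield each top-level file, then each entry of any file that is a zip archive."""
    for file in files:
        with open(file, "rb") as stream:
            yield Entry(str(file), stream)  # top-level file
        if not zipfile.is_zipfile(file):
            continue
        with zipfile.ZipFile(file) as archive:
            for name in archive.namelist():
                with archive.open(name) as entry_stream:
                    # "archive:entry" notation identifies the zip member.
                    yield Entry(f"{file}:{name}", entry_stream)
```

In this sketch a consumer must finish reading each stream before advancing the generator, since the enclosing `with` block closes the stream on the next step.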