Heuristic:Protectai Modelscan Stricter Zip Detection

Knowledge Sources	modelscan PyTorch Python Bug
Domains	Security, Optimization
Last Updated	2026-02-14 12:00 GMT

Overview

ModelScan uses a stricter zip file detection method than Python's `zipfile.is_zipfile()` to avoid false positives with PyTorch model files.

Description

Python's standard `zipfile.is_zipfile()` does not require a file to begin with a zip header. It decides based on zip structures found elsewhere in the file (in CPython, the end-of-central-directory record near the end), so it can return `True` for binary model files that merely contain zip-like bytes (see bugs.python.org/issue28494). ModelScan instead checks only the first 4 bytes of the file for the local file header magic number (`PK\x03\x04`), matching the behavior of PyTorch's own serialization code. This approach is safe because ModelScan expects files generated by `torch.save` or `torch.jit.save`, where the zip header is always at the start.
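
The difference can be reproduced with a small experiment. The sketch below is illustrative only: `starts_with_zip_magic` is a hypothetical helper, not ModelScan's function, and the blob is synthetic. It appends an empty end-of-central-directory record (the signature `PK\x05\x06` followed by 18 zero bytes) to arbitrary bytes, which should be enough for `zipfile.is_zipfile()` to accept data that does not start with a zip header.

```python
import io
import zipfile

ZIP_LOCAL_HEADER_MAGIC = b"PK\x03\x04"


def starts_with_zip_magic(data: bytes) -> bool:
    """Hypothetical helper: True only if the zip magic sits at offset 0."""
    return data[:4] == ZIP_LOCAL_HEADER_MAGIC


# Synthetic binary blob standing in for a non-zip model file ...
payload = b"\x80\x02 arbitrary model bytes " * 4
# ... that happens to end with an empty end-of-central-directory record.
empty_eocd = b"PK\x05\x06" + b"\x00" * 18
blob = payload + empty_eocd

lenient = zipfile.is_zipfile(io.BytesIO(blob))
strict = starts_with_zip_magic(blob)
print(f"zipfile.is_zipfile: {lenient}, strict check: {strict}")
```

The two checks disagree on this blob, which is the kind of collision the stricter check is meant to rule out.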

Usage

This heuristic is applied automatically during model file iteration. Be aware that the stricter detection does not recognize non-standard zip files whose header is not at byte offset 0: a valid zip with prepended data is treated by ModelScan as a non-archive file.
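
To make the trade-off concrete, the following sketch builds a valid zip in memory, prepends a few bytes, and compares the two checks. It is an illustration rather than ModelScan's code: `header_at_offset_zero`, the member name `data.pkl`, and the prefix bytes are all made up for the example.

```python
import io
import zipfile


def header_at_offset_zero(data: bytes) -> bool:
    """Hypothetical stand-in for a strict check: magic must be the first 4 bytes."""
    return data[:4] == b"PK\x03\x04"


# Build a small, valid zip archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.pkl", b"example payload")
plain_zip = buffer.getvalue()

# The same archive with arbitrary bytes in front of it.
prefixed_zip = b"#!stub\n" + plain_zip

for label, data in (("plain", plain_zip), ("prefixed", prefixed_zip)):
    lenient = zipfile.is_zipfile(io.BytesIO(data))
    strict = header_at_offset_zero(data)
    print(f"{label}: is_zipfile={lenient}, strict={strict}")
```

The prefixed archive is still readable with the standard `zipfile` module, yet a check on the first 4 bytes rejects it, so a scanner relying on that check would handle it as an ordinary file.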

The Insight (Rule of Thumb)

Action: Check only the first 4 bytes of a file to determine if it is a zip archive, rather than scanning the entire binary.
Value: Magic bytes `PK\x03\x04` (hex: `50 4B 03 04`) at offset 0.
Trade-off: May miss zip files with non-standard headers or prepended data, but eliminates false positives from binary model files containing zip-like byte sequences.

Reasoning

The implementation in `tools/utils.py:53-81` includes a detailed comment explaining the rationale:

# This is a stricter implementation than zipfile.is_zipfile().
# zipfile.is_zipfile() is True if the magic number appears anywhere in the
# binary. Since we expect the files here to be generated by torch.save or
# torch.jit.save, it's safe to only check the start bytes and avoid
# collisions and assume the zip has only 1 file.
# See bugs.python.org/issue28494.

The function reads the first 4 bytes into a list of single-byte values and compares that list with the expected magic sequence:

local_header_magic_number = [b"P", b"K", b"\x03", b"\x04"]
return read_bytes == local_header_magic_number

This code is copied from the PyTorch source (cited at `tools/utils.py:51-52`):

# copied from pytorch code
# https://github.com/pytorch/pytorch/blob/0b3316ad2c6ff61416597ef29e8865876dcb12f5/torch/serialization.py#L66

The PyTorch magic number constant and helper functions in `tools/utils.py:18-48` are likewise copied from the PyTorch source, which keeps ModelScan byte-level compatible with PyTorch's serialization format.
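
The byte-by-byte comparison described in this section can be sketched as a stream-based function. This is reconstructed from the description and excerpts above rather than copied from `tools/utils.py`: the name `is_zip_stream` is hypothetical, and restoring the stream position afterwards is an assumption about how a caller would want to reuse the file object.

```python
import io
from typing import IO

LOCAL_HEADER_MAGIC = [b"P", b"K", b"\x03", b"\x04"]


def is_zip_stream(f: IO[bytes]) -> bool:
    """Return True if the next 4 bytes of `f` are the zip local-header magic."""
    start = f.tell()
    read_bytes = []
    # Read one byte at a time so a short file simply yields a shorter list.
    for _ in range(4):
        byte = f.read(1)
        if byte == b"":
            break
        read_bytes.append(byte)
    f.seek(start)  # assumption: leave the stream where we found it
    return read_bytes == LOCAL_HEADER_MAGIC


zip_like = io.BytesIO(b"PK\x03\x04rest of archive")
magic_not_first = io.BytesIO(b"\x80\x02PK\x03\x04")
too_short = io.BytesIO(b"PK")

for name, stream in (("zip_like", zip_like),
                     ("magic_not_first", magic_not_first),
                     ("too_short", too_short)):
    print(name, is_zip_stream(stream))
```

Comparing a list of single-byte values mirrors the excerpt. For ordinary files and in-memory buffers, `f.read(4) == b"PK\x03\x04"` would be a shorter equivalent.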

Related Pages

Implementation:Protectai_Modelscan_ModelScan_Scan
