Heuristic:Protectai Modelscan Stricter Zip Detection
| Knowledge Sources | |
|---|---|
| Domains | Security, Optimization |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
ModelScan uses a stricter zip file detection method than Python's `zipfile.is_zipfile()` to avoid false positives with PyTorch model files.
Description
Python's standard `zipfile.is_zipfile()` does not require the zip signature to sit at the start of the file: it accepts a file when it finds zip magic bytes elsewhere in the binary (in CPython, it searches for the end-of-central-directory record, `PK\x05\x06`, near the end of the file). This can cause false positives when scanning binary model files that happen to contain those bytes. ModelScan instead checks only the first 4 bytes of the file for the local file header signature (`PK\x03\x04`), matching the behavior of PyTorch's own serialization code. This approach is safe because ModelScan expects files generated by `torch.save` or `torch.jit.save`, where the zip header is always at the start.
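As a rough illustration of the false positive, the sketch below builds a byte blob that is not meant to be an archive but ends with an empty end-of-central-directory record (`PK\x05\x06` followed by 18 zero bytes), which is what CPython's `zipfile.is_zipfile()` looks for. `starts_with_zip_magic` is a hypothetical helper written for this example, not ModelScan's own function.

```python
import io
import zipfile

LOCAL_HEADER_MAGIC = b"PK\x03\x04"  # zip local file header signature


def starts_with_zip_magic(data: bytes) -> bool:
    """Hypothetical helper: True only if the zip signature is at offset 0."""
    return data[:4] == LOCAL_HEADER_MAGIC


# Arbitrary binary content that happens to end with an empty
# end-of-central-directory record (4-byte signature + 18 zero bytes).
blob = b"\x00" * 64 + b"PK\x05\x06" + b"\x00" * 18

lenient = zipfile.is_zipfile(io.BytesIO(blob))  # standard-library check
strict = starts_with_zip_magic(blob)            # offset-0 check

print(f"zipfile.is_zipfile: {lenient}, offset-0 check: {strict}")
```

On CPython, the standard-library check is expected to accept this blob while the offset-0 check rejects it.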
Usage
This heuristic is applied automatically during model file iteration. Be aware that this stricter detection means some non-standard zip files (where the header is not at byte offset 0) will not be recognized as archives. If a file is a valid zip but has prefixed data, ModelScan will treat it as a non-archive file.
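The trade-off can be seen with a small sketch, assuming an in-memory archive and a made-up shell-script prefix. The helper `starts_with_zip_magic` and the member name `data.pkl` are hypothetical and exist only for this example.

```python
import io
import zipfile


def starts_with_zip_magic(data: bytes) -> bool:
    """Hypothetical helper: True only if b'PK\\x03\\x04' is at offset 0."""
    return data[:4] == b"PK\x03\x04"


# Build a small, valid zip archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("data.pkl", b"payload")
plain = buffer.getvalue()

# The same archive with data prepended, as in a self-extracting file.
prefixed = b"#!/bin/sh\n" + plain

plain_is_zip = starts_with_zip_magic(plain)        # header at offset 0
prefixed_is_zip = starts_with_zip_magic(prefixed)  # header pushed to offset 10
stdlib_accepts_prefixed = zipfile.is_zipfile(io.BytesIO(prefixed))

print(plain_is_zip, prefixed_is_zip, stdlib_accepts_prefixed)
```

Under an offset-0 check the prefixed file is not recognized as an archive, even though the standard library still accepts it as a zip.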
The Insight (Rule of Thumb)
- Action: Check only the first 4 bytes of a file to determine if it is a zip archive, rather than scanning the entire binary.
- Value: Magic bytes `PK\x03\x04` (hex: `50 4B 03 04`) at offset 0.
- Trade-off: May miss zip files with non-standard headers or prepended data, but eliminates false positives from binary model files containing zip-like byte sequences.
Reasoning
The implementation in `tools/utils.py:53-81` includes a detailed comment explaining the rationale:

    # This is a stricter implementation than zipfile.is_zipfile().
    # zipfile.is_zipfile() is True if the magic number appears anywhere in the
    # binary. Since we expect the files here to be generated by torch.save or
    # torch.jit.save, it's safe to only check the start bytes and avoid
    # collisions and assume the zip has only 1 file.
    # See bugs.python.org/issue28494.

The function reads the first 4 bytes of the file and compares them, as a list of single-byte values, with the local file header magic number:

    local_header_magic_number = [b"P", b"K", b"\x03", b"\x04"]
    return read_bytes == local_header_magic_number

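For readers who want to try the check on a file-like object, here is a minimal sketch of the same idea. It is not the code from `tools/utils.py`: the name `is_zip_at_offset`, the `start_offset` parameter, and the single `read(4)` call are assumptions made for this example, and the stream position is restored so the caller can keep reading afterwards.

```python
import io
from typing import IO

LOCAL_HEADER_MAGIC = b"PK\x03\x04"  # zip local file header signature


def is_zip_at_offset(f: IO[bytes], start_offset: int = 0) -> bool:
    """Hypothetical sketch: compare the 4 bytes at start_offset with the
    zip local header signature, leaving the stream position unchanged."""
    position = f.tell()
    try:
        f.seek(start_offset)
        head = f.read(4)  # may return fewer than 4 bytes on a short file
    finally:
        f.seek(position)  # restore the caller's position
    return head == LOCAL_HEADER_MAGIC


stream = io.BytesIO(b"PK\x03\x04" + b"rest of archive")
stream.seek(7)  # simulate a caller that has already read part of the file
result = is_zip_at_offset(stream)
print(result, stream.tell())
```

Restoring the position matters when the same handle is passed on to a scanner afterwards, since the check should not consume any of the file.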
This code is copied from the PyTorch source (cited at `tools/utils.py:51-52`):

    # copied from pytorch code
    # https://github.com/pytorch/pytorch/blob/0b3316ad2c6ff61416597ef29e8865876dcb12f5/torch/serialization.py#L66

The PyTorch magic number constant and helper functions in `tools/utils.py:18-48` are likewise copied from the PyTorch source, which keeps ModelScan's checks byte-level compatible with PyTorch's serialization format.