Principle:Ggml org Ggml File Type Detection
| Knowledge Sources | |
|---|---|
| Domains | Classification, Inference |
| Last Updated | 2026-02-10 |
Overview
File Type Detection uses neural network inference over raw byte content to classify files into one of 113 known format categories.
Description
Traditional file-type identification relies on magic bytes -- fixed byte sequences at known offsets (e.g., %PDF at offset 0 for PDF files). While fast, magic-byte checks fail on truncated files, format variants without canonical headers, and adversarial or malformed inputs. File Type Detection as implemented in GGML's Magika example takes a fundamentally different approach: it treats file classification as a machine-learning inference problem.
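For contrast, the magic-byte approach described above can be sketched in a few lines. This is an illustrative sketch only: the `MAGIC_SIGNATURES` table and the `detect_by_magic` helper are hypothetical names, not part of GGML or Magika, and the table holds just two well-known signatures.

```python
from typing import Optional

# Hypothetical signature table: (offset, magic bytes, label).
MAGIC_SIGNATURES = [
    (0, b"%PDF", "pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "png"),
]


def detect_by_magic(data: bytes) -> Optional[str]:
    """Return the label of the first signature found at its fixed offset, else None."""
    for offset, magic, label in MAGIC_SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


if __name__ == "__main__":
    print(detect_by_magic(b"%PDF-1.7\n"))    # canonical header
    print(detect_by_magic(b"\n%PDF-1.7\n"))  # header shifted by one byte
    print(detect_by_magic(b"%PD"))           # truncated header
```

The second and third calls show the failure modes named above: a single leading byte or a truncated header is enough to defeat a fixed-offset comparison.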
The system reads three regions of a file -- the beginning (first 512 bytes), the middle (512 bytes around the midpoint), and the end (last 512 bytes) -- and encodes each byte as a one-hot vector of 257 dimensions (256 byte values plus a padding token). The resulting 1536 x 257 input tensor (3 x 512 positions, 257 dimensions each) is fed through a small neural network consisting of:
- A dense layer with GELU activation
- Layer normalization
- Two additional dense layers with GELU activation
- Global max pooling over the sequence dimension
- A final classification head with softmax over 113 file-type labels
The model is stored in GGUF format and loaded using the standard GGML tensor infrastructure. Inference runs on the CPU backend and produces a probability distribution over all supported file types, from which the top predictions are reported.
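The input preparation described above can be sketched as follows. This is a minimal sketch, not the Magika example's code: the function names are hypothetical, the padding token is assumed to be index 256, and the padding placement (right-padding the beginning and middle windows, left-padding the end window) is an assumption made for illustration.

```python
from typing import List

REGION_SIZE = 512     # bytes per sampled region, as stated above
VOCAB_SIZE = 257      # 256 byte values plus one padding token
PADDING_TOKEN = 256   # assumption: the padding token uses the last index


def sample_regions(data: bytes, size: int = REGION_SIZE) -> List[int]:
    """Return beginning + middle + end windows as one token list of length 3 * size."""
    n = len(data)
    beg = list(data[:size])
    mid_start = max(0, n // 2 - size // 2)
    mid = list(data[mid_start:mid_start + size])
    end = list(data[max(0, n - size):])

    # Assumption: short windows are filled with the padding token so that
    # the output length never depends on the file length.
    beg = beg + [PADDING_TOKEN] * (size - len(beg))
    mid = mid + [PADDING_TOKEN] * (size - len(mid))
    end = [PADDING_TOKEN] * (size - len(end)) + end
    return beg + mid + end


def one_hot(token: int, vocab: int = VOCAB_SIZE) -> List[float]:
    """Encode one token as a vector with a single 1.0 at index `token`."""
    row = [0.0] * vocab
    row[token] = 1.0
    return row


def encode(data: bytes) -> List[List[float]]:
    """Build the 1536 x 257 one-hot input matrix for one file."""
    return [one_hot(t) for t in sample_regions(data)]


if __name__ == "__main__":
    x = encode(b"%PDF-1.7")
    print(len(x), len(x[0]))
```

For files shorter than 512 bytes the three windows overlap and are mostly padding, which is why a dedicated padding symbol is needed in addition to the 256 byte values.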
Usage
Apply this principle when building systems that need robust file-type identification beyond what magic-byte heuristics can provide. It is particularly useful for security scanning, content-based routing, and data pipeline ingestion where files may be renamed, truncated, or intentionally disguised. The Magika example demonstrates how to structure a complete GGML inference pipeline -- model loading from GGUF, graph construction, input preparation, and result extraction -- which also makes it a useful reference for any small-model deployment on GGML.
Theoretical Basis
The theoretical foundation combines byte-level representation learning with content-based classification:
- Byte-Level One-Hot Encoding -- Each byte is represented as a 257-dimensional one-hot vector (256 values plus a padding sentinel). This avoids tokenization assumptions and allows the model to learn directly from raw binary content. The padding token handles files shorter than the expected region size, ensuring fixed-size inputs regardless of file length.
- Multi-Region Sampling -- Rather than reading the entire file (which could be gigabytes), the system samples three fixed-size windows: beginning, middle, and end. This is grounded in the observation that file format signatures are concentrated near the start (headers), the end (trailers, checksums), and sometimes the middle (structural markers). Sampling all three regions gives the model a representative view of the file's structure at negligible I/O cost.
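- Dense-Layer Feature Extraction -- The initial dense layer projects each 257-dimensional one-hot vector into a 128-dimensional embedding. Because the input is one-hot and the layer is applied to each position independently, this projection is equivalent to a learned per-byte embedding lookup. Multi-byte patterns such as "byte 0x89 followed by 0x50 0x4E 0x47" (PNG header) can therefore only be captured downstream, where the network combines information across positions, and they are represented as distributed features rather than explicit rules.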
- Global Max Pooling for Position Invariance -- After processing through dense and normalization layers, global max pooling collapses the sequence dimension by keeping, for each feature channel, the maximum value over all positions. This makes the classification invariant to the exact position of distinguishing byte patterns within the sampled sequence, so a format signature that shifts by a few bytes across file variants is still detected.
- Softmax Classification -- The output layer produces a probability distribution over 113 file types, allowing both a hard prediction (argmax) and a confidence score. Low-confidence predictions can be flagged for fallback to other identification methods.
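Three of the points above (one-hot projection, global max pooling, and softmax with a confidence check) can be illustrated with plain Python lists. This is a sketch of the underlying arithmetic only, not the GGML graph: all function names are hypothetical, the weights in the usage example are toy values, and the 0.5 confidence threshold is an arbitrary assumption.

```python
import math
from typing import List, Optional, Sequence, Tuple


def project_one_hot(token: int, weights: Sequence[Sequence[float]]) -> List[float]:
    """Multiply a one-hot vector by a [vocab][dim] matrix; the result is row `token`."""
    vocab, dim = len(weights), len(weights[0])
    one_hot = [1.0 if i == token else 0.0 for i in range(vocab)]
    return [sum(one_hot[i] * weights[i][j] for i in range(vocab)) for j in range(dim)]


def global_max_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a [seq_len][channels] matrix to [channels] by taking the max over positions."""
    return [max(column) for column in zip(*features)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: Sequence[float], labels: Sequence[str],
            threshold: float = 0.5) -> Tuple[Optional[str], float]:
    """Return (label, confidence); label is None when confidence is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]   # caller may fall back to another method
    return labels[best], probs[best]


if __name__ == "__main__":
    toy_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(project_one_hot(1, toy_weights))

    # The same feature spike at two different positions pools to the same vector.
    early = [[0.0, 0.0], [5.0, 1.0], [0.0, 0.0]]
    late = [[0.0, 0.0], [0.0, 0.0], [5.0, 1.0]]
    print(global_max_pool(early) == global_max_pool(late))

    print(predict([4.0, 0.0, 0.0], ["pdf", "png", "zip"]))
    print(predict([0.1, 0.0, 0.0], ["pdf", "png", "zip"]))
```

Returning `None` for low-confidence results keeps the fallback decision with the caller, which can then try a magic-byte check or another identification method.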