Principle: turboderp/exllamav2 Model Compilation
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Serialization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Model compilation is the final assembly step in a quantization pipeline where individually quantized per-layer weight tensors are combined with non-quantized parameters into a self-contained, distributable model directory.
Description
During the quantization process, each linear layer is quantized independently and saved as a separate safetensors file. However, a usable model also requires non-quantized components such as:
- Token embeddings (the input embedding matrix)
- Position embeddings (if present)
- Layer normalization weights (RMSNorm or LayerNorm parameters)
- Router/gate weights for Mixture-of-Experts architectures
- Multimodal components (vision encoder weights, if applicable)
The compilation step collects all these tensors, interleaves them in the correct order, and writes them into one or more sharded safetensors files. It also updates the model's config.json with a quantization_config section that records the quantization method, target bits per weight, calibration parameters, and the exllamav2 version. The result is a directory that exllamav2 can load directly for inference, with no reference to the original FP16 model needed.
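As an illustrative sketch of the interleaving, the assembly order can be thought of as embeddings first, then each layer's norms and quantized projections, then the final norm and output head. The key names below follow a common Llama-style layout and are assumptions for illustration, not exllamav2's actual internals:

```python
def assemble_tensor_order(num_layers):
    # Token embeddings come first (stored non-quantized).
    order = ["model.embed_tokens.weight"]
    for i in range(num_layers):
        prefix = f"model.layers.{i}"
        # Layer norm weights (FP16) interleaved with quantized linears.
        order.append(f"{prefix}.input_layernorm.weight")
        order.append(f"{prefix}.self_attn.q_proj.q_weight")   # quantized
        order.append(f"{prefix}.post_attention_layernorm.weight")
        order.append(f"{prefix}.mlp.down_proj.q_weight")      # quantized
    # Final norm and output head close out the file.
    order.append("model.norm.weight")
    order.append("lm_head.weight")
    return order
```

A real model has several more projections per layer; the point is that quantized and non-quantized tensors are written in a single, layer-ordered stream.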
Usage
Model compilation is the fifth and final step in the EXL2 conversion pipeline, executed after all layers have been quantized. It produces the distributable output that end users load for inference.
Theoretical Basis
Safetensors Sharding
Large models may exceed the memory available to a single process or the file size limits of certain filesystems. The compilation step supports sharding: splitting the output into multiple files, each no larger than a configurable maximum (default: 8192 MB). The sharding algorithm is straightforward:
- Accumulate tensors into an output buffer.
- When the buffer exceeds the shard size threshold, write the current buffer as a numbered shard file.
- Continue until all tensors are written.
Naming Convention
- Single shard: `output.safetensors`
- Multiple shards: `output-00001-of-NNNNN.safetensors`, `output-00002-of-NNNNN.safetensors`, etc.
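The naming scheme can be captured in a small helper (a sketch; the base name and padding width are taken from the pattern above):

```python
def shard_filename(index, total, base="output"):
    # A single shard keeps the plain name; multiple shards get
    # zero-padded "-XXXXX-of-NNNNN" suffixes (1-based index).
    if total == 1:
        return f"{base}.safetensors"
    return f"{base}-{index:05d}-of-{total:05d}.safetensors"
```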
Metadata Injection
The config.json is augmented with:
```json
{
    "quantization_config": {
        "quant_method": "exl2",
        "version": "0.x.y",
        "bits": 4.125,
        "head_bits": 6,
        "calibration": {
            "rows": 100,
            "length": 2048,
            "dataset": "(default)"
        }
    }
}
```
This metadata allows downstream tools and users to understand the quantization provenance of the model without inspecting the weight tensors.
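A minimal sketch of the injection step, assuming config.json has already been loaded into a dict (field names follow the example above; the helper and its parameters are illustrative):

```python
def add_quantization_config(config, bits, head_bits, version,
                            cal_rows=100, cal_length=2048,
                            cal_dataset="(default)"):
    """Record quantization provenance in a loaded config.json dict."""
    config["quantization_config"] = {
        "quant_method": "exl2",
        "version": version,
        "bits": bits,
        "head_bits": head_bits,
        "calibration": {
            "rows": cal_rows,
            "length": cal_length,
            "dataset": cal_dataset,
        },
    }
    return config
```

Writing the result back is then just `json.dump(config, f, indent=4)`.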
Full Model Compilation
When the compile_full option is enabled, the compilation step also copies all non-tensor files from the source model directory (tokenizer files, special tokens map, generation config, etc.) to the output directory. This produces a fully self-contained model directory. Binary (.bin) files are explicitly excluded to avoid copying PyTorch checkpoint files.
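The copy step might look like the following sketch. The exact file filter is an assumption based on the description above (skip weight files and PyTorch checkpoints, copy everything else):

```python
import os
import shutil

def copy_aux_files(src_dir, dst_dir):
    """Copy tokenizer/config support files into the output directory,
    skipping weight files (.safetensors) and PyTorch checkpoints (.bin)."""
    for name in os.listdir(src_dir):
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue  # ignore subdirectories
        if name.endswith((".bin", ".safetensors")):
            continue  # exclude weights and checkpoint files
        shutil.copy2(path, os.path.join(dst_dir, name))
```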