Implementation:Turboderp org Exllamav2 Compile Model
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Serialization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool, provided by exllamav2, for assembling individually quantized layer tensors and non-quantized parameters into a final distributable EXL2 model.
Description
The compile_model function iterates through all model modules in order, loading either the quantized tensor file (for linear layers) or the float-precision weight (for embeddings, norms, routers). It accumulates tensors into an output buffer and writes sharded safetensors files whenever the buffer exceeds the configured shard size. After all model tensors are written, it optionally copies multimodal/vision tensors, copies non-tensor files from the source model directory, and updates config.json with quantization metadata.
Usage
Call compile_model as the final step of the EXL2 conversion pipeline, after all layers have been quantized by quant.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/conversion/compile.py
- Lines: 61-295
Supporting Functions
```python
def get_f_module(job, module):
    """Load a non-quantized module's float weight(s) as a dict."""

def get_q_module(job, module):
    """Load a quantized module's packed tensors from its safetensors file."""
```
Signature
```python
@torch.inference_mode()
def compile_model(job, save_fn, model):
```
Import
```python
from exllamav2.conversion.compile import compile_model
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| job | dict | Yes | Conversion job state. Key fields: job["out_dir"] (working directory containing the out_tensor/ subdirectory with per-module safetensors), job["shard_size"] (max shard size in MB, default 8192), job["compile_full"] (output directory path for full model compilation, or None), job["cal_dataset"] (calibration dataset name for metadata), job["bits"], job["head_bits"], job["dataset_rows"], job["length"] |
| save_fn | callable | Yes | Callback to persist job state |
| model | ExLlamaV2 | Yes | The loaded model instance (used to iterate modules and retrieve architecture config) |
Outputs
| Name | Type | Description |
|---|---|---|
| output.safetensors | File | Single output file if the total size fits in one shard; saved to job["compile_full"] or job["out_dir"] |
| output-NNNNN-of-NNNNN.safetensors | Files | Sharded output files if the total size exceeds shard_size |
| config.json | File (updated) | Model config with added quantization_config section (only when compile_full is set) |
| Non-tensor files | Files (copied) | Tokenizer files, generation config, etc. copied from the source model (only when compile_full is set) |
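A minimal job dict covering the required input fields above might look like the following; the paths and values are illustrative assumptions, not defaults taken from the codebase:

```python
# Hypothetical job dict for compile_model; keys follow the I/O contract
# above, values are placeholders for illustration only.
job = {
    "out_dir": "/work/quant_job",               # contains out_tensor/ with per-module safetensors
    "shard_size": 8192,                         # max shard size in MB
    "compile_full": "/output/my_model_exl2",    # full-model output dir, or None
    "cal_dataset": None,                        # calibration dataset name for metadata
    "bits": 4.125,
    "head_bits": 6,
    "dataset_rows": 100,
    "length": 2048,
}
```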
Module Processing Order
The function processes modules in the model's native order, handling each type differently:
| Module Type | Treatment | Source |
|---|---|---|
| ExLlamaV2Embedding | Float weight | get_f_module |
| ExLlamaV2PosEmbedding | Float weight | get_f_module |
| ExLlamaV2Attention | Float norms + quantized Q/K/V/O projections + optional Q/K norms | get_f_module + get_q_module |
| ExLlamaV2MLP | Float norms + quantized gate/up/down projections | get_f_module + get_q_module |
| ExLlamaV2MoEMLP | Float norm + float router + quantized w1/w3/w2 per expert | get_f_module + get_q_module |
| ExLlamaV2ParallelDecoder | Float norm + quantized attn projections + quantized MLP projections | get_f_module + get_q_module |
| ExLlamaV2RMSNorm / ExLlamaV2LayerNorm | Float weight | get_f_module |
| ExLlamaV2Linear (lm_head) | Quantized weight | get_q_module |
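The dispatch in the table above can be sketched with stand-in classes and stub loaders; this is an illustration of the float-vs-quantized split, not the real control flow, which uses the actual ExLlamaV2 module classes and combines both paths inside attention/MLP modules:

```python
# Stand-in module classes; the real code uses ExLlamaV2Embedding,
# ExLlamaV2RMSNorm, ExLlamaV2Linear, etc.
class Embedding: key = "model.embed_tokens"
class RMSNorm:   key = "model.norm"
class Linear:    key = "lm_head"

def get_f_module(job, module):
    # Stub: would load the module's float weight(s) from disk.
    return {module.key + ".weight": "float-tensor"}

def get_q_module(job, module):
    # Stub: would load the module's packed quantized tensors.
    return {module.key + ".q_weight": "quantized-tensor"}

FLOAT_TYPES = (Embedding, RMSNorm)  # loaded via get_f_module
QUANT_TYPES = (Linear,)             # loaded via get_q_module

def collect_tensors(job, modules):
    out_dict = {}
    for m in modules:
        loader = get_f_module if isinstance(m, FLOAT_TYPES) else get_q_module
        out_dict.update(loader(job, m))
    return out_dict

tensors = collect_tensors({}, [Embedding(), RMSNorm(), Linear()])
```

In the real function, composite modules such as ExLlamaV2Attention call both loaders: get_f_module for their norms and get_q_module for each projection.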
Sharding Algorithm
```python
shard_bytes = job["shard_size"] * 1024 ** 2  # convert MB to bytes
# Accumulate tensors into out_dict.
# When current_size > shard_bytes:
#   split out_dict into save_dict (fits in the shard) and dont_save_dict (overflow),
#   write save_dict to output_temp_{file_index}.safetensors,
#   continue with dont_save_dict as the new out_dict.
# After all modules are processed, rename the temp files:
#   single file:    output.safetensors
#   multiple files: output-00001-of-NNNNN.safetensors, etc.
```
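The packing policy above can be illustrated with a self-contained sketch that works on plain byte counts instead of real tensors; the greedy splitting rule and the shard_tensors helper are assumptions based on this page, not the exact exllamav2 implementation:

```python
# Greedily pack named tensor sizes (in bytes) into shards of at most
# shard_mb megabytes, mirroring the accumulate/flush loop sketched above.
def shard_tensors(sizes: dict, shard_mb: int) -> list:
    shard_bytes = shard_mb * 1024 ** 2
    shards, current, current_size = [], {}, 0
    for name, size in sizes.items():
        if current and current_size + size > shard_bytes:
            shards.append(current)          # flush the full shard
            current, current_size = {}, 0
        current[name] = size
        current_size += size
    if current:
        shards.append(current)              # final partial shard
    return shards

# Toy tensor sizes; 8 MB shards keep the demo small.
sizes = {"emb": 3 * 1024**2, "q0": 6 * 1024**2,
         "q1": 6 * 1024**2, "head": 2 * 1024**2}
shards = shard_tensors(sizes, shard_mb=8)

# Naming follows the page: one shard -> output.safetensors,
# several -> output-00001-of-0000N.safetensors.
n = len(shards)
names = (["output.safetensors"] if n == 1
         else [f"output-{i + 1:05}-of-{n:05}.safetensors" for i in range(n)])
```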
Usage Examples
Basic Example
```python
from exllamav2.conversion.compile import compile_model

# After quant() has written all per-module safetensors
job["shard_size"] = 8192  # 8 GB per shard
job["compile_full"] = "/output/my_model_exl2_4.125bpw"

compile_model(job, save_fn, model)

# Result: /output/my_model_exl2_4.125bpw/ contains:
#   output.safetensors (or sharded output-NNNNN-of-NNNNN.safetensors)
#   config.json (with quantization_config)
#   tokenizer.json, tokenizer_config.json, etc.
```
Output Config Metadata
```python
import json

with open("/output/my_model_exl2_4.125bpw/config.json") as f:
    config = json.load(f)

print(config["quantization_config"])
# {
#     "quant_method": "exl2",
#     "version": "0.2.6",
#     "bits": 4.125,
#     "head_bits": 6,
#     "calibration": {
#         "rows": 100,
#         "length": 2048,
#         "dataset": "(default)"
#     }
# }
```
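Assembling that metadata block can be sketched as follows; the key names come from the example output above, while the helper name build_quantization_config and the version string are illustrative assumptions:

```python
# Hypothetical helper building the quantization_config section from job
# fields; the real code injects exllamav2.version.__version__ instead of
# this placeholder string.
def build_quantization_config(job, version="0.2.6"):
    return {
        "quant_method": "exl2",
        "version": version,
        "bits": job["bits"],
        "head_bits": job["head_bits"],
        "calibration": {
            "rows": job["dataset_rows"],
            "length": job["length"],
            # Assumed fallback when no calibration dataset was named.
            "dataset": job["cal_dataset"] or "(default)",
        },
    }

job = {"bits": 4.125, "head_bits": 6, "dataset_rows": 100,
       "length": 2048, "cal_dataset": None}
qc = build_quantization_config(job)
```

The resulting dict would then be written into config.json under the quantization_config key when compile_full is set.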
Dependencies
- torch -- tensor operations, inference mode
- safetensors -- reading per-module quantized tensors and writing final sharded output
- json -- reading and updating config.json
- shutil -- copying non-tensor files from the source model directory
- os, glob -- file system operations, listing tensor files
- exllamav2.version.__version__ -- version string injected into quantization metadata
Related Pages
Implements Principle
Requires Environment
- Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
- Environment:Turboderp_org_Exllamav2_Build_Toolchain