Implementation:Turboderp org Exllamav2 Compile Model
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Serialization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool, provided by exllamav2, for assembling individually quantized layer tensors and non-quantized parameters into a final distributable EXL2 model.
Description
The compile_model function iterates through all model modules in order, loading either the quantized tensor file (for linear layers) or the float-precision weight (for embeddings, norms, routers). It accumulates tensors into an output buffer and writes sharded safetensors files whenever the buffer exceeds the configured shard size. After all model tensors are written, it optionally copies multimodal/vision tensors, copies non-tensor files from the source model directory, and updates config.json with quantization metadata.
Usage
Call compile_model as the final step of the EXL2 conversion pipeline, after all layers have been quantized by quant.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/conversion/compile.py
- Lines: 61-295
Supporting Functions
```python
def get_f_module(job, module):
    """Load a non-quantized module's float weight(s) as a dict."""

def get_q_module(job, module):
    """Load a quantized module's packed tensors from its safetensors file."""
```
Signature
```python
@torch.inference_mode()
def compile_model(job, save_fn, model):
```
Import
```python
from exllamav2.conversion.compile import compile_model
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| job | dict | Yes | Conversion job state. Key fields: job["out_dir"] (working directory containing the out_tensor/ subdirectory with per-module safetensors), job["shard_size"] (max shard size in MB, default 8192), job["compile_full"] (output directory path for full model compilation, or None), job["cal_dataset"] (calibration dataset name for metadata), job["bits"], job["head_bits"], job["dataset_rows"], job["length"] |
| save_fn | callable | Yes | Callback to persist job state |
| model | ExLlamaV2 | Yes | The loaded model instance (used to iterate modules and retrieve architecture config) |
Outputs
| Name | Type | Description |
|---|---|---|
| output.safetensors | File | Single output file if the total size fits in one shard; saved to job["compile_full"] or job["out_dir"] |
| output-NNNNN-of-NNNNN.safetensors | Files | Sharded output files if the total size exceeds shard_size |
| config.json | File (updated) | Model config with added quantization_config section (only when compile_full is set) |
| Non-tensor files | Files (copied) | Tokenizer files, generation config, etc. copied from the source model (only when compile_full is set) |
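A minimal job dict covering the required input fields above might look like the following; the paths and values are illustrative assumptions, not defaults taken from the codebase:

```python
# Hypothetical job dict for compile_model; keys follow the I/O contract
# above, values are placeholders for illustration only.
job = {
    "out_dir": "/work/quant_job",               # contains out_tensor/ with per-module safetensors
    "shard_size": 8192,                         # max shard size in MB
    "compile_full": "/output/my_model_exl2",    # full-model output dir, or None
    "cal_dataset": None,                        # calibration dataset name for metadata
    "bits": 4.125,
    "head_bits": 6,
    "dataset_rows": 100,
    "length": 2048,
}
```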
Module Processing Order
The function processes modules in the model's native order, handling each type differently:
| Module Type | Treatment | Source |
|---|---|---|
| ExLlamaV2Embedding | Float weight | get_f_module |
| ExLlamaV2PosEmbedding | Float weight | get_f_module |
| ExLlamaV2Attention | Float norms + quantized Q/K/V/O projections + optional Q/K norms | get_f_module + get_q_module |
| ExLlamaV2MLP | Float norms + quantized gate/up/down projections | get_f_module + get_q_module |
| ExLlamaV2MoEMLP | Float norm + float router + quantized w1/w3/w2 per expert | get_f_module + get_q_module |
| ExLlamaV2ParallelDecoder | Float norm + quantized attn projections + quantized MLP projections | get_f_module + get_q_module |
| ExLlamaV2RMSNorm / ExLlamaV2LayerNorm | Float weight | get_f_module |
| ExLlamaV2Linear (lm_head) | Quantized weight | get_q_module |
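The dispatch in the table above can be sketched with stand-in classes and stub loaders; this is an illustration of the float-vs-quantized split, not the real control flow, which uses the actual ExLlamaV2 module classes and combines both paths inside attention/MLP modules:

```python
# Stand-in module classes; the real code uses ExLlamaV2Embedding,
# ExLlamaV2RMSNorm, ExLlamaV2Linear, etc.
class Embedding: key = "model.embed_tokens"
class RMSNorm:   key = "model.norm"
class Linear:    key = "lm_head"

def get_f_module(job, module):
    # Stub: would load the module's float weight(s) from disk.
    return {module.key + ".weight": "float-tensor"}

def get_q_module(job, module):
    # Stub: would load the module's packed quantized tensors.
    return {module.key + ".q_weight": "quantized-tensor"}

FLOAT_TYPES = (Embedding, RMSNorm)  # loaded via get_f_module
QUANT_TYPES = (Linear,)             # loaded via get_q_module

def collect_tensors(job, modules):
    out_dict = {}
    for m in modules:
        loader = get_f_module if isinstance(m, FLOAT_TYPES) else get_q_module
        out_dict.update(loader(job, m))
    return out_dict

tensors = collect_tensors({}, [Embedding(), RMSNorm(), Linear()])
```

In the real function, composite modules such as ExLlamaV2Attention call both loaders: get_f_module for their norms and get_q_module for each projection.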
Sharding Algorithm
```python
shard_bytes = job["shard_size"] * 1024 ** 2  # convert MB to bytes
# Accumulate tensors into out_dict.
# When current_size > shard_bytes:
#   split out_dict into save_dict (fits in the shard) and dont_save_dict (overflow),
#   write save_dict to output_temp_{file_index}.safetensors,
#   continue with dont_save_dict as the new out_dict.
# After all modules are processed, rename the temp files:
#   single file:    output.safetensors
#   multiple files: output-00001-of-NNNNN.safetensors, etc.
```
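The packing policy above can be illustrated with a self-contained sketch that works on plain byte counts instead of real tensors; the greedy splitting rule and the shard_tensors helper are assumptions based on this page, not the exact exllamav2 implementation:

```python
# Greedily pack named tensor sizes (in bytes) into shards of at most
# shard_mb megabytes, mirroring the accumulate/flush loop sketched above.
def shard_tensors(sizes: dict, shard_mb: int) -> list:
    shard_bytes = shard_mb * 1024 ** 2
    shards, current, current_size = [], {}, 0
    for name, size in sizes.items():
        if current and current_size + size > shard_bytes:
            shards.append(current)          # flush the full shard
            current, current_size = {}, 0
        current[name] = size
        current_size += size
    if current:
        shards.append(current)              # final partial shard
    return shards

# Toy tensor sizes; 8 MB shards keep the demo small.
sizes = {"emb": 3 * 1024**2, "q0": 6 * 1024**2,
         "q1": 6 * 1024**2, "head": 2 * 1024**2}
shards = shard_tensors(sizes, shard_mb=8)

# Naming follows the page: one shard -> output.safetensors,
# several -> output-00001-of-0000N.safetensors.
n = len(shards)
names = (["output.safetensors"] if n == 1
         else [f"output-{i + 1:05}-of-{n:05}.safetensors" for i in range(n)])
```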
Usage Examples
Basic Example
```python
from exllamav2.conversion.compile import compile_model

# After quant() has written all per-module safetensors
job["shard_size"] = 8192  # 8 GB per shard
job["compile_full"] = "/output/my_model_exl2_4.125bpw"

compile_model(job, save_fn, model)

# Result: /output/my_model_exl2_4.125bpw/ contains:
#   output.safetensors (or sharded output-NNNNN-of-NNNNN.safetensors)
#   config.json (with quantization_config)
#   tokenizer.json, tokenizer_config.json, etc.
```
Output Config Metadata
```python
import json

with open("/output/my_model_exl2_4.125bpw/config.json") as f:
    config = json.load(f)

print(config["quantization_config"])
# {
#     "quant_method": "exl2",
#     "version": "0.2.6",
#     "bits": 4.125,
#     "head_bits": 6,
#     "calibration": {
#         "rows": 100,
#         "length": 2048,
#         "dataset": "(default)"
#     }
# }
```
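Assembling that metadata block can be sketched as follows; the key names come from the example output above, while the helper name build_quantization_config and the version string are illustrative assumptions:

```python
# Hypothetical helper building the quantization_config section from job
# fields; the real code injects exllamav2.version.__version__ instead of
# this placeholder string.
def build_quantization_config(job, version="0.2.6"):
    return {
        "quant_method": "exl2",
        "version": version,
        "bits": job["bits"],
        "head_bits": job["head_bits"],
        "calibration": {
            "rows": job["dataset_rows"],
            "length": job["length"],
            # Assumed fallback when no calibration dataset was named.
            "dataset": job["cal_dataset"] or "(default)",
        },
    }

job = {"bits": 4.125, "head_bits": 6, "dataset_rows": 100,
       "length": 2048, "cal_dataset": None}
qc = build_quantization_config(job)
```

The resulting dict would then be written into config.json under the quantization_config key when compile_full is set.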
Dependencies
- torch -- tensor operations, inference mode
- safetensors -- reading per-module quantized tensors and writing final sharded output
- json -- reading and updating config.json
- shutil -- copying non-tensor files from the source model directory
- os, glob -- file system operations, listing tensor files
- exllamav2.version.__version__ -- version string injected into quantization metadata
Related Pages
Implements Principle
Requires Environment
- Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
- Environment:Turboderp_org_Exllamav2_Build_Toolchain