
Implementation:Turboderp org Exllamav2 Compile Model

From Leeroopedia
Knowledge Sources
Domains Quantization, Model_Serialization
Last Updated 2026-02-15 00:00 GMT

Overview

A concrete tool, provided by exllamav2, for assembling individually quantized layer tensors and non-quantized parameters into a final distributable EXL2 model.

Description

The compile_model function iterates through all model modules in order, loading either the quantized tensor file (for linear layers) or the float-precision weight (for embeddings, norms, routers). It accumulates tensors into an output buffer and writes sharded safetensors files whenever the buffer exceeds the configured shard size. After all model tensors are written, it optionally copies multimodal/vision tensors, copies non-tensor files from the source model directory, and updates config.json with quantization metadata.

Usage

Call compile_model as the final step of the EXL2 conversion pipeline, after all layers have been quantized by quant.

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/conversion/compile.py
  • Lines: L61-295

Supporting Functions

def get_f_module(job, module):
    """Load a non-quantized module's float weight(s) as a dict."""

def get_q_module(job, module):
    """Load a quantized module's packed tensors from its safetensors file."""

Signature

@torch.inference_mode()
def compile_model(job, save_fn, model):

Import

from exllamav2.conversion.compile import compile_model

I/O Contract

Inputs

Name Type Required Description
job dict Yes Conversion job state. Key fields: job["out_dir"] (working directory containing out_tensor/ subdirectory with per-module safetensors), job["shard_size"] (max shard size in MB, default 8192), job["compile_full"] (output directory path for full model compilation, or None), job["cal_dataset"] (calibration dataset name for metadata), job["bits"], job["head_bits"], job["dataset_rows"], job["length"]
save_fn callable Yes Callback to persist job state
model ExLlamaV2 Yes The loaded model instance (used to iterate modules and retrieve architecture config)
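The job dict fields in the table can be illustrated with a minimal example. The values below are illustrative only, not defaults verified against the source; save_fn is whatever callback the conversion pipeline uses to persist state.

```python
# Illustrative job dict covering the fields listed in the inputs table.
# Values are examples, not verified defaults.
job = {
    "out_dir": "/tmp/exl2_work",           # working dir containing out_tensor/
    "shard_size": 8192,                    # max shard size in MB
    "compile_full": "/output/model_exl2",  # output dir, or None to use out_dir
    "cal_dataset": None,                   # calibration dataset name for metadata
    "bits": 4.125,
    "head_bits": 6,
    "dataset_rows": 100,
    "length": 2048,
}

def save_fn():
    # Callback to persist job state; a real pipeline would serialize `job`.
    pass
```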

Outputs

Name Type Description
output.safetensors File Single output file if total size fits in one shard; saved to job["compile_full"] or job["out_dir"]
output-NNNNN-of-NNNNN.safetensors Files Sharded output files if total size exceeds shard_size
config.json File (updated) Model config with added quantization_config section (only when compile_full is set)
Non-tensor files Files (copied) Tokenizer files, generation config, etc. copied from source model (only when compile_full is set)

Module Processing Order

The function processes modules in the model's native order, handling each type differently:

Module Type Treatment Source
ExLlamaV2Embedding Float weight get_f_module
ExLlamaV2PosEmbedding Float weight get_f_module
ExLlamaV2Attention Float norms + quantized Q/K/V/O projections + optional Q/K norms get_f_module + get_q_module
ExLlamaV2MLP Float norms + quantized gate/up/down projections get_f_module + get_q_module
ExLlamaV2MoEMLP Float norm + float router + quantized w1/w3/w2 per expert get_f_module + get_q_module
ExLlamaV2ParallelDecoder Float norm + quantized attn projections + quantized MLP projections get_f_module + get_q_module
ExLlamaV2RMSNorm / ExLlamaV2LayerNorm Float weight get_f_module
ExLlamaV2Linear (lm_head) Quantized weight get_q_module
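The per-type handling in the table amounts to a dispatch on module class. The sketch below illustrates the shape of that dispatch with placeholder classes standing in for the real ExLlamaV2 module types; it is not the actual compile.py code.

```python
# Placeholder stand-ins for the exllamav2 module classes named in the table.
class Embedding: pass
class RMSNorm: pass
class Attention: pass
class Linear: pass

def loaders_for(module):
    """Return which loader(s) a module type needs: 'f' for float weights
    (get_f_module), 'q' for quantized tensors (get_q_module), 'f+q' for
    modules that mix float norms with quantized projections."""
    if isinstance(module, (Embedding, RMSNorm)):
        return "f"    # embeddings and norms stay in float precision
    if isinstance(module, Attention):
        return "f+q"  # float norms plus quantized Q/K/V/O projections
    if isinstance(module, Linear):
        return "q"    # e.g. lm_head, stored fully quantized
    raise TypeError(f"Unhandled module type: {type(module).__name__}")
```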

Sharding Algorithm

shard_bytes = job["shard_size"] * 1024 ** 2  # Convert MB to bytes

# Accumulate tensors into out_dict
# When current_size > shard_bytes:
#   Split out_dict into save_dict (fits in shard) and dont_save_dict (overflow)
#   Write save_dict to output_temp_{file_index}.safetensors
#   Continue with dont_save_dict as the new out_dict

# After all modules processed, rename temp files:
#   Single file: output.safetensors
#   Multiple files: output-00001-of-NNNNN.safetensors, etc.
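The accumulate-and-flush logic above can be sketched as a pure function over (name, size) pairs. This is a simplified variant: it flushes the buffer before an overflowing tensor rather than splitting the dict after it, and it returns shard contents instead of writing safetensors files.

```python
def shard_tensors(tensors, shard_mb):
    """Simplified sketch of the sharding loop: `tensors` is an ordered list
    of (name, nbytes) pairs; returns a list of shards, each a list of tensor
    names whose combined size stays within the shard limit."""
    shard_bytes = shard_mb * 1024 ** 2   # convert MB to bytes
    shards, current, current_size = [], [], 0
    for name, nbytes in tensors:
        if current and current_size + nbytes > shard_bytes:
            shards.append(current)       # flush: analogous to output_temp_N
            current, current_size = [], 0
        current.append(name)
        current_size += nbytes
    if current:
        shards.append(current)           # final (possibly only) shard
    return shards
```

With a single resulting shard the real function names the file output.safetensors; with several, output-00001-of-NNNNN.safetensors and so on.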

Usage Examples

Basic Example

from exllamav2.conversion.compile import compile_model

# After quant() has written all per-module safetensors
job["shard_size"] = 8192          # 8 GB per shard
job["compile_full"] = "/output/my_model_exl2_4.125bpw"

compile_model(job, save_fn, model)

# Result: /output/my_model_exl2_4.125bpw/ contains:
#   output.safetensors (or sharded output-NNNNN-of-NNNNN.safetensors)
#   config.json (with quantization_config)
#   tokenizer.json, tokenizer_config.json, etc.

Output Config Metadata

import json

with open("/output/my_model_exl2_4.125bpw/config.json") as f:
    config = json.load(f)

print(config["quantization_config"])
# {
#     "quant_method": "exl2",
#     "version": "0.2.6",
#     "bits": 4.125,
#     "head_bits": 6,
#     "calibration": {
#         "rows": 100,
#         "length": 2048,
#         "dataset": "(default)"
#     }
# }
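The metadata block shown above is assembled from job fields plus the package version. The helper below is a hedged sketch of that update step: the field names mirror the example output, but the merge logic and the "(default)" fallback for an unset dataset are assumptions.

```python
import json

def update_config(config_path, job, version):
    """Sketch: inject a quantization_config section into config.json,
    mirroring the fields shown in the example output above."""
    with open(config_path) as f:
        config = json.load(f)
    config["quantization_config"] = {
        "quant_method": "exl2",
        "version": version,               # e.g. exllamav2.version.__version__
        "bits": job["bits"],
        "head_bits": job["head_bits"],
        "calibration": {
            "rows": job["dataset_rows"],
            "length": job["length"],
            "dataset": job["cal_dataset"] or "(default)",
        },
    }
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
```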

Dependencies

  • torch -- tensor operations, inference mode
  • safetensors -- reading per-module quantized tensors and writing final sharded output
  • json -- reading and updating config.json
  • shutil -- copying non-tensor files from the source model directory
  • os, glob -- file system operations, listing tensor files
  • exllamav2.version.__version__ -- version string injected into quantization metadata

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
