Workflow:Turboderp org Exllamav2 EXL2 Model Conversion

From Leeroopedia
Knowledge Sources
Domains: LLMs, Quantization, Model_Optimization
Last Updated: 2026-02-15 00:00 GMT

Overview

End-to-end process for converting HuggingFace FP16 language models into the EXL2 mixed-bitwidth quantized format for efficient inference on consumer GPUs.

Description

This workflow covers the complete EXL2 quantization pipeline, which converts large language models from their original FP16 weights into a compact, mixed-precision format. EXL2 extends GPTQ-style quantization with adaptive per-layer bit allocation (2 to 8 bits per weight): simulated annealing selects the per-layer settings that minimize the maximum quantization error across the model while meeting a target average bitrate. The process consists of calibration data tokenization, per-layer sensitivity measurement, bit allocation optimization, quantization, and final compilation into sharded safetensors files. The pipeline can resume after an interruption because its state machine persists checkpoints to disk.
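
The resumption behaviour described above can be illustrated with a minimal sketch of a disk-persisted state machine. The phase names, the job file layout, and the function names are all hypothetical and chosen for illustration; they are not taken from the ExLlamaV2 source.

```python
import json
import os

# Hypothetical phase names, following the order of steps in this workflow.
PHASES = ["tokenize", "measure", "optimize", "quantize", "compile", "done"]


def load_job(path):
    """Return the saved job state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"phase": PHASES[0]}


def save_job(path, job):
    """Write to a temporary file, then rename, so a crash cannot leave a partial checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(job, f)
    os.replace(tmp, path)


def run(path, handlers):
    """Run the remaining phases in order, checkpointing after each one completes."""
    job = load_job(path)
    while job["phase"] != "done":
        phase = job["phase"]
        handlers[phase](job)  # May raise; the last saved checkpoint stays valid.
        job["phase"] = PHASES[PHASES.index(phase) + 1]
        save_job(path, job)
    return job
```

If a handler fails partway through, calling `run` again with the same job file skips every phase that already completed and restarts at the interrupted one.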

Usage

Execute this workflow when you have a HuggingFace FP16 language model (with config.json, tokenizer files, and .safetensors weight files) and need to produce a quantized version for fast inference with ExLlamaV2. A typical use case is compressing models of 7B to 70B+ parameters to fit on consumer GPUs with limited VRAM (8-24 GB). The conversion itself needs approximately 8 GB VRAM and 16 GB RAM for 7B models, scaling to 24 GB VRAM and 64 GB RAM for 70B models.
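
As a rough illustration of how a conversion run might be launched, the sketch below assembles a command line for the ExLlamaV2 conversion script. The script name `convert.py` and the flags `-i`, `-o`, `-b`, `-cf`, and `-hb` are assumptions based on the ExLlamaV2 conversion documentation and should be confirmed with `python convert.py -h`; the directory paths are placeholders.

```python
import shlex


def build_convert_command(model_dir, work_dir, bits, output_dir=None, head_bits=None):
    """Assemble an argv list for the conversion script.

    Flag names are assumptions; verify them against the script's own help output.
    """
    if not 2.0 <= bits <= 8.0:
        raise ValueError("target bitrate should be between 2.0 and 8.0 bits per weight")
    cmd = ["python", "convert.py", "-i", model_dir, "-o", work_dir, "-b", str(bits)]
    if output_dir is not None:
        cmd += ["-cf", output_dir]  # Assumed flag for the complete output directory.
    if head_bits is not None:
        cmd += ["-hb", str(head_bits)]  # Assumed flag for the head layer bitrate.
    return cmd


# Placeholder paths for illustration only.
cmd = build_convert_command(
    "/models/my-model-fp16",
    "/tmp/exl2-work",
    4.0,
    output_dir="/models/my-model-exl2",
)
print(shlex.join(cmd))
```

Building the argument list in code rather than by string concatenation keeps paths with spaces intact when the list is passed to `subprocess.run`.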

Execution Steps

Step 1: Environment_Setup

Ensure the ExLlamaV2 package is installed with its CUDA extensions compiled. Verify that the CUDA Toolkit and PyTorch are available. Confirm that the source model directory contains valid HuggingFace model files (config.json, tokenizer files, and one or more .safetensors weight files). Prepare an empty working directory for intermediate conversion artifacts and optionally a separate output directory for the final compiled model.

Key considerations:

  • The CUDA extension must be compiled for the target GPU architecture
  • Sufficient system RAM and VRAM must be available (requirements scale with model width, not depth)
  • Sharded input models are supported automatically
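
The directory checks in this step can be sketched as a small pre-flight function. The function name is hypothetical, and the two tokenizer file names it looks for are common examples only, since tokenizer file names vary by model.

```python
import json
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of problems found in a HuggingFace-style model directory."""
    root = Path(model_dir)
    problems = []
    config = root / "config.json"
    if not config.is_file():
        problems.append("missing config.json")
    else:
        try:
            json.loads(config.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("config.json is not valid JSON")
    # Assumption: tokenizer.json and tokenizer.model are typical names, not an exhaustive list.
    if not any((root / name).is_file() for name in ("tokenizer.json", "tokenizer.model")):
        problems.append("no tokenizer.json or tokenizer.model found")
    # Sharded models simply contribute several matching files.
    if not list(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")
    return problems
```

An empty return value means the directory passed every check; anything else is worth fixing before starting a conversion that may run for hours.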

Step 2: Calibration_Tokenization

Tokenize calibration data into fixed-length sequences for use during measurement and quantization. If no custom calibration dataset (in Parquet format) is provided, the built-in default dataset is used; it covers a broad mix of data types to avoid overfitting to any particular domain. Two separate tokenization passes are performed: one for the measurement phase (fewer rows) and one for the quantization phase (more rows for better accuracy). Row count and sequence length are configurable for both passes.

Key considerations:

  • Default calibration dataset provides robust, general-purpose coverage
  • Measurement uses 16 rows at 2048 tokens by default
  • Quantization uses 100 rows at 2048 tokens by default
  • Custom Parquet datasets can be substituted for domain-specific quantization
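
The fixed-length splitting can be sketched as follows, using the default row counts and sequence length listed above. The function name is hypothetical, and a list of integers stands in for real tokenizer output.

```python
def make_rows(token_ids, rows, length):
    """Split a flat token stream into up to `rows` sequences of exactly `length` tokens."""
    if rows <= 0 or length <= 0:
        raise ValueError("rows and length must be positive")
    out = []
    # Step through the stream in non-overlapping windows, dropping any short tail.
    for start in range(0, len(token_ids) - length + 1, length):
        out.append(token_ids[start:start + length])
        if len(out) == rows:
            break
    return out


# Stand-in for tokenizer output; a real run would tokenize the calibration text.
stream = list(range(250_000))
measurement_rows = make_rows(stream, rows=16, length=2048)
quantization_rows = make_rows(stream, rows=100, length=2048)
print(len(measurement_rows), len(quantization_rows))  # prints "16 100"
```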

Step 3: Sensitivity_Measurement

Measure the quantization sensitivity of each layer by quantizing every linear layer several times at different bit widths and recording the resulting error on the calibration data. This pass effectively quantizes the entire model approximately 12 times over, using only a subset of the calibration data. The output is a measurement.json file that maps each layer to its error profile across the candidate quantization settings.

Key considerations:

  • This is the slowest step in the pipeline; its output can be saved and reused across multiple quantizations of the same model
  • The measurement can be exported separately using the output_measurement flag
  • Supports graceful interruption and resumption via checkpoint state
  • Calibration noise rows are added for architectures that require them
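
To make the idea of a per-layer error profile concrete, here is a simplified sketch. It substitutes plain round-to-nearest quantization of the weights for the GPTQ-style procedure and measures weight error rather than error on calibration data, so it shows only the shape of the measurement, not the actual method. Layer names and weight values are synthetic.

```python
import math
import random


def quantize_rtn(values, bits):
    """Uniform round-to-nearest quantization onto 2**bits levels spanning [min, max]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def relative_error(original, quantized):
    """Relative L2 error between the original and quantized values."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(original, quantized)))
    den = math.sqrt(sum(a * a for a in original))
    return num / den if den else 0.0


def measure_layer(weights, bit_options=(2, 3, 4, 5, 6, 8)):
    """Map each candidate bit width to the error it causes for this layer."""
    return {bits: relative_error(weights, quantize_rtn(weights, bits)) for bits in bit_options}


random.seed(0)
# Synthetic stand-ins for two layers' weights.
layers = {
    "layer0.q_proj": [random.gauss(0.0, 1.0) for _ in range(4096)],
    "layer0.down_proj": [random.gauss(0.0, 0.2) for _ in range(4096)],
}
measurement = {name: measure_layer(w) for name, w in layers.items()}
```

The resulting dictionary has the same general structure as the measurement described above: one entry per layer, each mapping a candidate setting to an error value.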

Step 4: Bit_Allocation_Optimization

Using the measurement data, solve for the per-layer quantization parameters that minimize the maximum quantization error across all layers while achieving the target average bitrate. The optimizer distributes bits unevenly across layers according to their measured sensitivity, allocating more bits to sensitive layers and fewer to robust ones.

Key considerations:

  • Target bitrate can range from 2.0 to 8.0 bits per weight
  • The head (output) layer has a separate configurable bitrate (default 6 bits)
  • The solving step may appear to hang while the optimizer runs; this is expected
  • Within a single layer, columns can be quantized at different bit widths (sparse-like mixed precision)
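
The objective in this step (lowest possible worst-case error within an average bit budget) can be illustrated with a deliberately simple greedy heuristic. This is a stand-in for illustration, not the optimizer the pipeline uses, and the error profiles and layer sizes below are made-up numbers.

```python
def allocate_bits(profiles, sizes, target_bpw):
    """Greedy stand-in: repeatedly give more bits to the layer with the worst error.

    profiles: {layer: {bits: error}}, sizes: {layer: number of weights}.
    Returns the chosen bits per layer and the achieved average bits per weight.
    """
    options = {name: sorted(p) for name, p in profiles.items()}
    choice = {name: opts[0] for name, opts in options.items()}
    total_weights = sum(sizes.values())
    budget = target_bpw * total_weights
    used = sum(choice[n] * sizes[n] for n in choice)
    if used > budget:
        raise ValueError("target bitrate is below the smallest available setting")
    while True:
        # Worst layer first, because the objective is the maximum error.
        worst = max(choice, key=lambda n: profiles[n][choice[n]])
        idx = options[worst].index(choice[worst])
        if idx + 1 == len(options[worst]):
            break  # The worst layer is already at its highest setting.
        next_bits = options[worst][idx + 1]
        extra = (next_bits - choice[worst]) * sizes[worst]
        if used + extra > budget:
            break  # Upgrading the worst layer would exceed the bit budget.
        choice[worst] = next_bits
        used += extra
    return choice, used / total_weights


# Made-up error profiles: "attn" is more sensitive than "mlp" at every bit width.
profiles = {
    "attn": {2: 0.30, 4: 0.08, 6: 0.02, 8: 0.005},
    "mlp": {2: 0.15, 4: 0.04, 6: 0.01, 8: 0.002},
}
sizes = {"attn": 1000, "mlp": 3000}
choice, achieved = allocate_bits(profiles, sizes, target_bpw=4.5)
print(choice, achieved)  # prints "{'attn': 6, 'mlp': 4} 4.5"
```

In this toy case the smaller, more sensitive layer ends up with 6 bits and the larger, more robust one with 4, which is the uneven distribution the step describes.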

Step 5: Quantization

Apply the optimized quantization parameters to each layer of the model. This pass loads the original FP16 weights, applies GPTQ-style quantization with the bit widths selected for each layer, and writes the quantized tensors to intermediate output files. The process uses the full quantization calibration set (100 rows by default), which gives better accuracy than the smaller measurement set.

Key considerations:

  • Uses Adaptive GPTQ with act-order for column reordering
  • Quantized weights are written as intermediate tensors to the working directory
  • A calibration perplexity check is performed to validate quantization quality
  • Perplexity above 30 suggests poor quantization; above 1000 indicates failure
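
The perplexity thresholds above can be turned into a small check. The sketch uses the standard definition of perplexity as the exponential of the mean negative log-likelihood per token; the function names and the labels returned are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def judge(ppl):
    """Apply the rule-of-thumb thresholds from this step."""
    if ppl > 1000:
        return "failed"
    if ppl > 30:
        return "poor"
    return "ok"
```

For example, a model that assigns probability 0.5 to every token has a perplexity of 2, which `judge` reports as "ok".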

Step 6: Compilation

Assemble the quantized layer tensors into final sharded .safetensors files. If a compile-full directory is specified, all non-weight files from the original model (config, tokenizer, etc.) are copied alongside the quantized weights to produce a complete, self-contained model directory ready for inference.

Key considerations:

  • Default shard size is 8192 MB; set to 0 for a single output file
  • Very large single files require significant system RAM during writing
  • The output directory can be used directly with ExLlamaV2 for inference
  • Original model metadata and tokenizer files are preserved
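
How a shard size limit partitions the output can be sketched as below. The function is hypothetical, the tensor sizes are invented, and the sketch assumes that 1 MB means 1024 * 1024 bytes; the actual compiler may order or count tensors differently.

```python
def plan_shards(tensor_sizes, shard_size_mb=8192):
    """Group (name, size_in_bytes) pairs into shards, preserving their order.

    A shard_size_mb of 0 means one shard holding everything.
    """
    if shard_size_mb == 0:
        return [[name for name, _ in tensor_sizes]]
    limit = shard_size_mb * 1024 * 1024  # Assumption: MB is treated as 2**20 bytes.
    shards, current, current_bytes = [], [], 0
    for name, size in tensor_sizes:
        # Start a new shard when adding this tensor would exceed the limit.
        if current and current_bytes + size > limit:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        shards.append(current)
    return shards


MB = 1024 * 1024
# Invented tensor sizes for illustration.
tensors = [("a", 600 * MB), ("b", 600 * MB), ("c", 300 * MB)]
print(plan_shards(tensors, shard_size_mb=1024))  # prints "[['a'], ['b', 'c']]"
print(plan_shards(tensors, shard_size_mb=0))     # prints "[['a', 'b', 'c']]"
```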
