
Implementation:Turboderp org Exllamav2 ExLlamaV2Config

From Leeroopedia
Knowledge Sources
Domains Model_Architecture, Configuration, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool for parsing and initializing transformer model configuration from HuggingFace-format model directories, provided by exllamav2.

Description

ExLlamaV2Config is the configuration class that reads a model's config.json and associated files to establish all architecture parameters needed for inference. It handles:

  • Detecting the model architecture (Llama, Mistral, Qwen2, Gemma, Phi, DeepSeek, Cohere, etc.) from the architectures field
  • Parsing hidden dimensions, layer counts, attention head configurations, and vocabulary sizes
  • Reading RoPE (Rotary Position Embedding) settings including base frequency and scaling
  • Scanning and mapping safetensors weight files to expected tensor names
  • Handling EXL2 and GPTQ quantization metadata
  • Applying attention backend compatibility overrides for flash-attn, xformers, and SDPA

The configuration lifecycle follows three steps: construction, prepare() to read and parse all files, and optionally arch_compat_overrides() to adjust attention settings for the current hardware.
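The parsing step can be illustrated with a minimal sketch of the kind of fields prepare() reads from config.json. The field names below follow the HuggingFace Llama convention; the JSON document itself is a made-up example, not output from the library:

```python
import json

# A made-up Llama-style config.json, for illustration only
raw = json.loads("""{
    "architectures": ["LlamaForCausalLM"],
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "vocab_size": 32000,
    "max_position_embeddings": 8192,
    "rope_theta": 500000.0
}""")

# prepare() maps fields like these onto config attributes;
# architecture detection starts from the "architectures" list
architecture = raw["architectures"][0]

# GQA models declare fewer KV heads than query heads; older configs
# omit the field, in which case it defaults to num_attention_heads
num_kv_heads = raw.get("num_key_value_heads", raw["num_attention_heads"])

print(architecture, num_kv_heads)  # prints: LlamaForCausalLM 8
```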

Usage

Use ExLlamaV2Config as the first step in any exllamav2 inference pipeline. Every model, cache, tokenizer, and generator depends on a properly initialized config.
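As a sketch of where the config sits in that pipeline (the model path and generation call are illustrative; the class names follow exllamav2's public API, but check them against your installed version):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# 1. Config first: everything downstream takes it as input
config = ExLlamaV2Config("/path/to/model")  # hypothetical path
config.prepare()
config.arch_compat_overrides()

# 2. Model, cache, and tokenizer are all built from the config
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocated during autosplit load
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# 3. A generator ties the pieces together
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Hello", max_new_tokens=32))
```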

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/config.py
  • Lines: L167-197 (__init__), L210-626 (prepare), L629-676 (arch_compat_overrides)

Signature

class ExLlamaV2Config:

    def __init__(self, model_dir: str | None = None):
        ...

    def prepare(self, no_tensors: bool = False):
        ...

    def arch_compat_overrides(self, quiet: bool = False, warn_only: bool = False):
        ...

Import

from exllamav2 import ExLlamaV2Config

I/O Contract

Inputs

Name Type Required Description
model_dir str or None Yes (at init or before prepare) Path to HuggingFace/EXL2/GPTQ model directory containing config.json and safetensors files
no_tensors bool No (default False) If True, skip scanning tensor files; useful for inspecting config without loading weights
quiet bool No (default False) Suppress compatibility override messages
warn_only bool No (default False) Show warnings instead of raising errors on compatibility issues

Outputs

Name Type Description
config ExLlamaV2Config instance Fully initialized configuration object with the following key attributes:
config.hidden_size int Model hidden dimension (d_model)
config.num_hidden_layers int Number of transformer layers
config.num_attention_heads int Number of attention heads
config.num_key_value_heads int Number of key-value heads (for GQA)
config.vocab_size int Vocabulary size
config.max_seq_len int Maximum sequence length
config.architecture str Detected model architecture name
config.tensor_file_map dict Mapping of tensor names to file paths
config.rope_theta float RoPE base frequency
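Some useful quantities are not stored directly but follow from these attributes. A small sketch using illustrative Llama-7B-like values (hard-coded here, not read from a real model):

```python
# Illustrative values, as if taken from a prepared config
hidden_size = 4096
num_attention_heads = 32
num_key_value_heads = 8

# Per-head dimension: hidden size split evenly across query heads
head_dim = hidden_size // num_attention_heads

# GQA group size: how many query heads share each key-value head
gqa_group = num_attention_heads // num_key_value_heads

print(head_dim, gqa_group)  # prints: 128 4
```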

Usage Examples

Basic Configuration

from exllamav2 import ExLlamaV2Config

# Initialize and prepare config
config = ExLlamaV2Config("/path/to/model")
config.prepare()

# Access architecture parameters
print(f"Architecture: {config.architecture}")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")
print(f"Vocab size: {config.vocab_size}")

Configuration with Compatibility Overrides

from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config("/path/to/model")
config.prepare()

# Apply attention backend overrides for current hardware
config.arch_compat_overrides(quiet=True)

# Optionally override max sequence length
config.max_seq_len = 4096

Inspect Config Without Loading Tensors

from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config("/path/to/model")
config.prepare(no_tensors=True)

print(f"Model has {config.num_hidden_layers} layers")
print(f"Head dim: {config.hidden_size // config.num_attention_heads}")

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
