
Implementation:Turboderp org Exllamav2 ExLlamaV2Config

From Leeroopedia
Knowledge Sources
Domains Model_Architecture, Configuration, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool for parsing and initializing transformer model configuration from HuggingFace-format model directories, provided by exllamav2.

Description

ExLlamaV2Config is the configuration class that reads a model's config.json and associated files to establish all architecture parameters needed for inference. It handles:

  • Detecting the model architecture (Llama, Mistral, Qwen2, Gemma, Phi, DeepSeek, Cohere, etc.) from the architectures field
  • Parsing hidden dimensions, layer counts, attention head configurations, and vocabulary sizes
  • Reading RoPE (Rotary Position Embedding) settings including base frequency and scaling
  • Scanning and mapping safetensors weight files to expected tensor names
  • Handling EXL2 and GPTQ quantization metadata
  • Applying attention backend compatibility overrides for flash-attn, xformers, and SDPA

The configuration lifecycle follows three steps: construction, prepare() to read and parse all files, and optionally arch_compat_overrides() to adjust attention settings for the current hardware.
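The parsing step can be illustrated with a minimal sketch of the kind of fields prepare() reads from config.json. The field names below follow the HuggingFace Llama convention; the JSON document itself is a made-up example, not output from the library:

```python
import json

# A made-up Llama-style config.json, for illustration only
raw = json.loads("""{
    "architectures": ["LlamaForCausalLM"],
    "hidden_size": 4096,
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "vocab_size": 32000,
    "max_position_embeddings": 8192,
    "rope_theta": 500000.0
}""")

# prepare() maps fields like these onto config attributes;
# architecture detection starts from the "architectures" list
architecture = raw["architectures"][0]

# GQA models declare fewer KV heads than query heads; older configs
# omit the field, in which case it defaults to num_attention_heads
num_kv_heads = raw.get("num_key_value_heads", raw["num_attention_heads"])

print(architecture, num_kv_heads)  # prints: LlamaForCausalLM 8
```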

Usage

Use ExLlamaV2Config as the first step in any exllamav2 inference pipeline. Every model, cache, tokenizer, and generator depends on a properly initialized config.
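As a sketch of where the config sits in that pipeline (the model path and generation call are illustrative; the class names follow exllamav2's public API, but check them against your installed version):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# 1. Config first: everything downstream takes it as input
config = ExLlamaV2Config("/path/to/model")  # hypothetical path
config.prepare()
config.arch_compat_overrides()

# 2. Model, cache, and tokenizer are all built from the config
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # allocated during autosplit load
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# 3. A generator ties the pieces together
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Hello", max_new_tokens=32))
```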

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/config.py
  • Lines: L167-197 (__init__), L210-626 (prepare), L629-676 (arch_compat_overrides)

Signature

class ExLlamaV2Config:

    def __init__(self, model_dir: str | None = None):
        ...

    def prepare(self, no_tensors: bool = False):
        ...

    def arch_compat_overrides(self, quiet: bool = False, warn_only: bool = False):
        ...

Import

from exllamav2 import ExLlamaV2Config

I/O Contract

Inputs

Name Type Required Description
model_dir str or None Yes (at init or before prepare) Path to HuggingFace/EXL2/GPTQ model directory containing config.json and safetensors files
no_tensors bool No (default False) If True, skip scanning tensor files; useful for inspecting config without loading weights
quiet bool No (default False) Suppress compatibility override messages
warn_only bool No (default False) Show warnings instead of raising errors on compatibility issues

Outputs

Name Type Description
config ExLlamaV2Config instance Fully initialized configuration object with the following key attributes:
config.hidden_size int Model hidden dimension (d_model)
config.num_hidden_layers int Number of transformer layers
config.num_attention_heads int Number of attention heads
config.num_key_value_heads int Number of key-value heads (for GQA)
config.vocab_size int Vocabulary size
config.max_seq_len int Maximum sequence length
config.architecture str Detected model architecture name
config.tensor_file_map dict Mapping of tensor names to file paths
config.rope_theta float RoPE base frequency
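Some useful quantities are not stored directly but follow from these attributes. A small sketch using illustrative Llama-7B-like values (hard-coded here, not read from a real model):

```python
# Illustrative values, as if taken from a prepared config
hidden_size = 4096
num_attention_heads = 32
num_key_value_heads = 8

# Per-head dimension: hidden size split evenly across query heads
head_dim = hidden_size // num_attention_heads

# GQA group size: how many query heads share each key-value head
gqa_group = num_attention_heads // num_key_value_heads

print(head_dim, gqa_group)  # prints: 128 4
```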

Usage Examples

Basic Configuration

from exllamav2 import ExLlamaV2Config

# Initialize and prepare config
config = ExLlamaV2Config("/path/to/model")
config.prepare()

# Access architecture parameters
print(f"Architecture: {config.architecture}")
print(f"Hidden size: {config.hidden_size}")
print(f"Layers: {config.num_hidden_layers}")
print(f"Vocab size: {config.vocab_size}")

Configuration with Compatibility Overrides

from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config("/path/to/model")
config.prepare()

# Apply attention backend overrides for current hardware
config.arch_compat_overrides(quiet=True)

# Optionally override max sequence length
config.max_seq_len = 4096

Inspect Config Without Loading Tensors

from exllamav2 import ExLlamaV2Config

config = ExLlamaV2Config("/path/to/model")
config.prepare(no_tensors=True)

print(f"Model has {config.num_hidden_layers} layers")
print(f"Head dim: {config.hidden_size // config.num_attention_heads}")

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
