Implementation:Turboderp org Exllamav2 ExLlamaV2DynamicGenerator Init

From Leeroopedia
Knowledge Sources
Domains: Inference_Optimization, Concurrent_Batching, Deep_Learning
Last Updated: 2026-02-15 00:00 GMT

Overview

Concrete tool for initializing a paged-attention, dynamic-batching generator for concurrent LLM inference, provided by the exllamav2 library.

Description

ExLlamaV2DynamicGenerator is the primary high-performance generator in exllamav2. Its constructor sets up:

  • Page table management: Divides the KV cache into fixed-size pages and creates a page allocation system for virtual memory-style cache management.
  • Job queue: Initializes the queue for managing concurrent generation requests with dynamic scheduling.
  • Batch size calculation: If max_batch_size is not specified, it is automatically calculated based on available cache pages and model configuration.
  • Speculative decoding: Optionally configures a draft model and draft cache for model-based speculative decoding, or enables n-gram-based speculation instead.
  • Sampling thread pool: Creates a thread pool for parallel token sampling across multiple jobs.
  • Filter background evaluation: Pre-evaluates constrained decoding filters during idle GPU cycles.

After initialization, the generator is ready to accept generation requests via generate() (a simple, blocking API) or via the job-based API for fine-grained control over individual requests.
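The automatic batch-size calculation described above can be illustrated with a small, self-contained sketch: the cache is divided into fixed-size pages, and the batch size is however many worst-case sequences fit. The 256-token page size and the exact rounding here are illustrative assumptions, not the library's verbatim logic.

```python
PAGE_SIZE = 256  # assumed page granularity for this sketch


def auto_max_batch_size(cache_seq_len: int, max_seq_len: int) -> int:
    """Estimate how many sequences fit if each may grow to max_seq_len."""
    total_pages = cache_seq_len // PAGE_SIZE        # pages the cache holds
    pages_per_seq = -(-max_seq_len // PAGE_SIZE)    # ceil: pages one sequence may need
    return max(total_pages // pages_per_seq, 1)


# A 65536-token cache with 4096-token sequences leaves room for 16 concurrent jobs
print(auto_max_batch_size(65536, 4096))  # → 16
```

In the real generator, a larger cache (relative to max_seq_len) therefore directly translates into a larger automatic max_batch_size.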

Usage

Initialize ExLlamaV2DynamicGenerator after loading the model and allocating the cache. This is the recommended generator for:

  • Server-side inference with multiple concurrent requests
  • Applications that benefit from paged attention and prefix caching
  • Speculative decoding deployments
  • Any scenario requiring high throughput
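The prefix-caching benefit mentioned above comes from page-level sharing: two requests with a common prompt prefix can reference the same cache pages instead of recomputing them. The toy page table below sketches the idea (deduplicating full pages keyed by their prefix); it is a conceptual illustration, not the library's implementation, and uses a tiny page size so the example stays readable.

```python
PAGE_SIZE = 4  # tiny page size so the example stays readable


class PageTable:
    """Toy page table: deduplicates full pages of identical prefix tokens."""

    def __init__(self):
        self.pages = {}  # prefix key -> page id

    def map_sequence(self, tokens):
        page_ids, new_pages = [], 0
        for i in range(0, len(tokens) - len(tokens) % PAGE_SIZE, PAGE_SIZE):
            # Key on the whole prefix so identical pages at different
            # positions are never wrongly shared.
            key = tuple(tokens[:i + PAGE_SIZE])
            if key not in self.pages:
                self.pages[key] = len(self.pages)
                new_pages += 1
            page_ids.append(self.pages[key])
        return page_ids, new_pages


table = PageTable()
_, fresh = table.map_sequence([1, 2, 3, 4, 5, 6, 7, 8])    # allocates 2 new pages
_, reused = table.map_sequence([1, 2, 3, 4, 5, 6, 9, 10])  # first page is shared
print(fresh, reused)  # → 2 1
```

The second request only allocates one new page because its first page of tokens matches an existing prefix, which is why repeated system prompts are nearly free under prefix caching.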

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/generator/dynamic.py
  • Lines: L241-481

Signature

class ExLlamaV2DynamicGenerator:

    def __init__(
        self,
        model: ExLlamaV2,
        cache: ExLlamaV2CacheBase,
        tokenizer: ExLlamaV2Tokenizer,
        max_batch_size: int | None = None,
        max_seq_len: int | None = None,
        max_chunk_size: int | None = None,
        max_q_size: int = 8,
        draft_model: ExLlamaV2 | None = None,
        draft_cache: ExLlamaV2CacheBase | None = None,
        num_draft_tokens: int = 4,
        use_ngram_draft: bool = False,
        max_ngram: int = 4,
        max_sampling_threads: int = 16,
        min_sampling_threads: int = 3,
        paged: bool = True,
        filter_background_eval: bool = True,
        **kwargs,
    ):
        ...

Import

from exllamav2.generator import ExLlamaV2DynamicGenerator

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | Loaded model instance with weights on GPU(s) |
| cache | ExLlamaV2CacheBase | Yes | Allocated KV cache (FP16, Q4, Q6, or Q8) |
| tokenizer | ExLlamaV2Tokenizer | Yes | Initialized tokenizer for encoding/decoding |
| max_batch_size | int or None | No (default None) | Maximum concurrent sequences; None auto-calculates from cache capacity |
| max_seq_len | int or None | No (default None) | Maximum sequence length; None uses the cache's max_seq_len |
| max_chunk_size | int or None | No (default None) | Maximum tokens per prefill chunk; None falls back to the model's maximum input length |
| max_q_size | int | No (default 8) | Maximum tokens evaluated per sequence in a single forward pass (relevant for speculative decoding) |
| draft_model | ExLlamaV2 or None | No (default None) | Smaller model for speculative decoding |
| draft_cache | ExLlamaV2CacheBase or None | No (default None) | Cache for the draft model |
| num_draft_tokens | int | No (default 4) | Number of tokens to speculate per step |
| use_ngram_draft | bool | No (default False) | Use n-gram-based speculation instead of a draft model |
| max_ngram | int | No (default 4) | Maximum n-gram size for n-gram speculation |
| max_sampling_threads | int | No (default 16) | Maximum number of concurrent sampling threads |
| min_sampling_threads | int | No (default 3) | Minimum number of concurrent jobs before sampling is multithreaded |
| paged | bool | No (default True) | Enable paged attention (recommended) |
| filter_background_eval | bool | No (default True) | Pre-evaluate constrained decoding filters during idle cycles |

Outputs

| Name | Type | Description |
|---|---|---|
| (generator instance) | ExLlamaV2DynamicGenerator | Fully initialized generator, ready for generate() or the job-based API |
| generator.num_pages | int | Total number of cache pages available |
| generator.max_batch_size | int | Effective maximum batch size |

Usage Examples

Basic Initialization

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/model")
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    tokenizer=tokenizer,
)

With Speculative Decoding

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Load main model
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

# Load draft model
draft_config = ExLlamaV2Config("/path/to/draft_model")
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, lazy=True)
draft_model.load_autosplit(draft_cache)

tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    tokenizer=tokenizer,
    draft_model=draft_model,
    draft_cache=draft_cache,
    num_draft_tokens=5,
)
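The mechanics behind this configuration can be sketched without any real models: the draft model proposes num_draft_tokens tokens cheaply, and the main model verifies them, accepting the longest agreeing prefix plus its own correction. The greedy-acceptance sketch below uses stand-in callables that return tokens directly (real models return logits), and it verifies tokens one at a time, whereas the real generator scores all draft positions in a single batched forward pass.

```python
def speculative_step(main_next, draft_next, context, k):
    """One speculative step with greedy acceptance.

    main_next / draft_next: callables mapping a context to the next token
    (stand-ins for full model forwards). Returns the tokens accepted this step.
    """
    # Draft proposes k tokens autoregressively (cheap model, k calls)
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # Main model verifies: accept while it agrees, then emit its own correction
    accepted, ctx = [], list(context)
    for t in proposal:
        m = main_next(ctx)
        if m != t:
            accepted.append(m)  # main model's token replaces the first miss
            break
        accepted.append(t)
        ctx.append(t)
    return accepted


# Draft counts up by 1; the main model also counts up but caps at 12
draft = lambda ctx: ctx[-1] + 1
main = lambda ctx: min(ctx[-1] + 1, 12)
print(speculative_step(main, draft, [10], k=4))  # → [11, 12, 12]
```

When the two models agree often, most steps emit several tokens for roughly one main-model forward, which is the source of the speedup; when they diverge, throughput degrades toward ordinary decoding.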

With N-gram Speculation

generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    tokenizer=tokenizer,
    use_ngram_draft=True,
    max_ngram=4,
)
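N-gram speculation needs no second model: the draft tokens come from matching the most recent suffix of the context against earlier occurrences in the same context and proposing the tokens that followed. The sketch below illustrates the idea; the matching strategy is a plausible simplification, not the library's exact matcher.

```python
def ngram_propose(context, max_ngram=4, num_draft=4):
    """Propose draft tokens by matching the context's suffix against its own past."""
    for n in range(max_ngram, 0, -1):  # prefer the longest matching n-gram
        suffix = context[-n:]
        # Scan right-to-left for the most recent earlier occurrence of the suffix
        for i in range(len(context) - n - 1, -1, -1):
            if context[i:i + n] == suffix:
                return context[i + n:i + n + num_draft]
    return []  # nothing to propose; fall back to ordinary decoding


# After "the cat sat on the", the earlier "the" was followed by "cat sat on"
ctx = "the cat sat on the".split()
print(ngram_propose(ctx, max_ngram=2, num_draft=3))  # → ['cat', 'sat', 'on']
```

This works well on repetitive text (code, structured output, quoted passages) and costs almost nothing when it fails, which is why it is a reasonable alternative when no suitable draft model is available.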

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
