Implementation:Turboderp org Exllamav2 ExLlamaV2DynamicGenerator Init
| Knowledge Sources | |
|---|---|
| Domains | Inference_Optimization, Concurrent_Batching, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for initializing a paged-attention dynamic batching generator for concurrent LLM inference, provided by exllamav2.
Description
ExLlamaV2DynamicGenerator is the primary high-performance generator in exllamav2. Its constructor sets up:
- Page table management: Divides the KV cache into fixed-size pages and creates a page allocation system for virtual memory-style cache management.
- Job queue: Initializes the queue for managing concurrent generation requests with dynamic scheduling.
- Batch size calculation: If max_batch_size is not specified, it is automatically calculated based on available cache pages and model configuration.
- Speculative decoding: Optionally configures a draft model and draft cache for speculative decoding, or n-gram-based speculation.
- Sampling thread pool: Creates a thread pool for parallel token sampling across multiple jobs.
- Filter background evaluation: Pre-evaluates constrained decoding filters during idle GPU cycles.
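The batch-size auto-calculation can be sketched in isolation. The page size and the exact formula below are illustrative assumptions, not a copy of exllamav2's internals:

```python
# Hedged sketch of the auto batch-size calculation described above.
# PAGE_SIZE and the formula are illustrative assumptions, not exllamav2's code.
PAGE_SIZE = 256  # tokens per KV-cache page (assumed)

def auto_max_batch_size(total_cache_tokens: int, max_seq_len: int) -> int:
    """Divide the cache into pages, then ask how many sequences of up to
    max_seq_len tokens can hold their pages simultaneously."""
    num_pages = total_cache_tokens // PAGE_SIZE
    pages_per_seq = -(-max_seq_len // PAGE_SIZE)  # ceiling division
    return max(1, num_pages // pages_per_seq)

# A 65536-token cache yields 256 pages; 4096-token sequences need 16 pages
# each, so up to 16 sequences can be resident at once.
print(auto_max_batch_size(65536, 4096))  # 16
```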
After initialization, the generator is ready to accept generation requests either through generate() (a simple, blocking API) or through the job-based API for fine-grained control.
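The job-based control flow can be illustrated with a pure-Python mock. The method names below (enqueue, iterate, num_remaining_jobs) mirror the pattern, but the classes are stand-ins, not the real exllamav2 types, which additionally require a loaded model:

```python
# Pure-Python mock of the dynamic generator's job-queue pattern.
# MockJob/MockGenerator are illustrative stand-ins, not exllamav2 classes.
from collections import deque

class MockJob:
    def __init__(self, prompt, max_new_tokens):
        self.prompt = prompt
        self.remaining = max_new_tokens

class MockGenerator:
    def __init__(self):
        self.jobs = deque()

    def enqueue(self, job):
        self.jobs.append(job)

    def num_remaining_jobs(self):
        return len(self.jobs)

    def iterate(self):
        # One "forward pass": every active job emits one token; finished
        # jobs leave the batch while the rest keep going.
        results = []
        for job in list(self.jobs):
            job.remaining -= 1
            stage = "streaming" if job.remaining > 0 else "eos"
            results.append({"job": job, "stage": stage})
            if stage == "eos":
                self.jobs.remove(job)
        return results

gen = MockGenerator()
gen.enqueue(MockJob("Hello", 3))
gen.enqueue(MockJob("World", 2))
steps = 0
while gen.num_remaining_jobs():
    gen.iterate()
    steps += 1
print(steps)  # 3: jobs of unequal length finish independently
```

The point of the loop is that jobs of different lengths share each forward pass and exit the batch individually, which is what makes dynamic batching higher-throughput than fixed batches.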
Usage
Initialize ExLlamaV2DynamicGenerator after loading the model and allocating the cache. This is the recommended generator for:
- Server-side inference with multiple concurrent requests
- Applications that benefit from paged attention and prefix caching
- Speculative decoding deployments
- Any scenario requiring high throughput
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/generator/dynamic.py
- Lines: L241-481
Signature
class ExLlamaV2DynamicGenerator:
def __init__(
self,
model: ExLlamaV2,
cache: ExLlamaV2CacheBase,
tokenizer: ExLlamaV2Tokenizer,
max_batch_size: int | None = None,
max_seq_len: int | None = None,
max_chunk_size: int | None = None,
max_q_size: int = 8,
draft_model: ExLlamaV2 | None = None,
draft_cache: ExLlamaV2CacheBase | None = None,
num_draft_tokens: int = 4,
use_ngram_draft: bool = False,
max_ngram: int = 4,
max_sampling_threads: int = 16,
min_sampling_threads: int = 3,
paged: bool = True,
filter_background_eval: bool = True,
**kwargs,
):
...
Import
from exllamav2.generator import ExLlamaV2DynamicGenerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | Loaded model instance with weights on GPU(s) |
| cache | ExLlamaV2CacheBase | Yes | Allocated KV cache (FP16, Q4, Q6, or Q8) |
| tokenizer | ExLlamaV2Tokenizer | Yes | Initialized tokenizer for encoding/decoding |
| max_batch_size | int or None | No (default None) | Maximum concurrent sequences; None auto-calculates from cache capacity |
| max_seq_len | int or None | No (default None) | Maximum sequence length per job; None defaults to the cache's max_seq_len |
| max_chunk_size | int or None | No (default None) | Maximum tokens per prefill chunk; None uses a sensible default |
| max_q_size | int | No (default 8) | Maximum pending jobs in queue |
| draft_model | ExLlamaV2 or None | No (default None) | Smaller model for speculative decoding |
| draft_cache | ExLlamaV2CacheBase or None | No (default None) | Cache for the draft model |
| num_draft_tokens | int | No (default 4) | Number of tokens to speculate per step |
| use_ngram_draft | bool | No (default False) | Use n-gram-based speculation instead of draft model |
| max_ngram | int | No (default 4) | Maximum n-gram size for n-gram speculation |
| paged | bool | No (default True) | Enable paged attention (recommended) |
| filter_background_eval | bool | No (default True) | Pre-evaluate constrained decoding filters during idle cycles |
Outputs
| Name | Type | Description |
|---|---|---|
| generator instance | ExLlamaV2DynamicGenerator | Fully initialized generator ready for generate() or job-based API |
| generator.num_pages | int | Total number of cache pages available |
| generator.max_batch_size | int | Effective maximum batch size |
Usage Examples
Basic Initialization
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(
model=model,
cache=cache,
tokenizer=tokenizer,
)
With Speculative Decoding
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator
# Load main model
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
# Load draft model
draft_config = ExLlamaV2Config("/path/to/draft_model")
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, lazy=True)
draft_model.load_autosplit(draft_cache)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(
model=model,
cache=cache,
tokenizer=tokenizer,
draft_model=draft_model,
draft_cache=draft_cache,
num_draft_tokens=5,
)
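The draft/verify cycle that num_draft_tokens controls can be illustrated generically. This is textbook greedy speculative verification, not exllamav2's implementation:

```python
# Generic sketch of greedy speculative verification (not exllamav2's code):
# the draft model proposes num_draft_tokens tokens, the main model checks
# them in a single forward pass, and the longest agreeing prefix is kept.
def verify(draft_tokens, main_tokens):
    accepted = []
    for d, m in zip(draft_tokens, main_tokens):
        if d != m:
            break
        accepted.append(d)
    # The main model always contributes one extra token: the first
    # disagreement, or None here if the whole draft was accepted.
    bonus = main_tokens[len(accepted)] if len(accepted) < len(main_tokens) else None
    return accepted, bonus

acc, bonus = verify([5, 9, 2, 7], [5, 9, 4, 1])
print(acc, bonus)  # [5, 9] 4
```

When the draft model agrees often, several tokens are emitted per main-model forward pass, which is where the speedup comes from.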
With N-gram Speculation
generator = ExLlamaV2DynamicGenerator(
model=model,
cache=cache,
tokenizer=tokenizer,
use_ngram_draft=True,
max_ngram=4,
)
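use_ngram_draft replaces the draft model with statistics from the generated context itself: the most recent token suffix is looked up earlier in the sequence and, on a match, the tokens that followed it last time are proposed. A generic sketch of the idea (not exllamav2's code):

```python
# Generic sketch of n-gram draft speculation (not exllamav2's code).
def ngram_draft(context, max_ngram, num_draft_tokens):
    """Try the longest suffix first, backing off to shorter n-grams."""
    for n in range(max_ngram, 0, -1):
        suffix = tuple(context[-n:])
        # Search earlier occurrences of the suffix, most recent first
        # (the suffix itself, at the end of context, is excluded).
        for i in range(len(context) - n - 1, -1, -1):
            if tuple(context[i:i + n]) == suffix:
                proposal = context[i + n:i + n + num_draft_tokens]
                if proposal:
                    return proposal
    return []  # no match: fall back to ordinary decoding

ctx = [1, 2, 3, 4, 1]  # token ids; 1 last appeared at position 0
print(ngram_draft(ctx, 4, 3))  # [2, 3, 4]
```

This is cheap and model-free, so it pays off mainly on repetitive text (code, structured output) where recent n-grams recur.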
Related Pages
Requires Environment
- Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
- Environment:Turboderp_org_Exllamav2_Flash_Attention_Backend