Implementation:Turboderp org Exllamav2 ExLlamaV2DynamicGenerator Init
| Knowledge Sources | |
|---|---|
| Domains | Inference_Optimization, Concurrent_Batching, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for initializing a paged-attention dynamic batching generator for concurrent LLM inference, provided by exllamav2.
Description
ExLlamaV2DynamicGenerator is the primary high-performance generator in exllamav2. Its constructor sets up:
- Page table management: Divides the KV cache into fixed-size pages and creates a page allocation system for virtual memory-style cache management.
- Job queue: Initializes the queue for managing concurrent generation requests with dynamic scheduling.
- Batch size calculation: If max_batch_size is not specified, it is automatically calculated based on available cache pages and model configuration.
- Speculative decoding: Optionally configures a draft model and draft cache for speculative decoding, or n-gram-based speculation.
- Sampling thread pool: Creates a thread pool for parallel token sampling across multiple jobs.
- Filter background evaluation: Pre-evaluates constrained decoding filters during idle GPU cycles.
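The batch-size auto-calculation can be sketched in isolation. The page size and the exact formula below are illustrative assumptions, not a copy of exllamav2's internals:

```python
# Hedged sketch of the auto batch-size calculation described above.
# PAGE_SIZE and the formula are illustrative assumptions, not exllamav2's code.
PAGE_SIZE = 256  # tokens per KV-cache page (assumed)

def auto_max_batch_size(total_cache_tokens: int, max_seq_len: int) -> int:
    """Divide the cache into pages, then ask how many sequences of up to
    max_seq_len tokens can hold their pages simultaneously."""
    num_pages = total_cache_tokens // PAGE_SIZE
    pages_per_seq = -(-max_seq_len // PAGE_SIZE)  # ceiling division
    return max(1, num_pages // pages_per_seq)

# A 65536-token cache yields 256 pages; 4096-token sequences need 16 pages
# each, so up to 16 sequences can be resident at once.
print(auto_max_batch_size(65536, 4096))  # 16
```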
After initialization, the generator is ready to accept generation requests either through generate() (a simple, blocking API) or through the job-based API for fine-grained control.
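The job-based control flow can be illustrated with a pure-Python mock. The method names below (enqueue, iterate, num_remaining_jobs) mirror the pattern, but the classes are stand-ins, not the real exllamav2 types, which additionally require a loaded model:

```python
# Pure-Python mock of the dynamic generator's job-queue pattern.
# MockJob/MockGenerator are illustrative stand-ins, not exllamav2 classes.
from collections import deque

class MockJob:
    def __init__(self, prompt, max_new_tokens):
        self.prompt = prompt
        self.remaining = max_new_tokens

class MockGenerator:
    def __init__(self):
        self.jobs = deque()

    def enqueue(self, job):
        self.jobs.append(job)

    def num_remaining_jobs(self):
        return len(self.jobs)

    def iterate(self):
        # One "forward pass": every active job emits one token; finished
        # jobs leave the batch while the rest keep going.
        results = []
        for job in list(self.jobs):
            job.remaining -= 1
            stage = "streaming" if job.remaining > 0 else "eos"
            results.append({"job": job, "stage": stage})
            if stage == "eos":
                self.jobs.remove(job)
        return results

gen = MockGenerator()
gen.enqueue(MockJob("Hello", 3))
gen.enqueue(MockJob("World", 2))
steps = 0
while gen.num_remaining_jobs():
    gen.iterate()
    steps += 1
print(steps)  # 3: jobs of unequal length finish independently
```

The point of the loop is that jobs of different lengths share each forward pass and exit the batch individually, which is what makes dynamic batching higher-throughput than fixed batches.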
Usage
Initialize ExLlamaV2DynamicGenerator after loading the model and allocating the cache. This is the recommended generator for:
- Server-side inference with multiple concurrent requests
- Applications that benefit from paged attention and prefix caching
- Speculative decoding deployments
- Any scenario requiring high throughput
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/generator/dynamic.py
- Lines: L241-481
Signature
class ExLlamaV2DynamicGenerator:
def __init__(
self,
model: ExLlamaV2,
cache: ExLlamaV2CacheBase,
tokenizer: ExLlamaV2Tokenizer,
max_batch_size: int | None = None,
max_seq_len: int | None = None,
max_chunk_size: int | None = None,
max_q_size: int = 8,
draft_model: ExLlamaV2 | None = None,
draft_cache: ExLlamaV2CacheBase | None = None,
num_draft_tokens: int = 4,
use_ngram_draft: bool = False,
max_ngram: int = 4,
max_sampling_threads: int = 16,
min_sampling_threads: int = 3,
paged: bool = True,
filter_background_eval: bool = True,
**kwargs,
):
...
Import
from exllamav2.generator import ExLlamaV2DynamicGenerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | Loaded model instance with weights on GPU(s) |
| cache | ExLlamaV2CacheBase | Yes | Allocated KV cache (FP16, Q4, Q6, or Q8) |
| tokenizer | ExLlamaV2Tokenizer | Yes | Initialized tokenizer for encoding/decoding |
| max_batch_size | int or None | No (default None) | Maximum concurrent sequences; None auto-calculates from cache capacity |
| max_seq_len | int or None | No (default None) | Maximum sequence length per job; None defaults to the cache's max_seq_len |
| max_chunk_size | int or None | No (default None) | Maximum tokens per prefill chunk; None uses a sensible default |
| max_q_size | int | No (default 8) | Maximum pending jobs in queue |
| draft_model | ExLlamaV2 or None | No (default None) | Smaller model for speculative decoding |
| draft_cache | ExLlamaV2CacheBase or None | No (default None) | Cache for the draft model |
| num_draft_tokens | int | No (default 4) | Number of tokens to speculate per step |
| use_ngram_draft | bool | No (default False) | Use n-gram-based speculation instead of draft model |
| max_ngram | int | No (default 4) | Maximum n-gram size for n-gram speculation |
| paged | bool | No (default True) | Enable paged attention (recommended) |
| filter_background_eval | bool | No (default True) | Pre-evaluate constrained decoding filters during idle cycles |
Outputs
| Name | Type | Description |
|---|---|---|
| generator instance | ExLlamaV2DynamicGenerator | Fully initialized generator ready for generate() or job-based API |
| generator.num_pages | int | Total number of cache pages available |
| generator.max_batch_size | int | Effective maximum batch size |
Usage Examples
Basic Initialization
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(
model=model,
cache=cache,
tokenizer=tokenizer,
)
With Speculative Decoding
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator
# Load main model
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
# Load draft model
draft_config = ExLlamaV2Config("/path/to/draft_model")
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, lazy=True)
draft_model.load_autosplit(draft_cache)
tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(
model=model,
cache=cache,
tokenizer=tokenizer,
draft_model=draft_model,
draft_cache=draft_cache,
num_draft_tokens=5,
)
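The draft/verify cycle that num_draft_tokens controls can be illustrated generically. This is textbook greedy speculative verification, not exllamav2's implementation:

```python
# Generic sketch of greedy speculative verification (not exllamav2's code):
# the draft model proposes num_draft_tokens tokens, the main model checks
# them in a single forward pass, and the longest agreeing prefix is kept.
def verify(draft_tokens, main_tokens):
    accepted = []
    for d, m in zip(draft_tokens, main_tokens):
        if d != m:
            break
        accepted.append(d)
    # The main model always contributes one extra token: the first
    # disagreement, or None here if the whole draft was accepted.
    bonus = main_tokens[len(accepted)] if len(accepted) < len(main_tokens) else None
    return accepted, bonus

acc, bonus = verify([5, 9, 2, 7], [5, 9, 4, 1])
print(acc, bonus)  # [5, 9] 4
```

When the draft model agrees often, several tokens are emitted per main-model forward pass, which is where the speedup comes from.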
With N-gram Speculation
generator = ExLlamaV2DynamicGenerator(
model=model,
cache=cache,
tokenizer=tokenizer,
use_ngram_draft=True,
max_ngram=4,
)
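use_ngram_draft replaces the draft model with statistics from the generated context itself: the most recent token suffix is looked up earlier in the sequence and, on a match, the tokens that followed it last time are proposed. A generic sketch of the idea (not exllamav2's code):

```python
# Generic sketch of n-gram draft speculation (not exllamav2's code).
def ngram_draft(context, max_ngram, num_draft_tokens):
    """Try the longest suffix first, backing off to shorter n-grams."""
    for n in range(max_ngram, 0, -1):
        suffix = tuple(context[-n:])
        # Search earlier occurrences of the suffix, most recent first
        # (the suffix itself, at the end of context, is excluded).
        for i in range(len(context) - n - 1, -1, -1):
            if tuple(context[i:i + n]) == suffix:
                proposal = context[i + n:i + n + num_draft_tokens]
                if proposal:
                    return proposal
    return []  # no match: fall back to ordinary decoding

ctx = [1, 2, 3, 4, 1]  # token ids; 1 last appeared at position 0
print(ngram_draft(ctx, 4, 3))  # [2, 3, 4]
```

This is cheap and model-free, so it pays off mainly on repetitive text (code, structured output) where recent n-grams recur.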
Related Pages
Requires Environment
- Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
- Environment:Turboderp_org_Exllamav2_Flash_Attention_Backend