# Principle: Turboderp-org ExLlamaV2 Dynamic Generator Setup
| Knowledge Sources | |
|---|---|
| Domains | Inference_Optimization, Concurrent_Batching, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
## Overview
The Dynamic Generator implements a paged-attention concurrent batching system that allows multiple generation requests to share a single KV cache efficiently through virtual memory-style page tables.
## Description
Traditional LLM serving processes one request at a time, wasting GPU compute during memory-bound operations. The Dynamic Generator addresses this through several key innovations:
- Paged attention: Instead of allocating a contiguous KV cache per sequence, the cache is divided into fixed-size pages (similar to virtual memory). Each sequence maintains a page table that maps logical positions to physical cache pages. This eliminates memory fragmentation and allows efficient sharing of cache pages between sequences with common prefixes.
- Concurrent batching: Multiple generation requests are processed simultaneously in a single forward pass. The generator maintains a job queue and dynamically batches compatible jobs together, maximizing GPU utilization. Jobs can be added and removed dynamically without interrupting ongoing generation.
- Prefix cache deduplication: When multiple requests share the same prefix (e.g., a system prompt), their prefix tokens share the same physical cache pages. This dramatically reduces memory usage for workloads with common prefixes.
- Speculative decoding: The generator optionally supports speculative decoding using either a smaller draft model or n-gram prediction. Multiple candidate tokens are generated speculatively, then verified in a single forward pass of the main model, improving throughput for memory-bound scenarios.
- Dynamic scheduling: The generator schedules jobs based on available cache pages, prefill progress, and generation state. It handles the transition between prefill (processing the prompt) and decode (generating new tokens) phases automatically.
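The prefix-deduplication idea above can be sketched in plain Python. This is a minimal, hypothetical model (the `PagedCache` class and `PAGE_SIZE` value are illustrative, not exllamav2's actual implementation): full pages are keyed by their token contents, so sequences with an identical prompt prefix map to the same physical pages.

```python
# Minimal sketch: deduplicating KV-cache pages by their token contents.
# Sequences whose prompts start with the same tokens reuse the same physical page.

PAGE_SIZE = 4  # tokens per page (tiny for illustration; real systems use e.g. 256)

class PagedCache:
    def __init__(self):
        self.pages = {}    # page_id -> tuple of tokens stored in that page
        self.index = {}    # full-page contents -> page_id (for deduplication)
        self.next_id = 0

    def allocate(self, tokens):
        """Return a page table for `tokens`, sharing full pages that already exist."""
        table = []
        for i in range(0, len(tokens), PAGE_SIZE):
            chunk = tuple(tokens[i:i + PAGE_SIZE])
            full = len(chunk) == PAGE_SIZE
            if full and chunk in self.index:   # reuse an identical full page
                table.append(self.index[chunk])
                continue
            page_id = self.next_id
            self.next_id += 1
            self.pages[page_id] = chunk
            if full:
                self.index[chunk] = page_id
            table.append(page_id)
        return table

cache = PagedCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]           # shared prefix (two full pages)
job_a = cache.allocate(system_prompt + [9, 10])    # pages: shared, shared, private
job_b = cache.allocate(system_prompt + [11, 12])
print(job_a[:2] == job_b[:2])  # True: the prefix pages are shared
print(len(cache.pages))        # 4 physical pages allocated instead of 6
```

Note that only *full* pages are deduplicated here: a partially filled trailing page stays private to its sequence, since its contents may still grow.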
## Usage
Use the Dynamic Generator when:
- Serving multiple concurrent users or requests
- Building an API server or chat application with batched inference
- Maximizing GPU throughput for high-volume text generation
- Benefiting from prefix cache deduplication (shared system prompts)
- Using speculative decoding for faster single-request throughput
## Theoretical Basis
### PagedAttention

```python
# Traditional attention cache: one contiguous allocation per sequence,
# which wastes memory through fragmentation and over-allocation.

# Paged attention instead divides the cache into fixed-size pages:
page_size = 256  # tokens per page (configurable)
total_pages = cache_memory // (page_size * kv_size_per_token)

# Each sequence maintains a page table mapping logical to physical pages:
sequence.page_table = [page_3, page_17, page_42, ...]

# Attention for logical position i looks up:
physical_page = sequence.page_table[i // page_size]
offset = i % page_size
k, v = cache[physical_page][offset]
```
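The lookup above can be exercised as a tiny pure-Python toy. Token-id stubs stand in for real K/V tensors, and `write`/`read` are illustrative helper names, not exllamav2 API:

```python
PAGE_SIZE = 256

# Physical cache: page_id -> list of slots, here stubbed with plain tuples
# instead of real key/value tensors.
cache = {3: [None] * PAGE_SIZE, 17: [None] * PAGE_SIZE, 42: [None] * PAGE_SIZE}
page_table = [3, 17, 42]  # logical pages 0, 1, 2 map to these physical pages

def write(pos, kv):
    # Translate a logical token position into (physical page, offset).
    cache[page_table[pos // PAGE_SIZE]][pos % PAGE_SIZE] = kv

def read(pos):
    return cache[page_table[pos // PAGE_SIZE]][pos % PAGE_SIZE]

write(300, ("k300", "v300"))  # logical position 300 -> page 17, offset 44
print(read(300))              # ('k300', 'v300')
print(300 // PAGE_SIZE, 300 % PAGE_SIZE)  # 1 44
```

The point of the indirection is that the physical pages (3, 17, 42) need not be contiguous or even unique to this sequence.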
### Concurrent Batching

```python
# Dynamic batching combines multiple sequences into one forward pass:
def iterate():
    active_jobs = select_jobs_for_batch(max_batch_size)

    # Build the combined input tensor
    input_ids = concatenate([job.next_tokens for job in active_jobs])
    cache_indices = concatenate([job.cache_pages for job in active_jobs])

    # Single forward pass for all jobs
    logits = model.forward(input_ids, cache_indices)

    # Distribute results back to the individual jobs
    for job, job_logits in zip(active_jobs, split(logits, len(active_jobs))):
        job.sample_and_update(job_logits)
```
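A runnable toy version of this loop, with a stub standing in for the batched model call (the `Job` class, `model_forward`, and the next-token rule are invented for illustration): jobs queue their next tokens, one batched call processes them all, and results are split back per job. Jobs that finish simply drop out of the next batch.

```python
# Toy version of the iterate() loop: a stub "model" processes all active
# jobs' next tokens in one batched call, then results are split back per job.

class Job:
    def __init__(self, prompt, max_new_tokens):
        self.tokens = list(prompt)
        self.remaining = max_new_tokens

def model_forward(batch_tokens):
    # Stand-in for the single batched forward pass: next token = last token + 1.
    return [t + 1 for t in batch_tokens]

def iterate(jobs):
    active = [j for j in jobs if j.remaining > 0]
    if not active:
        return False
    inputs = [j.tokens[-1] for j in active]  # one next-token input per job
    outputs = model_forward(inputs)          # single "forward pass" for the batch
    for job, tok in zip(active, outputs):    # distribute results
        job.tokens.append(tok)
        job.remaining -= 1
    return True

jobs = [Job([10], 3), Job([100], 1)]  # jobs may finish at different times
while iterate(jobs):
    pass
print(jobs[0].tokens)  # [10, 11, 12, 13]
print(jobs[1].tokens)  # [100, 101]
```

The second job completes after one step while the first keeps generating, mirroring how the real generator adds and removes jobs without interrupting the batch.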
### Speculative Decoding

```python
# A draft model generates k candidate tokens quickly:
draft_tokens = draft_model.generate(context, k=num_draft_tokens)

# The main model verifies all k tokens in a single forward pass:
logits = main_model.forward(draft_tokens)

# Tokens that match are accepted; everything from the first mismatch is rejected.
accepted = verify_tokens(draft_tokens, logits)

# Typical acceptance rate: 70-90% for a well-matched draft model.
```
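The greedy acceptance rule can be shown concretely. This sketch (with `verify_tokens` written as a hypothetical helper, comparing token ids rather than sampling from logits) accepts draft tokens only up to the first position where the main model disagrees:

```python
# Toy acceptance check: accept draft tokens until the first one that disagrees
# with the main model's own prediction for that position.

def verify_tokens(draft_tokens, main_predictions):
    accepted = []
    for drafted, predicted in zip(draft_tokens, main_predictions):
        if drafted != predicted:
            break
        accepted.append(drafted)
    return accepted

draft = [5, 8, 13, 21]
main  = [5, 8, 99, 21]  # main model disagrees at position 2
print(verify_tokens(draft, main))  # [5, 8]
```

Even a partial match is a win: two tokens were produced for the cost of one main-model forward pass, which is why high draft acceptance rates translate directly into throughput gains.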