# Principle: Turboderp-org ExLlamaV2 Dynamic Generator Setup
| Knowledge Sources | |
|---|---|
| Domains | Inference_Optimization, Concurrent_Batching, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
## Overview
The Dynamic Generator implements a paged-attention concurrent batching system that allows multiple generation requests to share a single KV cache efficiently through virtual memory-style page tables.
## Description
Traditional LLM serving processes one request at a time, wasting GPU compute during memory-bound operations. The Dynamic Generator addresses this through several key innovations:
- Paged attention: Instead of allocating a contiguous KV cache per sequence, the cache is divided into fixed-size pages (similar to virtual memory). Each sequence maintains a page table that maps logical positions to physical cache pages. This eliminates memory fragmentation and allows efficient sharing of cache pages between sequences with common prefixes.
- Concurrent batching: Multiple generation requests are processed simultaneously in a single forward pass. The generator maintains a job queue and dynamically batches compatible jobs together, maximizing GPU utilization. Jobs can be added and removed dynamically without interrupting ongoing generation.
- Prefix cache deduplication: When multiple requests share the same prefix (e.g., a system prompt), their prefix tokens share the same physical cache pages. This dramatically reduces memory usage for workloads with common prefixes.
- Speculative decoding: The generator optionally supports speculative decoding using either a smaller draft model or n-gram prediction. Multiple candidate tokens are generated speculatively, then verified in a single forward pass of the main model, improving throughput for memory-bound scenarios.
- Dynamic scheduling: The generator schedules jobs based on available cache pages, prefill progress, and generation state. It handles the transition between prefill (processing the prompt) and decode (generating new tokens) phases automatically.
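The prefix-deduplication idea above can be sketched in plain Python. This is a minimal, hypothetical model (the `PagedCache` class and `PAGE_SIZE` value are illustrative, not exllamav2's actual implementation): full pages are keyed by their token contents, so sequences with an identical prompt prefix map to the same physical pages.

```python
# Minimal sketch: deduplicating KV-cache pages by their token contents.
# Sequences whose prompts start with the same tokens reuse the same physical page.

PAGE_SIZE = 4  # tokens per page (tiny for illustration; real systems use e.g. 256)

class PagedCache:
    def __init__(self):
        self.pages = {}    # page_id -> tuple of tokens stored in that page
        self.index = {}    # full-page contents -> page_id (for deduplication)
        self.next_id = 0

    def allocate(self, tokens):
        """Return a page table for `tokens`, sharing full pages that already exist."""
        table = []
        for i in range(0, len(tokens), PAGE_SIZE):
            chunk = tuple(tokens[i:i + PAGE_SIZE])
            full = len(chunk) == PAGE_SIZE
            if full and chunk in self.index:   # reuse an identical full page
                table.append(self.index[chunk])
                continue
            page_id = self.next_id
            self.next_id += 1
            self.pages[page_id] = chunk
            if full:
                self.index[chunk] = page_id
            table.append(page_id)
        return table

cache = PagedCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]           # shared prefix (two full pages)
job_a = cache.allocate(system_prompt + [9, 10])    # pages: shared, shared, private
job_b = cache.allocate(system_prompt + [11, 12])
print(job_a[:2] == job_b[:2])  # True: the prefix pages are shared
print(len(cache.pages))        # 4 physical pages allocated instead of 6
```

Note that only *full* pages are deduplicated here: a partially filled trailing page stays private to its sequence, since its contents may still grow.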
## Usage
Use the Dynamic Generator when:
- Serving multiple concurrent users or requests
- Building an API server or chat application with batched inference
- Maximizing GPU throughput for high-volume text generation
- Benefiting from prefix cache deduplication (shared system prompts)
- Using speculative decoding for faster single-request throughput
## Theoretical Basis
### PagedAttention

```python
# Traditional attention cache: one contiguous allocation per sequence,
# which wastes memory through fragmentation and over-allocation.

# Paged attention instead divides the cache into fixed-size pages:
page_size = 256  # tokens per page (configurable)
total_pages = cache_memory // (page_size * kv_size_per_token)

# Each sequence maintains a page table mapping logical to physical pages:
sequence.page_table = [page_3, page_17, page_42, ...]

# Attention for logical position i looks up:
physical_page = sequence.page_table[i // page_size]
offset = i % page_size
k, v = cache[physical_page][offset]
```
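The lookup above can be exercised as a tiny pure-Python toy. Token-id stubs stand in for real K/V tensors, and `write`/`read` are illustrative helper names, not exllamav2 API:

```python
PAGE_SIZE = 256

# Physical cache: page_id -> list of slots, here stubbed with plain tuples
# instead of real key/value tensors.
cache = {3: [None] * PAGE_SIZE, 17: [None] * PAGE_SIZE, 42: [None] * PAGE_SIZE}
page_table = [3, 17, 42]  # logical pages 0, 1, 2 map to these physical pages

def write(pos, kv):
    # Translate a logical token position into (physical page, offset).
    cache[page_table[pos // PAGE_SIZE]][pos % PAGE_SIZE] = kv

def read(pos):
    return cache[page_table[pos // PAGE_SIZE]][pos % PAGE_SIZE]

write(300, ("k300", "v300"))  # logical position 300 -> page 17, offset 44
print(read(300))              # ('k300', 'v300')
print(300 // PAGE_SIZE, 300 % PAGE_SIZE)  # 1 44
```

The point of the indirection is that the physical pages (3, 17, 42) need not be contiguous or even unique to this sequence.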
### Concurrent Batching

```python
# Dynamic batching combines multiple sequences into one forward pass:
def iterate():
    active_jobs = select_jobs_for_batch(max_batch_size)

    # Build the combined input tensor
    input_ids = concatenate([job.next_tokens for job in active_jobs])
    cache_indices = concatenate([job.cache_pages for job in active_jobs])

    # Single forward pass for all jobs
    logits = model.forward(input_ids, cache_indices)

    # Distribute results back to the individual jobs
    for job, job_logits in zip(active_jobs, split(logits, len(active_jobs))):
        job.sample_and_update(job_logits)
```
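A runnable toy version of this loop, with a stub standing in for the batched model call (the `Job` class, `model_forward`, and the next-token rule are invented for illustration): jobs queue their next tokens, one batched call processes them all, and results are split back per job. Jobs that finish simply drop out of the next batch.

```python
# Toy version of the iterate() loop: a stub "model" processes all active
# jobs' next tokens in one batched call, then results are split back per job.

class Job:
    def __init__(self, prompt, max_new_tokens):
        self.tokens = list(prompt)
        self.remaining = max_new_tokens

def model_forward(batch_tokens):
    # Stand-in for the single batched forward pass: next token = last token + 1.
    return [t + 1 for t in batch_tokens]

def iterate(jobs):
    active = [j for j in jobs if j.remaining > 0]
    if not active:
        return False
    inputs = [j.tokens[-1] for j in active]  # one next-token input per job
    outputs = model_forward(inputs)          # single "forward pass" for the batch
    for job, tok in zip(active, outputs):    # distribute results
        job.tokens.append(tok)
        job.remaining -= 1
    return True

jobs = [Job([10], 3), Job([100], 1)]  # jobs may finish at different times
while iterate(jobs):
    pass
print(jobs[0].tokens)  # [10, 11, 12, 13]
print(jobs[1].tokens)  # [100, 101]
```

The second job completes after one step while the first keeps generating, mirroring how the real generator adds and removes jobs without interrupting the batch.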
### Speculative Decoding

```python
# A draft model generates k candidate tokens quickly:
draft_tokens = draft_model.generate(context, k=num_draft_tokens)

# The main model verifies all k tokens in a single forward pass:
logits = main_model.forward(draft_tokens)

# Tokens that match are accepted; everything from the first mismatch is rejected.
accepted = verify_tokens(draft_tokens, logits)

# Typical acceptance rate: 70-90% for a well-matched draft model.
```
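The greedy acceptance rule can be shown concretely. This sketch (with `verify_tokens` written as a hypothetical helper, comparing token ids rather than sampling from logits) accepts draft tokens only up to the first position where the main model disagrees:

```python
# Toy acceptance check: accept draft tokens until the first one that disagrees
# with the main model's own prediction for that position.

def verify_tokens(draft_tokens, main_predictions):
    accepted = []
    for drafted, predicted in zip(draft_tokens, main_predictions):
        if drafted != predicted:
            break
        accepted.append(drafted)
    return accepted

draft = [5, 8, 13, 21]
main  = [5, 8, 99, 21]  # main model disagrees at position 2
print(verify_tokens(draft, main))  # [5, 8]
```

Even a partial match is a win: two tokens were produced for the cost of one main-model forward pass, which is why high draft acceptance rates translate directly into throughput gains.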