
Principle:Turboderp org Exllamav2 Dynamic Generator Setup

From Leeroopedia
Knowledge Sources
Domains Inference_Optimization, Concurrent_Batching, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

The Dynamic Generator implements a paged-attention concurrent batching system that allows multiple generation requests to share a single KV cache efficiently through virtual memory-style page tables.

Description

Traditional LLM serving processes one request at a time; because the decode phase is memory-bound, this leaves most of the GPU's compute idle. The Dynamic Generator addresses this through several key innovations:

  • Paged attention: Instead of allocating a contiguous KV cache per sequence, the cache is divided into fixed-size pages (similar to virtual memory). Each sequence maintains a page table that maps logical positions to physical cache pages. This eliminates memory fragmentation and allows efficient sharing of cache pages between sequences with common prefixes.
  • Concurrent batching: Multiple generation requests are processed simultaneously in a single forward pass. The generator maintains a job queue and dynamically batches compatible jobs together, maximizing GPU utilization. Jobs can be added and removed dynamically without interrupting ongoing generation.
  • Prefix cache deduplication: When multiple requests share the same prefix (e.g., a system prompt), their prefix tokens share the same physical cache pages. This dramatically reduces memory usage for workloads with common prefixes.
  • Speculative decoding: The generator optionally supports speculative decoding using either a smaller draft model or n-gram prediction. Multiple candidate tokens are generated speculatively, then verified in a single forward pass of the main model, improving throughput for memory-bound scenarios.
  • Dynamic scheduling: The generator schedules jobs based on available cache pages, prefill progress, and generation state. It handles the transition between prefill (processing the prompt) and decode (generating new tokens) phases automatically.
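The prefix-deduplication idea above can be sketched with a toy page allocator that reuses physical pages whenever a full page of tokens has been seen before. This is a minimal illustration, not ExLlamaV2's actual implementation; names like `PageAllocator` and the tiny page size are assumptions for demonstration only.

```python
# Minimal sketch of prefix-cache deduplication via shared pages.
# PageAllocator and PAGE_SIZE are illustrative, not exllamav2's API.

PAGE_SIZE = 4  # tiny page size for demonstration (the real default is 256)

class PageAllocator:
    def __init__(self):
        self.pages_by_hash = {}   # full page of tokens -> physical page id
        self.refcount = {}        # physical page id -> number of referencing sequences
        self.next_page = 0

    def allocate(self, tokens):
        """Return a page table for `tokens`, sharing full pages already cached."""
        table = []
        for i in range(0, len(tokens), PAGE_SIZE):
            chunk = tuple(tokens[i:i + PAGE_SIZE])
            if len(chunk) == PAGE_SIZE and chunk in self.pages_by_hash:
                page = self.pages_by_hash[chunk]      # reuse a shared prefix page
            else:
                page = self.next_page                 # allocate a fresh page
                self.next_page += 1
                if len(chunk) == PAGE_SIZE:
                    self.pages_by_hash[chunk] = page
            self.refcount[page] = self.refcount.get(page, 0) + 1
            table.append(page)
        return table

alloc = PageAllocator()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]          # exactly two full pages
a = alloc.allocate(system_prompt + [9, 10])       # pages 0, 1 + private page 2
b = alloc.allocate(system_prompt + [11, 12])      # reuses 0, 1 + private page 3
print(a, b)  # [0, 1, 2] [0, 1, 3]
```

Note that both sequences' page tables point at the same physical pages 0 and 1 for the system prompt, so its KV entries are stored only once.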

Usage

Use the Dynamic Generator when:

  • Serving multiple concurrent users or requests
  • Building an API server or chat application with batched inference
  • Maximizing GPU throughput for high-volume text generation
  • Benefiting from prefix cache deduplication (shared system prompts)
  • Using speculative decoding for faster single-request throughput

Theoretical Basis

PagedAttention

# Traditional attention cache: contiguous allocation per sequence
# Wastes memory due to fragmentation and over-allocation

# Paged attention:
page_size = 256  # tokens per page (configurable)
total_pages = cache_memory // (page_size * kv_size_per_token)

# Each sequence has a page table:
sequence.page_table = [page_3, page_17, page_42, ...]

# Attention for position i looks up:
physical_page = sequence.page_table[i // page_size]
offset = i % page_size
k, v = cache[physical_page][offset]
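The lookup above can be made runnable with a toy cache. This is a sketch with placeholder values; the real cache holds per-layer key/value tensors rather than strings, and a page size of 4 is used purely for readability.

```python
# Runnable toy version of the paged-attention lookup.
# Each physical page stores one (k, v) pair per slot.

page_size = 4  # toy value; the real default is 256 tokens per page

# Physical cache: page id -> list of (k, v) entries, one per slot
cache = {
    3:  [(f"k{i}", f"v{i}") for i in range(0, 4)],    # logical positions 0-3
    17: [(f"k{i}", f"v{i}") for i in range(4, 8)],    # logical positions 4-7
}

class Sequence:
    def __init__(self, page_table):
        self.page_table = page_table

seq = Sequence(page_table=[3, 17])  # logical pages 0 and 1

def lookup(seq, i):
    """Map logical position i to its (k, v) entry in the physical cache."""
    physical_page = seq.page_table[i // page_size]
    offset = i % page_size
    return cache[physical_page][offset]

print(lookup(seq, 5))  # ('k5', 'v5') -- physical page 17, offset 1
```

The sequence never sees physical page numbers directly; moving a page only requires updating the page table, which is what makes defragmentation-free allocation possible.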

Concurrent Batching

# Dynamic batching combines multiple sequences into one forward pass:
def iterate():
    active_jobs = select_jobs_for_batch(max_batch_size)

    # Build combined input tensor
    input_ids = concatenate([job.next_tokens for job in active_jobs])
    cache_indices = concatenate([job.cache_pages for job in active_jobs])

    # Single forward pass for all jobs
    logits = model.forward(input_ids, cache_indices)

    # Distribute results
    for job, job_logits in zip(active_jobs, split(logits)):
        job.sample_and_update(job_logits)
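One iteration of this loop can be simulated end to end by stubbing out the model. The `Job` structure and `fake_forward` below are illustrative assumptions, not the library's classes; the point is only the batch/forward/distribute shape of the loop.

```python
# Toy simulation of one dynamic-batching iteration.
# fake_forward stands in for the model; Job is illustrative.

class Job:
    def __init__(self, name, next_token):
        self.name = name
        self.next_token = next_token
        self.output = []

    def sample_and_update(self, logit):
        # Greedy "sampling" on the stub logits: just record the value.
        self.output.append(logit)

def fake_forward(input_ids):
    # Stub model: returns one "logit" per input token.
    return [t + 100 for t in input_ids]

def iterate(jobs, max_batch_size=8):
    active = jobs[:max_batch_size]                     # select jobs for this batch
    input_ids = [job.next_token for job in active]     # combined input tensor
    logits = fake_forward(input_ids)                   # single forward pass
    for job, job_logits in zip(active, logits):        # distribute results
        job.sample_and_update(job_logits)

jobs = [Job("a", 1), Job("b", 2), Job("c", 3)]
iterate(jobs)
print([j.output for j in jobs])  # [[101], [102], [103]]
```

Because jobs are selected fresh on every call, a finished job can be dropped and a new one admitted between iterations without ever stalling the other sequences in the batch.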

Speculative Decoding

# Draft model generates k candidate tokens quickly
draft_tokens = draft_model.generate(context, k=num_draft_tokens)

# Main model verifies all k tokens in one forward pass
logits = main_model.forward(draft_tokens)

# Accept tokens that match, reject from first mismatch
accepted = verify_tokens(draft_tokens, logits)
# Typical acceptance rate: 70-90% for good draft models
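The accept/reject step can be sketched as greedy verification: keep draft tokens until the first position where the main model's argmax disagrees, then substitute the main model's token. This is a simplification; production speculative decoding typically uses rejection sampling over the full distributions, and `main_argmax` here is an assumed stand-in for argmax over the main model's logits.

```python
# Toy greedy verification step for speculative decoding.
# main_argmax stands in for argmax over the main model's logits per position.

def verify_tokens(draft_tokens, main_argmax):
    """Accept draft tokens until the first disagreement with the main model."""
    accepted = []
    for draft, target in zip(draft_tokens, main_argmax):
        if draft != target:
            # First mismatch: discard this and all later draft tokens;
            # emit the main model's own token instead.
            accepted.append(target)
            break
        accepted.append(draft)
    return accepted

draft = [5, 9, 2, 7]        # 4 candidate tokens from the draft model
main  = [5, 9, 3, 1]        # main model agrees on the first two only
print(verify_tokens(draft, main))  # [5, 9, 3] -- two accepted + one corrected
```

Even on a total mismatch the step still yields one valid token (the main model's), so correctness never depends on draft quality; only throughput does.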

Related Pages

Implemented By

Uses Heuristic
