Principle: Turboderp-org ExLlamaV2 Job-Based Generation
| Knowledge Sources | |
|---|---|
| Domains | Concurrent_Batching, Inference_Optimization, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Job-based generation provides fine-grained control over individual inference requests by encapsulating each generation task as an independent job with its own inputs, settings, and tracking identifier.
Description
For advanced use cases such as multimodal inference, bulk processing, and concurrent request handling, the Dynamic Generator provides a job-based API as an alternative to the high-level generate() method. In the job-based approach:
Job Creation: Each inference request is encapsulated as an ExLlamaV2DynamicJob object. The job carries:
- Input token IDs - the tokenized prompt
- Generation settings - sampling parameters (temperature, top-k, top-p, etc.)
- Stop conditions - tokens or strings that signal generation completion
- Maximum new tokens - upper bound on generated length
- Multimodal embeddings - optional image/vision embeddings for VLM inference
- Identifier - a user-defined object for tracking which results belong to which request
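The fields above can be sketched as a plain data record. The class and field names below are hypothetical and simply mirror the list; they are not the real `ExLlamaV2DynamicJob` constructor signature.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical sketch of a job record; field names mirror the list above,
# not the actual ExLlamaV2DynamicJob API.
@dataclass
class GenerationJob:
    input_ids: list[int]                    # tokenized prompt
    settings: dict[str, float]              # sampling parameters
    stop_conditions: list[Any]              # token IDs or strings
    max_new_tokens: int                     # upper bound on generated length
    embeddings: Optional[list[Any]] = None  # optional multimodal embeddings
    identifier: Any = None                  # user-defined tracking object

job = GenerationJob(
    input_ids=[1, 15043, 3186],
    settings={"temperature": 0.8, "top_k": 50, "top_p": 0.9},
    stop_conditions=[2, "\n\n"],
    max_new_tokens=128,
    identifier="request-42",
)
```

The `identifier` can be any object (a string, an index into a dataset, a request handle), since the generator only passes it back with results.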
Job Enqueuing: Created jobs are submitted to the generator's internal queue via the enqueue() method, which returns a serial number for the job. The generator maintains a pool of active jobs and processes them concurrently.
Concurrent Processing: The generator uses paged attention (in the style of PagedAttention) to share KV cache memory efficiently across multiple active jobs. Cache pages can be reused by jobs that share a prompt prefix, reducing memory consumption and improving throughput.
Per-Job Control: Each job operates independently: it has its own generation settings, stop conditions, and result stream. This enables mixed workloads where different requests have different requirements (e.g., different sampling temperatures or different stop tokens).
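The enqueue/iterate flow can be illustrated with a toy mock generator. The names and result shape below are assumptions for illustration; the real generator runs model forward passes on each step rather than decrementing counters.

```python
import itertools

# Hypothetical mock illustrating the enqueue()/iterate() flow; the real
# ExLlamaV2 generator performs a model forward pass per step.
class MockGenerator:
    def __init__(self):
        self._serials = itertools.count(1)
        self._active = {}  # serial number -> [identifier, tokens remaining]

    def enqueue(self, identifier, num_tokens):
        serial = next(self._serials)
        self._active[serial] = [identifier, num_tokens]
        return serial  # a serial number, as described above

    def iterate(self):
        # One decode step: each active job emits one result.
        results = []
        for serial in list(self._active):
            ident, _ = self._active[serial]
            self._active[serial][1] -= 1
            eos = self._active[serial][1] == 0
            results.append({"identifier": ident, "eos": eos})
            if eos:
                del self._active[serial]  # job complete, leave the pool
        return results

gen = MockGenerator()
gen.enqueue("job-a", num_tokens=2)  # e.g., low temperature, short stop
gen.enqueue("job-b", num_tokens=3)  # e.g., high temperature
finished = []
while gen._active:
    for r in gen.iterate():
        if r["eos"]:
            finished.append(r["identifier"])
```

Both jobs advance on every `iterate()` call, which is what makes the workload concurrent rather than sequential.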
Usage
Use job-based generation when you need:
- Per-request multimodal embeddings (each job can carry different image embeddings)
- Concurrent processing of multiple requests with different settings
- Fine-grained control over individual request lifecycle
- Streaming results from multiple concurrent generations
- Bulk inference over datasets with result tracking via identifiers
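For bulk inference, the identifier is what routes interleaved streamed results back to their originating requests. The sketch below assumes a simplified result-dict shape; the keys are illustrative, not the real iterate() result format.

```python
# Hypothetical result stream: each dict is one streamed chunk carrying the
# identifier of the job it belongs to (keys assumed for illustration).
stream = [
    {"identifier": 0, "text": "The"},
    {"identifier": 1, "text": "Bonjour"},
    {"identifier": 0, "text": " answer"},
    {"identifier": 1, "text": " le monde"},
]

# Route each chunk back to its originating request via the identifier.
outputs = {}
for result in stream:
    outputs.setdefault(result["identifier"], []).append(result["text"])

completions = {k: "".join(v) for k, v in outputs.items()}
```

Because results from different jobs arrive interleaved, this identifier-keyed accumulation is the standard pattern for dataset-scale batch runs.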
Theoretical Basis
The job-based design follows the principles of PagedAttention:
Traditional KV Cache:
- Each request allocates contiguous memory for max_seq_len
- Memory waste from over-allocation and fragmentation
PagedAttention KV Cache:
- Memory divided into fixed-size pages
- Each job's KV cache mapped to non-contiguous pages
- Pages allocated on-demand as sequences grow
- Shared prefixes can share pages (copy-on-write)
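The page-based bookkeeping above can be demonstrated with a toy allocator: fixed-size pages, per-job page tables, and reference counts on shared prefix pages. This is purely illustrative and not exllamav2's actual cache implementation; real page sizes are much larger (e.g., 256 tokens).

```python
PAGE_SIZE = 4  # tokens per cache page (toy value for illustration)

class PagedCache:
    """Toy allocator: each job maps token positions to non-contiguous pages."""
    def __init__(self):
        self.refcount = {}   # page id -> number of jobs referencing it
        self.tables = {}     # job id -> list of page ids
        self.next_page = 0

    def alloc_job(self, job_id, num_tokens, share_from=None):
        needed = -(-num_tokens // PAGE_SIZE)  # ceiling division
        pages = []
        if share_from is not None:
            # Shared prefix: reuse the other job's pages, bump refcounts.
            for p in self.tables[share_from][:needed]:
                self.refcount[p] += 1
                pages.append(p)
        while len(pages) < needed:
            # Allocate fresh pages on demand as the sequence grows.
            p, self.next_page = self.next_page, self.next_page + 1
            self.refcount[p] = 1
            pages.append(p)
        self.tables[job_id] = pages

cache = PagedCache()
cache.alloc_job("a", num_tokens=8)                   # gets pages 0, 1
cache.alloc_job("b", num_tokens=12, share_from="a")  # shares 0, 1; adds 2
```

A real implementation would additionally copy a shared page before a job writes into it (copy-on-write) and decrement refcounts on job completion; the sketch only shows the sharing and on-demand growth.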
Job Lifecycle:
1. CREATE: Job(input_ids, settings, embeddings, identifier)
2. ENQUEUE: generator.enqueue(job) -> serial_number
3. PREFILL: First forward pass processes full input sequence
4. DECODE: Iterative token generation, one token per step
5. STREAM: Results yielded incrementally via iterate()
6. COMPLETE: Stop condition met, full result returned
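The six stages can be expressed as a small state machine. The state names below mirror the list; STREAM is modeled as an activity within DECODE rather than a state of its own, which is an interpretation choice, not the library's internal representation.

```python
from enum import Enum, auto

class JobState(Enum):
    CREATED = auto()
    ENQUEUED = auto()
    PREFILL = auto()
    DECODE = auto()    # streaming happens during this state, one token/step
    COMPLETE = auto()

# Legal transitions mirroring lifecycle steps 1-6 above.
TRANSITIONS = {
    JobState.CREATED: {JobState.ENQUEUED},
    JobState.ENQUEUED: {JobState.PREFILL},
    JobState.PREFILL: {JobState.DECODE},
    JobState.DECODE: {JobState.DECODE, JobState.COMPLETE},
    JobState.COMPLETE: set(),
}

def advance(state, nxt):
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {nxt}")
    return nxt

s = JobState.CREATED
for nxt in (JobState.ENQUEUED, JobState.PREFILL, JobState.DECODE,
            JobState.DECODE, JobState.COMPLETE):
    s = advance(s, nxt)
```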
The generator maintains an internal scheduler that:
- Prioritizes prefill for newly enqueued jobs
- Batches decode steps across active jobs for GPU efficiency
- Manages cache page allocation and deallocation
- Handles job completion and result delivery
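One step of such a scheduler might look like the toy sketch below: prefill newly enqueued jobs first, then run a single batched decode over all active jobs. The structure is an assumption for illustration, not the real scheduler's code.

```python
from collections import deque

def scheduler_step(pending: deque, active: list, finished: list):
    """One toy scheduler step mirroring the bullets above."""
    # 1. Prioritize prefill for newly enqueued jobs.
    while pending:
        job = pending.popleft()
        job["prefilled"] = True
        active.append(job)
    # 2. Batch one decode step across all active jobs (GPU-efficient).
    for job in active:
        job["generated"] += 1
    # 3. Handle completion (stop condition: max_new_tokens reached).
    for job in active[:]:
        if job["generated"] >= job["max_new_tokens"]:
            active.remove(job)       # cache pages would be freed here
            finished.append(job["id"])

pending = deque([{"id": "a", "generated": 0, "max_new_tokens": 1},
                 {"id": "b", "generated": 0, "max_new_tokens": 2}])
active, finished = [], []
while pending or active:
    scheduler_step(pending, active, finished)
```

Batching the decode step is the key throughput lever: a single forward pass serves every active job, so adding jobs increases GPU utilization rather than wall-clock latency per token.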