Principle: Turboderp-org ExLlamaV2 Job-Based Generation
| Knowledge Sources | |
|---|---|
| Domains | Concurrent_Batching, Inference_Optimization, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Job-based generation provides fine-grained control over individual inference requests by encapsulating each generation task as an independent job with its own inputs, settings, and tracking identifier.
Description
For advanced use cases such as multimodal inference, bulk processing, and concurrent request handling, the Dynamic Generator provides a job-based API as an alternative to the high-level generate() method. In the job-based approach:
Job Creation: Each inference request is encapsulated as an ExLlamaV2DynamicJob object. The job carries:
- Input token IDs - the tokenized prompt
- Generation settings - sampling parameters (temperature, top-k, top-p, etc.)
- Stop conditions - tokens or strings that signal generation completion
- Maximum new tokens - upper bound on generated length
- Multimodal embeddings - optional image/vision embeddings for VLM inference
- Identifier - a user-defined object for tracking which results belong to which request
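The fields above can be sketched as a plain data record. The class and field names below are hypothetical and simply mirror the list; they are not the real `ExLlamaV2DynamicJob` constructor signature.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Hypothetical sketch of a job record; field names mirror the list above,
# not the actual ExLlamaV2DynamicJob API.
@dataclass
class GenerationJob:
    input_ids: list[int]                    # tokenized prompt
    settings: dict[str, float]              # sampling parameters
    stop_conditions: list[Any]              # token IDs or strings
    max_new_tokens: int                     # upper bound on generated length
    embeddings: Optional[list[Any]] = None  # optional multimodal embeddings
    identifier: Any = None                  # user-defined tracking object

job = GenerationJob(
    input_ids=[1, 15043, 3186],
    settings={"temperature": 0.8, "top_k": 50, "top_p": 0.9},
    stop_conditions=[2, "\n\n"],
    max_new_tokens=128,
    identifier="request-42",
)
```

The `identifier` can be any object (a string, an index into a dataset, a request handle), since the generator only passes it back with results.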
Job Enqueuing: Created jobs are submitted to the generator's internal queue via the enqueue() method, which returns a serial number for the job. The generator maintains a pool of active jobs and processes them concurrently.
Concurrent Processing: The generator uses paged attention (in the style of PagedAttention) to share KV cache memory efficiently across multiple active jobs. Cache pages can be reused by jobs that share a prompt prefix, reducing memory consumption and improving throughput.
Per-Job Control: Each job operates independently: it has its own generation settings, stop conditions, and result stream. This enables mixed workloads where different requests have different requirements (e.g., different sampling temperatures or different stop tokens).
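The enqueue/iterate flow can be illustrated with a toy mock generator. The names and result shape below are assumptions for illustration; the real generator runs model forward passes on each step rather than decrementing counters.

```python
import itertools

# Hypothetical mock illustrating the enqueue()/iterate() flow; the real
# ExLlamaV2 generator performs a model forward pass per step.
class MockGenerator:
    def __init__(self):
        self._serials = itertools.count(1)
        self._active = {}  # serial number -> [identifier, tokens remaining]

    def enqueue(self, identifier, num_tokens):
        serial = next(self._serials)
        self._active[serial] = [identifier, num_tokens]
        return serial  # a serial number, as described above

    def iterate(self):
        # One decode step: each active job emits one result.
        results = []
        for serial in list(self._active):
            ident, _ = self._active[serial]
            self._active[serial][1] -= 1
            eos = self._active[serial][1] == 0
            results.append({"identifier": ident, "eos": eos})
            if eos:
                del self._active[serial]  # job complete, leave the pool
        return results

gen = MockGenerator()
gen.enqueue("job-a", num_tokens=2)  # e.g., low temperature, short stop
gen.enqueue("job-b", num_tokens=3)  # e.g., high temperature
finished = []
while gen._active:
    for r in gen.iterate():
        if r["eos"]:
            finished.append(r["identifier"])
```

Both jobs advance on every `iterate()` call, which is what makes the workload concurrent rather than sequential.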
Usage
Use job-based generation when you need:
- Per-request multimodal embeddings (each job can carry different image embeddings)
- Concurrent processing of multiple requests with different settings
- Fine-grained control over individual request lifecycle
- Streaming results from multiple concurrent generations
- Bulk inference over datasets with result tracking via identifiers
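For bulk inference, the identifier is what routes interleaved streamed results back to their originating requests. The sketch below assumes a simplified result-dict shape; the keys are illustrative, not the real iterate() result format.

```python
# Hypothetical result stream: each dict is one streamed chunk carrying the
# identifier of the job it belongs to (keys assumed for illustration).
stream = [
    {"identifier": 0, "text": "The"},
    {"identifier": 1, "text": "Bonjour"},
    {"identifier": 0, "text": " answer"},
    {"identifier": 1, "text": " le monde"},
]

# Route each chunk back to its originating request via the identifier.
outputs = {}
for result in stream:
    outputs.setdefault(result["identifier"], []).append(result["text"])

completions = {k: "".join(v) for k, v in outputs.items()}
```

Because results from different jobs arrive interleaved, this identifier-keyed accumulation is the standard pattern for dataset-scale batch runs.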
Theoretical Basis
The job-based design follows the principles of PagedAttention:
Traditional KV Cache:
- Each request allocates contiguous memory for max_seq_len
- Memory waste from over-allocation and fragmentation
PagedAttention KV Cache:
- Memory divided into fixed-size pages
- Each job's KV cache mapped to non-contiguous pages
- Pages allocated on-demand as sequences grow
- Shared prefixes can share pages (copy-on-write)
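The page-based bookkeeping above can be demonstrated with a toy allocator: fixed-size pages, per-job page tables, and reference counts on shared prefix pages. This is purely illustrative and not exllamav2's actual cache implementation; real page sizes are much larger (e.g., 256 tokens).

```python
PAGE_SIZE = 4  # tokens per cache page (toy value for illustration)

class PagedCache:
    """Toy allocator: each job maps token positions to non-contiguous pages."""
    def __init__(self):
        self.refcount = {}   # page id -> number of jobs referencing it
        self.tables = {}     # job id -> list of page ids
        self.next_page = 0

    def alloc_job(self, job_id, num_tokens, share_from=None):
        needed = -(-num_tokens // PAGE_SIZE)  # ceiling division
        pages = []
        if share_from is not None:
            # Shared prefix: reuse the other job's pages, bump refcounts.
            for p in self.tables[share_from][:needed]:
                self.refcount[p] += 1
                pages.append(p)
        while len(pages) < needed:
            # Allocate fresh pages on demand as the sequence grows.
            p, self.next_page = self.next_page, self.next_page + 1
            self.refcount[p] = 1
            pages.append(p)
        self.tables[job_id] = pages

cache = PagedCache()
cache.alloc_job("a", num_tokens=8)                   # gets pages 0, 1
cache.alloc_job("b", num_tokens=12, share_from="a")  # shares 0, 1; adds 2
```

A real implementation would additionally copy a shared page before a job writes into it (copy-on-write) and decrement refcounts on job completion; the sketch only shows the sharing and on-demand growth.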
Job Lifecycle:
1. CREATE: Job(input_ids, settings, embeddings, identifier)
2. ENQUEUE: generator.enqueue(job) -> serial_number
3. PREFILL: First forward pass processes full input sequence
4. DECODE: Iterative token generation, one token per step
5. STREAM: Results yielded incrementally via iterate()
6. COMPLETE: Stop condition met, full result returned
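The six stages can be expressed as a small state machine. The state names below mirror the list; STREAM is modeled as an activity within DECODE rather than a state of its own, which is an interpretation choice, not the library's internal representation.

```python
from enum import Enum, auto

class JobState(Enum):
    CREATED = auto()
    ENQUEUED = auto()
    PREFILL = auto()
    DECODE = auto()    # streaming happens during this state, one token/step
    COMPLETE = auto()

# Legal transitions mirroring lifecycle steps 1-6 above.
TRANSITIONS = {
    JobState.CREATED: {JobState.ENQUEUED},
    JobState.ENQUEUED: {JobState.PREFILL},
    JobState.PREFILL: {JobState.DECODE},
    JobState.DECODE: {JobState.DECODE, JobState.COMPLETE},
    JobState.COMPLETE: set(),
}

def advance(state, nxt):
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {nxt}")
    return nxt

s = JobState.CREATED
for nxt in (JobState.ENQUEUED, JobState.PREFILL, JobState.DECODE,
            JobState.DECODE, JobState.COMPLETE):
    s = advance(s, nxt)
```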
The generator maintains an internal scheduler that:
- Prioritizes prefill for newly enqueued jobs
- Batches decode steps across active jobs for GPU efficiency
- Manages cache page allocation and deallocation
- Handles job completion and result delivery
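One step of such a scheduler might look like the toy sketch below: prefill newly enqueued jobs first, then run a single batched decode over all active jobs. The structure is an assumption for illustration, not the real scheduler's code.

```python
from collections import deque

def scheduler_step(pending: deque, active: list, finished: list):
    """One toy scheduler step mirroring the bullets above."""
    # 1. Prioritize prefill for newly enqueued jobs.
    while pending:
        job = pending.popleft()
        job["prefilled"] = True
        active.append(job)
    # 2. Batch one decode step across all active jobs (GPU-efficient).
    for job in active:
        job["generated"] += 1
    # 3. Handle completion (stop condition: max_new_tokens reached).
    for job in active[:]:
        if job["generated"] >= job["max_new_tokens"]:
            active.remove(job)       # cache pages would be freed here
            finished.append(job["id"])

pending = deque([{"id": "a", "generated": 0, "max_new_tokens": 1},
                 {"id": "b", "generated": 0, "max_new_tokens": 2}])
active, finished = [], []
while pending or active:
    scheduler_step(pending, active, finished)
```

Batching the decode step is the key throughput lever: a single forward pass serves every active job, so adding jobs increases GPU utilization rather than wall-clock latency per token.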