Principle:Turboderp org Exllamav2 Job Based Generation

From Leeroopedia
Knowledge Sources
Domains Concurrent_Batching, Inference_Optimization, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

Job-based generation provides fine-grained control over individual inference requests by encapsulating each generation task as an independent job with its own inputs, settings, and tracking identifier.

Description

For advanced use cases such as multimodal inference, bulk processing, and concurrent request handling, the Dynamic Generator provides a job-based API as an alternative to the high-level generate() method. In the job-based approach:

Job Creation: Each inference request is encapsulated as an ExLlamaV2DynamicJob object. The job carries:

  • Input token IDs - the tokenized prompt
  • Generation settings - sampling parameters (temperature, top-k, top-p, etc.)
  • Stop conditions - tokens or strings that signal generation completion
  • Maximum new tokens - upper bound on generated length
  • Multimodal embeddings - optional image/vision embeddings for VLM inference
  • Identifier - a user-defined object for tracking which results belong to which request

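The fields above can be sketched as a simple container. This is an illustrative stand-in, not the actual `ExLlamaV2DynamicJob` signature; the field names and types here are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class GenerationJob:
    """Illustrative stand-in for a job object such as ExLlamaV2DynamicJob."""
    input_ids: list[int]                                  # tokenized prompt
    max_new_tokens: int                                   # upper bound on generated length
    gen_settings: dict = field(default_factory=dict)      # temperature, top_k, top_p, ...
    stop_conditions: list = field(default_factory=list)   # token IDs or strings
    embeddings: Optional[Any] = None                      # optional multimodal embeddings
    identifier: Any = None                                # user-defined tracking handle

job = GenerationJob(
    input_ids=[1, 15043, 29892],
    max_new_tokens=128,
    gen_settings={"temperature": 0.7, "top_p": 0.9},
    stop_conditions=["</answer>"],
    identifier="request-42",
)
```

The identifier can be any object, which lets callers attach arbitrary bookkeeping (a request ID, a dataset row index, a callback) to each job.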
Job Enqueuing: Created jobs are submitted to the generator's internal queue via the enqueue() method, which returns a serial number for the job. The generator maintains a pool of active jobs and processes them concurrently.

Concurrent Processing: The generator uses PagedAttention to share KV cache memory efficiently across multiple active jobs. Cache pages can be reused by jobs with a shared prompt prefix, reducing memory consumption and improving throughput.

Per-Job Control: Each job operates independently, with its own generation settings, stop conditions, and result stream. This enables mixed workloads in which different requests have different requirements (e.g., different sampling temperatures or different stop tokens).

Usage

Use job-based generation when you need:

  • Per-request multimodal embeddings (each job can carry different image embeddings)
  • Concurrent processing of multiple requests with different settings
  • Fine-grained control over individual request lifecycle
  • Streaming results from multiple concurrent generations
  • Bulk inference over datasets with result tracking via identifiers
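The enqueue/iterate pattern described above can be sketched with a minimal stand-in generator. The real classes live in `exllamav2.generator`; the class below only simulates the shape of the loop, and the result-dict keys (`identifier`, `text`, `eos`) are assumptions based on this description, not the library's actual output format:

```python
class FakeDynamicGenerator:
    """Stand-in that mimics the shape of a job-based generation loop."""
    def __init__(self):
        self._jobs = []           # each entry: [serial, identifier, tokens_left]
        self._next_serial = 0

    def enqueue(self, identifier, num_tokens):
        serial = self._next_serial
        self._next_serial += 1
        self._jobs.append([serial, identifier, num_tokens])
        return serial             # enqueue() returns a serial number, as described

    def num_remaining_jobs(self):
        return len(self._jobs)

    def iterate(self):
        """One decode step: every active job emits one chunk of streamed text."""
        results, done = [], []
        for job in self._jobs:
            serial, ident, left = job
            job[2] -= 1
            results.append({"identifier": ident, "text": f"tok{left} ", "eos": job[2] == 0})
            if job[2] == 0:
                done.append(job)
        for job in done:
            self._jobs.remove(job)
        return results

generator = FakeDynamicGenerator()
for i, length in enumerate([2, 3]):           # two concurrent "requests"
    generator.enqueue(identifier=i, num_tokens=length)

collected = {0: "", 1: ""}
while generator.num_remaining_jobs():         # stream until all jobs complete
    for r in generator.iterate():
        collected[r["identifier"]] += r["text"]
```

Routing each streamed chunk by its identifier is what makes bulk inference with result tracking work: any number of jobs can be in flight, and every partial result still lands in the right bucket.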

Theoretical Basis

The job-based design follows the principles of PagedAttention:

Traditional KV Cache:
  - Each request allocates contiguous memory for max_seq_len
  - Memory waste from over-allocation and fragmentation

PagedAttention KV Cache:
  - Memory divided into fixed-size pages
  - Each job's KV cache mapped to non-contiguous pages
  - Pages allocated on-demand as sequences grow
  - Shared prefixes can share pages (copy-on-write)
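The page-sharing idea can be illustrated with a toy allocator. The page size, the refcount scheme, and keying shared pages by (offset, contents) are all simplifying assumptions for illustration; they are not how ExLlamaV2 implements its cache internally:

```python
PAGE_SIZE = 4  # tokens per cache page (toy value; real page sizes are larger, e.g. 256)

class PagedCache:
    """Toy KV-cache allocator with refcounted sharing of position-aligned prefix pages."""
    def __init__(self):
        self.refcount = {}        # page_id -> number of jobs referencing the page
        self.index = {}           # (offset, token_chunk) -> page_id, for dedup
        self.next_id = 0

    def map_sequence(self, tokens):
        """Return the page IDs backing `tokens`, sharing full pages with equal prefixes."""
        page_ids = []
        for i in range(0, len(tokens), PAGE_SIZE):
            chunk = tuple(tokens[i:i + PAGE_SIZE])
            key = (i, chunk)
            if len(chunk) == PAGE_SIZE and key in self.index:
                pid = self.index[key]             # reuse the shared page
                self.refcount[pid] += 1
            else:
                pid = self.next_id                # allocate a fresh page on demand
                self.next_id += 1
                self.refcount[pid] = 1
                if len(chunk) == PAGE_SIZE:       # only full pages are shareable
                    self.index[key] = pid
            page_ids.append(pid)
        return page_ids

cache = PagedCache()
a = cache.map_sequence([1, 2, 3, 4, 5, 6])   # pages for [1,2,3,4] and [5,6]
b = cache.map_sequence([1, 2, 3, 4, 9, 9])   # shares the full [1,2,3,4] page with `a`
```

The second sequence reuses the first page rather than allocating its own, which is the mechanism behind the memory savings for jobs with a common prompt prefix; a copy-on-write step (not modeled here) would split a shared page the moment one job needs to write into it.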

Job Lifecycle:
  1. CREATE:  Job(input_ids, settings, embeddings, identifier)
  2. ENQUEUE: generator.enqueue(job) -> serial_number
  3. PREFILL: First forward pass processes full input sequence
  4. DECODE:  Iterative token generation, one token per step
  5. STREAM:  Results yielded incrementally via iterate()
  6. COMPLETE: Stop condition met, full result returned
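The six stages above can be traced with a trivial walk-through. Only the stage names come from the lifecycle list; everything else (the function, its parameters, the token-budget stop condition) is illustrative:

```python
def run_job(prompt_len, max_new_tokens):
    """Walk one job through the lifecycle, recording each stage it enters."""
    trace = ["CREATE", "ENQUEUE"]
    trace.append("PREFILL")                   # full input sequence processed in one pass
    generated = 0
    while generated < max_new_tokens:         # DECODE/STREAM repeat once per token
        generated += 1
        trace.append("DECODE")
        trace.append("STREAM")                # token yielded incrementally via iterate()
    trace.append("COMPLETE")                  # stop condition met: token budget exhausted
    return trace

trace = run_job(prompt_len=5, max_new_tokens=2)
```

Note that prefill happens once regardless of prompt length (the whole input is one forward pass), while decode and stream alternate once per generated token.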

The generator maintains an internal scheduler that:

  • Prioritizes prefill for newly enqueued jobs
  • Batches decode steps across active jobs for GPU efficiency
  • Manages cache page allocation and deallocation
  • Handles job completion and result delivery
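The scheduler's prefill-first, batched-decode policy can be sketched as a single scheduling pass. The two-phase structure and the job-dict fields are assumptions about how such a scheduler is typically organized, not the actual ExLlamaV2 scheduler:

```python
def scheduler_step(jobs):
    """One scheduling pass over active jobs.

    Each job is a dict with 'state' in {'queued', 'decoding', 'done'} and
    'remaining' tokens still to generate.
    """
    # Phase 1: prioritize prefill for newly enqueued jobs.
    for job in jobs:
        if job["state"] == "queued":
            job["state"] = "decoding"      # prefill done: full prompt processed

    # Phase 2: batch one decode step across all active jobs for GPU efficiency.
    batch = [j for j in jobs if j["state"] == "decoding"]
    for job in batch:
        job["remaining"] -= 1              # one token per job per step
        if job["remaining"] == 0:
            job["state"] = "done"          # stop condition met; deliver result
    return len(batch)                      # decode batch size this step

jobs = [
    {"state": "queued", "remaining": 1},
    {"state": "queued", "remaining": 2},
]
steps = 0
while any(j["state"] != "done" for j in jobs):
    scheduler_step(jobs)
    steps += 1
```

Batching the decode phase is the key efficiency point: one forward pass advances every active job by one token, instead of running a separate pass per request.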
