
Implementation:Turboderp org Exllamav2 ExLlamaV2DynamicJob

From Leeroopedia
Knowledge Sources
Domains Concurrent_Batching, Inference_Optimization, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool for creating individual generation jobs with per-request configuration for the ExLlamaV2 dynamic generator, provided by exllamav2.

Description

ExLlamaV2DynamicJob encapsulates a single generation request with its own input tokens, generation settings, stop conditions, optional multimodal embeddings, and a tracking identifier. Jobs are created independently and then enqueued into the ExLlamaV2DynamicGenerator for concurrent processing.

The __init__ method accepts all configuration for the generation task. The enqueue() method on the generator submits the job for processing and returns a serial number. Once enqueued, the job progresses through prefill and decode stages, with results delivered through the generator's iterate() method.

Key features:

  • Per-job generation settings - Each job can have different sampling parameters
  • Per-job stop conditions - Different stop tokens or strings per request
  • Per-job embeddings - Essential for multimodal inference where each request has different images
  • Identifier tracking - User-defined object returned with results for request correlation

Usage

Use this when you need the job-based generation API for multimodal inference, bulk dataset processing, or any scenario requiring per-request configuration. Create one job per inference request, enqueue it, and collect results via the iterate loop.
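As a sketch of the per-request configuration pattern, the helper below (`build_job_args` is illustrative, not part of the exllamav2 API) maps a list of request dicts to the keyword arguments each ExLlamaV2DynamicJob would receive, so every request can carry its own token budget, sampling settings, and stop conditions:

```python
def build_job_args(requests, default_max_new_tokens=256):
    """Map request dicts to per-job keyword arguments.

    Each request may independently override max_new_tokens,
    gen_settings, and stop_conditions; identifiers default to
    a positional label for later result correlation.
    """
    job_args = []
    for i, req in enumerate(requests):
        args = {
            "input_ids": req["input_ids"],  # pre-tokenized prompt tensor
            "max_new_tokens": req.get("max_new_tokens", default_max_new_tokens),
            "identifier": req.get("identifier", f"req_{i}"),
        }
        # Only pass optional parameters a request actually sets,
        # so unset ones fall back to the job's own defaults
        for key in ("gen_settings", "stop_conditions"):
            if key in req:
                args[key] = req[key]
        job_args.append(args)
    return job_args
```

Each dict can then be expanded into a job, e.g. `ExLlamaV2DynamicJob(**args)`, before enqueueing.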

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/generator/dynamic.py
  • Lines: L1601-1810 (ExLlamaV2DynamicJob.__init__), L832-853 (enqueue)

Signature

class ExLlamaV2DynamicJob:
    def __init__(
        self,
        input_ids: torch.Tensor,
        max_new_tokens: int,
        decode_special_tokens: bool = True,
        stop_conditions: list = ...,
        gen_settings: ExLlamaV2Sampler.Settings = ...,
        identifier: object = None,
        embeddings: list[ExLlamaV2MMEmbedding] = ...,
        **kwargs
    ):
        ...

# Enqueue method on generator:
def enqueue(self, job: ExLlamaV2DynamicJob) -> int:
    ...

Import

from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| input_ids | torch.Tensor | Yes | Tokenized prompt as a tensor of shape (1, seq_len); may contain multimodal token IDs if embeddings are provided |
| max_new_tokens | int | Yes | Maximum number of tokens to generate for this job |
| decode_special_tokens | bool | No | Whether to decode special tokens in the output text; default True |
| stop_conditions | list | No | List of stop tokens (int) or stop strings (str) that signal generation completion |
| gen_settings | ExLlamaV2Sampler.Settings | No | Sampling settings (temperature, top_k, top_p, etc.) for this job |
| identifier | object | No | User-defined object returned with results for tracking which job produced which output; default None |
| embeddings | list[ExLlamaV2MMEmbedding] | No | List of multimodal embedding containers for vision-language inference; their token IDs must be present in input_ids |
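As the table notes, stop_conditions may mix integer token IDs and literal strings. A small sketch (the token ID shown is illustrative; in practice use the tokenizer's actual EOS id):

```python
# stop_conditions may freely mix token IDs (int) and strings (str).
# The id below is a placeholder; normally use tokenizer.eos_token_id.
stop_conditions = [
    2,            # a token id that ends generation (model-specific)
    "\n\nUser:",  # also stop when the model begins a new chat turn
]
```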

Outputs

| Name | Type | Description |
|------|------|-------------|
| serial_number | int | Returned by generator.enqueue(job); unique serial number for the enqueued job |

Usage Examples

Basic

from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob
from exllamav2 import ExLlamaV2Sampler

# Assume generator, tokenizer are initialized
gen_settings = ExLlamaV2Sampler.Settings(
    temperature=0.7,
    top_k=50,
    top_p=0.9
)

input_ids = tokenizer.encode("What is the capital of France?")

job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    max_new_tokens=200,
    gen_settings=gen_settings,
    stop_conditions=[tokenizer.eos_token_id],
    identifier="question_1"
)

serial = generator.enqueue(job)

Multimodal Job

from PIL import Image

# Get image embeddings
image = Image.open("/path/to/photo.jpg")
embedding = model.vision_model.get_image_embeddings(
    model=model, tokenizer=tokenizer, image=image
)

# Encode prompt with image placeholder
prompt = f"Describe this image: {embedding.text_alias}"
input_ids = tokenizer.encode(prompt, embeddings=[embedding])

# Create job with embeddings
job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    max_new_tokens=500,
    stop_conditions=[tokenizer.eos_token_id],
    identifier="image_description",
    embeddings=[embedding]
)

generator.enqueue(job)

Bulk Enqueue

# Enqueue multiple jobs for concurrent processing
prompts = ["Explain gravity.", "What is DNA?", "Describe photosynthesis."]

for i, prompt in enumerate(prompts):
    input_ids = tokenizer.encode(prompt)
    job = ExLlamaV2DynamicJob(
        input_ids=input_ids,
        max_new_tokens=300,
        gen_settings=gen_settings,
        stop_conditions=[tokenizer.eos_token_id],
        identifier=f"batch_{i}"
    )
    generator.enqueue(job)
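After enqueueing, output is delivered through the generator's iterate() method. The helper below is a minimal sketch (`collect_outputs` is illustrative, not part of exllamav2): it polls iterate() until num_remaining_jobs() reaches zero and concatenates each job's streamed "text" chunks under its identifier, assuming the result-dict keys of the dynamic generator's streaming interface:

```python
def collect_outputs(generator):
    """Drain every enqueued job, returning {identifier: full_text}."""
    outputs = {}
    while generator.num_remaining_jobs():
        # iterate() advances all active jobs one step and returns a
        # list of per-job result dicts for that step
        for result in generator.iterate():
            ident = result["identifier"]
            # "text" carries the newly decoded chunk, when one exists
            outputs[ident] = outputs.get(ident, "") + result.get("text", "")
    return outputs

# texts = collect_outputs(generator)  # one entry per identifier, e.g. "batch_0"
```

Because each job carries its identifier, results can be matched back to prompts regardless of the order in which jobs finish.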
