# Implementation: ExLlamaV2DynamicJob (turboderp-org/exllamav2)
| Knowledge Sources | |
|---|---|
| Domains | Concurrent_Batching, Inference_Optimization, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
## Overview
Concrete tool for creating individual generation jobs with per-request configuration for the ExLlamaV2 dynamic generator, provided by exllamav2.
## Description
`ExLlamaV2DynamicJob` encapsulates a single generation request with its own input tokens, generation settings, stop conditions, optional multimodal embeddings, and a tracking identifier. Jobs are created independently and then enqueued into the `ExLlamaV2DynamicGenerator` for concurrent processing.
The `__init__` method accepts all configuration for the generation task. The generator's `enqueue()` method submits the job for processing and returns a serial number. Once enqueued, the job progresses through prefill and decode stages, with results delivered through the generator's `iterate()` method.
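The typical consumption pattern is to call `iterate()` in a loop and route each result by job. The sketch below uses mocked results; the result-dict keys (`stage`, `identifier`, `eos`, `text`) are assumptions about the shape of `iterate()`'s output, so verify them against your exllamav2 version:

```python
# Sketch of the consumption pattern for generator.iterate().
# The result-dict keys ("stage", "identifier", "eos", "text") are
# ASSUMPTIONS about iterate()'s return shape, not a confirmed API.

def collect(results_stream):
    """Accumulate streamed text per job until each job reports EOS."""
    completions = {}
    for results in results_stream:         # each iterate() call returns a list
        for r in results:
            if r["stage"] != "streaming":  # skip prefill progress updates
                continue
            key = r["identifier"]
            completions[key] = completions.get(key, "") + r.get("text", "")
    return completions

# Mocked iterate() output for two concurrent jobs:
mock_stream = [
    [{"stage": "prefill", "identifier": "q1", "eos": False}],
    [{"stage": "streaming", "identifier": "q1", "eos": False, "text": "Pa"},
     {"stage": "streaming", "identifier": "q2", "eos": False, "text": "DN"}],
    [{"stage": "streaming", "identifier": "q1", "eos": True, "text": "ris"},
     {"stage": "streaming", "identifier": "q2", "eos": True, "text": "A"}],
]
print(collect(mock_stream))  # {'q1': 'Paris', 'q2': 'DNA'}
```

Because results from different jobs interleave within a single `iterate()` call, keying the accumulator on the job's identifier is what keeps concurrent outputs separate.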
Key features:
- **Per-job generation settings**: each job can have different sampling parameters
- **Per-job stop conditions**: different stop tokens or strings per request
- **Per-job embeddings**: essential for multimodal inference where each request has different images
- **Identifier tracking**: a user-defined object returned with results for request correlation
## Usage
Use this when you need the job-based generation API for multimodal inference, bulk dataset processing, or any scenario requiring per-request configuration. Create one job per inference request, enqueue it, and collect results via the iterate loop.
## Code Reference
### Source Location
- Repository: exllamav2
- File: exllamav2/generator/dynamic.py
- Lines: L1601-1810 (ExLlamaV2DynamicJob.__init__), L832-853 (enqueue)
### Signature

```python
class ExLlamaV2DynamicJob:
    def __init__(
        self,
        input_ids: torch.Tensor,
        max_new_tokens: int,
        decode_special_tokens: bool = True,
        stop_conditions: list = ...,
        gen_settings: ExLlamaV2Sampler.Settings = ...,
        identifier: object = None,
        embeddings: list[ExLlamaV2MMEmbedding] = ...,
        **kwargs
    ):
        ...

# Enqueue method on the generator:
def enqueue(self, job: ExLlamaV2DynamicJob) -> int:
    ...
```
### Import

```python
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob
```
## I/O Contract
### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.Tensor | Yes | Tokenized prompt as a tensor of shape (1, seq_len); may contain multimodal token IDs if embeddings are provided |
| max_new_tokens | int | Yes | Maximum number of tokens to generate for this job |
| decode_special_tokens | bool | No | Whether to decode special tokens in the output text; default True |
| stop_conditions | list | No | List of stop tokens (int) or stop strings (str) that signal generation completion |
| gen_settings | ExLlamaV2Sampler.Settings | No | Sampling settings (temperature, top_k, top_p, etc.) for this job |
| identifier | object | No | User-defined object returned with results for tracking which job produced which output; default None |
| embeddings | list[ExLlamaV2MMEmbedding] | No | List of multimodal embedding containers for vision-language inference; their token IDs must be present in input_ids |
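Since `stop_conditions` mixes token IDs (`int`) and stop strings (`str`), the two kinds are checked differently during decode. The function below is a simplified illustration of that distinction, not the library's actual implementation (which must also handle stop strings that straddle chunk boundaries):

```python
def hit_stop_condition(new_token_id, text_so_far, stop_conditions):
    """Simplified sketch: a job stops when the latest token matches a stop
    token ID, or the decoded text contains a stop string. This is NOT the
    real exllamav2 implementation, just an illustration of the contract."""
    for cond in stop_conditions:
        if isinstance(cond, int) and new_token_id == cond:
            return True
        if isinstance(cond, str) and cond in text_so_far:
            return True
    return False

assert hit_stop_condition(2, "Hello", [2, "\n\n"])           # token-ID match
assert hit_stop_condition(7, "Hello\n\nWorld", [2, "\n\n"])  # stop-string match
assert not hit_stop_condition(7, "Hello", [2, "\n\n"])
```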
### Outputs
| Name | Type | Description |
|---|---|---|
| serial_number | int | Returned by generator.enqueue(job); unique serial number for the enqueued job |
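The serial number lets callers map enqueued jobs back to their requests even without an identifier. The toy class below mimics that contract with a hypothetical counter; the real generator assigns serials internally:

```python
import itertools

class SerialTracker:
    """Toy stand-in for the enqueue() contract: each enqueued job receives
    a unique, increasing serial number. Hypothetical illustration only;
    ExLlamaV2DynamicGenerator manages serials itself."""
    def __init__(self):
        self._next = itertools.count(1)
        self.jobs = {}

    def enqueue(self, job) -> int:
        serial = next(self._next)
        self.jobs[serial] = job   # keep a serial -> job map for lookups
        return serial

tracker = SerialTracker()
s1 = tracker.enqueue({"identifier": "question_1"})
s2 = tracker.enqueue({"identifier": "question_2"})
print(s1, s2)  # 1 2
```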
## Usage Examples

### Basic

```python
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob
from exllamav2 import ExLlamaV2Sampler

# Assume generator and tokenizer are already initialized
gen_settings = ExLlamaV2Sampler.Settings(
    temperature=0.7,
    top_k=50,
    top_p=0.9
)

input_ids = tokenizer.encode("What is the capital of France?")

job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    max_new_tokens=200,
    gen_settings=gen_settings,
    stop_conditions=[tokenizer.eos_token_id],
    identifier="question_1"
)

serial = generator.enqueue(job)
```
### Multimodal Job

```python
from PIL import Image

# Get image embeddings
image = Image.open("/path/to/photo.jpg")
embedding = model.vision_model.get_image_embeddings(
    model=model, tokenizer=tokenizer, image=image
)

# Encode the prompt with the image placeholder alias
prompt = f"Describe this image: {embedding.text_alias}"
input_ids = tokenizer.encode(prompt, embeddings=[embedding])

# Create the job with embeddings attached
job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    max_new_tokens=500,
    stop_conditions=[tokenizer.eos_token_id],
    identifier="image_description",
    embeddings=[embedding]
)

generator.enqueue(job)
```
### Bulk Enqueue

```python
# Enqueue multiple jobs for concurrent processing,
# reusing gen_settings from the Basic example above
prompts = ["Explain gravity.", "What is DNA?", "Describe photosynthesis."]
for i, prompt in enumerate(prompts):
    input_ids = tokenizer.encode(prompt)
    job = ExLlamaV2DynamicJob(
        input_ids=input_ids,
        max_new_tokens=300,
        gen_settings=gen_settings,
        stop_conditions=[tokenizer.eos_token_id],
        identifier=f"batch_{i}"
    )
    generator.enqueue(job)
```
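After bulk enqueueing, results are drained by looping until no jobs remain. The sketch below mirrors the common `num_remaining_jobs()` / `iterate()` pattern with a mock generator that decodes one chunk per job per step; the method names follow the real generator's API, but everything else here is simulation, and the result-dict keys are assumptions:

```python
# Mock drain loop mirroring the real pattern:
#   while generator.num_remaining_jobs():
#       for result in generator.iterate(): ...
# MockGenerator is a toy: it yields one text chunk per pending job per
# iterate() call, simulating concurrent round-robin decoding.

class MockGenerator:
    def __init__(self, answers):
        self.pending = dict(answers)   # identifier -> list of text chunks

    def num_remaining_jobs(self):
        return len(self.pending)

    def iterate(self):
        results, done = [], []
        for ident, chunks in self.pending.items():
            text = chunks.pop(0)
            eos = not chunks           # job finishes when its chunks run out
            results.append({"identifier": ident, "text": text, "eos": eos})
            if eos:
                done.append(ident)
        for ident in done:
            del self.pending[ident]
        return results

generator = MockGenerator({
    "batch_0": ["Gravity ", "curves spacetime."],
    "batch_1": ["DNA ", "stores genes."],
})

outputs = {}
while generator.num_remaining_jobs():
    for r in generator.iterate():
        outputs[r["identifier"]] = outputs.get(r["identifier"], "") + r["text"]

print(outputs)
# {'batch_0': 'Gravity curves spacetime.', 'batch_1': 'DNA stores genes.'}
```

The key point the simulation shows: one `iterate()` call can return partial results for several jobs at once, so output must be accumulated per identifier rather than per call.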