Implementation: turboderp/exllamav2 - ExLlamaV2DynamicGenerator.iterate()
| Knowledge Sources | |
|---|---|
| Domains | Concurrent_Batching, Inference_Optimization, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for executing one batch processing step across all active generation jobs and collecting incremental results, provided by exllamav2.
Description
The iterate() method on ExLlamaV2DynamicGenerator executes a single forward pass across all active jobs, samples tokens, checks stop conditions, and returns a list of result dictionaries. It is called repeatedly in a loop until all jobs are complete.
The companion method num_remaining_jobs() returns the count of jobs still in the queue (pending or active), and is used as the loop termination condition.
Each result dictionary contains:
- "job" - Reference to the job object
- "stage" - One of "started", "streaming", or "eos"
- "eos" - Boolean indicating if the job has completed
- "identifier" - The user-defined identifier passed when creating the job
- "text" - Incremental text chunk (during streaming)
- "full_completion" - Complete generated text (only present on EOS)
- "new_tokens" - Number of new tokens generated so far
- "time_enqueued" - Queue wait time in seconds
- "time_prefill" - Prefill processing time in seconds
- "time_generate" - Token generation time in seconds
Usage
Use iterate() in a while loop driven by num_remaining_jobs() to process all enqueued generation jobs. This is the standard pattern for both bulk inference and streaming multimodal generation.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/generator/dynamic.py
- Lines: L915-1006 (iterate), L816-817 (num_remaining_jobs)
Signature
```python
def iterate(self) -> list[dict]:
    ...

def num_remaining_jobs(self) -> int:
    ...
```
Import
```python
from exllamav2.generator import ExLlamaV2DynamicGenerator

# iterate() and num_remaining_jobs() are methods on ExLlamaV2DynamicGenerator instances
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (self) | ExLlamaV2DynamicGenerator | Yes | The generator instance with one or more enqueued jobs |
Outputs
| Name | Type | Description |
|---|---|---|
| results | list[dict] | List of result dictionaries, one per job with activity this step. Each dict contains keys: "job", "stage", "eos", "identifier", "text", "full_completion" (on EOS), "new_tokens", "time_enqueued", "time_prefill", "time_generate" |
| remaining | int | From num_remaining_jobs(): count of jobs still pending or active in the generator queue |
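As a sketch of consuming this contract, the hypothetical helper below (not part of the library) turns one result dictionary into a log line. The key point it illustrates is that "full_completion" and the timing fields must be treated as EOS-only, while "text" is only meaningful during streaming.

```python
def summarize_result(result: dict) -> str:
    """Format one iterate() result dict as a log line (hypothetical helper)."""
    ident = result["identifier"]
    if result["eos"]:
        # "full_completion" and the timing fields are only guaranteed on EOS
        return (f"[{ident}] done: {result['new_tokens']} tokens in "
                f"{result['time_generate']:.2f}s")
    # During streaming, report the incremental chunk size instead
    chunk = result.get("text", "")
    return f"[{ident}] {result['stage']}: +{len(chunk)} chars"

print(summarize_result({"identifier": 7, "stage": "streaming", "eos": False,
                        "text": "Hello"}))
print(summarize_result({"identifier": 7, "stage": "eos", "eos": True,
                        "new_tokens": 42, "time_generate": 1.5}))
```

Using `.get("text", "")` rather than direct indexing keeps the helper safe on "started" results, which may carry no text chunk.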
Usage Examples
Basic Iterate Loop
```python
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob

# Assume jobs have been enqueued on `generator`
completions = {}
while generator.num_remaining_jobs() > 0:
    results = generator.iterate()
    for result in results:
        if result["eos"]:
            identifier = result["identifier"]
            completions[identifier] = result["full_completion"]
            print(f"Job {identifier} completed: {result['full_completion'][:100]}...")
```
Streaming Output
```python
while generator.num_remaining_jobs() > 0:
    results = generator.iterate()
    for result in results:
        if result["stage"] == "streaming":
            # Print incremental text as it is generated
            print(result["text"], end="", flush=True)
        elif result["eos"]:
            print(f"\n--- Job {result['identifier']} done ---")
            print(f"  Tokens:   {result['new_tokens']}")
            print(f"  Prefill:  {result['time_prefill']:.2f}s")
            print(f"  Generate: {result['time_generate']:.2f}s")
```
Bulk Processing with Timing
```python
results_list = []
while generator.num_remaining_jobs() > 0:
    results = generator.iterate()
    for result in results:
        if result["eos"]:
            results_list.append({
                "id": result["identifier"],
                "text": result["full_completion"],
                "tokens": result["new_tokens"],
                "time_enqueued": result["time_enqueued"],
                "time_prefill": result["time_prefill"],
                "time_generate": result["time_generate"],
            })

# Compute aggregate statistics
total_tokens = sum(r["tokens"] for r in results_list)
total_time = sum(r["time_generate"] for r in results_list)
print(f"Processed {len(results_list)} jobs, {total_tokens} tokens")
```