Implementation: turboderp/exllamav2 - ExLlamaV2DynamicGenerator.iterate()
| Knowledge Sources | |
|---|---|
| Domains | Concurrent_Batching, Inference_Optimization, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool for executing one batch processing step across all active generation jobs and collecting incremental results, provided by exllamav2.
Description
The iterate() method on ExLlamaV2DynamicGenerator executes a single forward pass across all active jobs, samples tokens, checks stop conditions, and returns a list of result dictionaries. It is called repeatedly in a loop until all jobs are complete.
The companion method num_remaining_jobs() returns the count of jobs still in the queue (pending or active), and is used as the loop termination condition.
Each result dictionary contains:
- "job" - Reference to the job object
- "stage" - One of "started", "streaming", or "eos"
- "eos" - Boolean indicating if the job has completed
- "identifier" - The user-defined identifier passed when creating the job
- "text" - Incremental text chunk (during streaming)
- "full_completion" - Complete generated text (only present on EOS)
- "new_tokens" - Number of new tokens generated so far
- "time_enqueued" - Queue wait time in seconds
- "time_prefill" - Prefill processing time in seconds
- "time_generate" - Token generation time in seconds
Usage
Use iterate() in a while loop driven by num_remaining_jobs() to process all enqueued generation jobs. This is the standard pattern for both bulk inference and streaming multimodal generation.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/generator/dynamic.py
- Lines: L915-1006 (iterate), L816-817 (num_remaining_jobs)
Signature
```python
def iterate(self) -> list[dict]:
    ...

def num_remaining_jobs(self) -> int:
    ...
```
Import
```python
from exllamav2.generator import ExLlamaV2DynamicGenerator

# iterate() and num_remaining_jobs() are methods on ExLlamaV2DynamicGenerator instances
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (self) | ExLlamaV2DynamicGenerator | Yes | The generator instance with one or more enqueued jobs |
Outputs
| Name | Type | Description |
|---|---|---|
| results | list[dict] | List of result dictionaries, one per job with activity this step. Each dict contains keys: "job", "stage", "eos", "identifier", "text", "full_completion" (on EOS), "new_tokens", "time_enqueued", "time_prefill", "time_generate" |
| remaining | int | From num_remaining_jobs(): count of jobs still pending or active in the generator queue |
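As a sketch of consuming this contract, the hypothetical helper below (not part of the library) turns one result dictionary into a log line. The key point it illustrates is that "full_completion" and the timing fields must be treated as EOS-only, while "text" is only meaningful during streaming.

```python
def summarize_result(result: dict) -> str:
    """Format one iterate() result dict as a log line (hypothetical helper)."""
    ident = result["identifier"]
    if result["eos"]:
        # "full_completion" and the timing fields are only guaranteed on EOS
        return (f"[{ident}] done: {result['new_tokens']} tokens in "
                f"{result['time_generate']:.2f}s")
    # During streaming, report the incremental chunk size instead
    chunk = result.get("text", "")
    return f"[{ident}] {result['stage']}: +{len(chunk)} chars"

print(summarize_result({"identifier": 7, "stage": "streaming", "eos": False,
                        "text": "Hello"}))
print(summarize_result({"identifier": 7, "stage": "eos", "eos": True,
                        "new_tokens": 42, "time_generate": 1.5}))
```

Using `.get("text", "")` rather than direct indexing keeps the helper safe on "started" results, which may carry no text chunk.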
Usage Examples
Basic Iterate Loop
```python
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob

# Assume jobs have been enqueued on `generator`
completions = {}
while generator.num_remaining_jobs() > 0:
    results = generator.iterate()
    for result in results:
        if result["eos"]:
            identifier = result["identifier"]
            completions[identifier] = result["full_completion"]
            print(f"Job {identifier} completed: {result['full_completion'][:100]}...")
```
Streaming Output
```python
while generator.num_remaining_jobs() > 0:
    results = generator.iterate()
    for result in results:
        if result["stage"] == "streaming":
            # Print incremental text as it is generated
            print(result["text"], end="", flush=True)
        elif result["eos"]:
            print(f"\n--- Job {result['identifier']} done ---")
            print(f"  Tokens:   {result['new_tokens']}")
            print(f"  Prefill:  {result['time_prefill']:.2f}s")
            print(f"  Generate: {result['time_generate']:.2f}s")
```
Bulk Processing with Timing
```python
results_list = []
while generator.num_remaining_jobs() > 0:
    results = generator.iterate()
    for result in results:
        if result["eos"]:
            results_list.append({
                "id": result["identifier"],
                "text": result["full_completion"],
                "tokens": result["new_tokens"],
                "time_enqueued": result["time_enqueued"],
                "time_prefill": result["time_prefill"],
                "time_generate": result["time_generate"],
            })

# Compute aggregate statistics
total_tokens = sum(r["tokens"] for r in results_list)
total_time = sum(r["time_generate"] for r in results_list)
print(f"Processed {len(results_list)} jobs, {total_tokens} tokens")
```