Principle:Turboderp org Exllamav2 Batch Job Iteration

From Leeroopedia
Knowledge Sources
Domains Concurrent_Batching, Inference_Optimization, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

Batch job iteration is the core processing loop that drives concurrent text generation by executing one forward pass per call across all active jobs, returning incremental results for each.

Description

The iterate() method is the heartbeat of the job-based generation system. Each invocation performs a single batch processing step across all active jobs in the generator's queue. The processing within a single iterate() call includes:

Forward Pass: The model runs a single batched forward pass that processes tokens from all active jobs simultaneously. This amortizes the cost of loading model weights across multiple concurrent requests.

Token Sampling: For each job in the decode phase, a new token is sampled according to that job's individual generation settings (temperature, top-k, top-p, etc.).

Stop Condition Checking: Each job's newly generated token is checked against its stop conditions (EOS tokens, stop strings). Jobs that meet a stop condition are marked as complete.

Result Assembly: The method returns a list of result dictionaries, one for each job that has something to report. Results include streaming text chunks, timing statistics, and completion signals.
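The per-job sampling described above, where each job applies its own settings to its logits, can be illustrated with a toy temperature + top-k sampler. This is a pure-Python sketch for intuition, not ExLlamaV2's actual sampling kernel; the function name and parameter values are illustrative:

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=2, rng=None):
    """Toy sampler: temperature scaling, then top-k filtering and renormalization."""
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    # Indices of the top_k highest-scoring tokens
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    m = max(scaled[i] for i in top)                       # subtract max for numerical stability
    weights = [math.exp(scaled[i] - m) for i in top]
    total = sum(weights)
    # Draw from the renormalized top-k distribution
    r = rng.random()
    acc = 0.0
    for idx, w in zip(top, weights):
        acc += w / total
        if r <= acc:
            return idx
    return top[-1]

token = sample_token([1.0, 3.0, 2.0, 0.5], temperature=1.0, top_k=2)
```

In the real generator, each job carries its own gen_settings, so different jobs in the same batched forward pass can sample with different temperatures or top-k values.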

The standard processing loop calls iterate() repeatedly until num_remaining_jobs() returns 0, indicating all enqueued jobs have completed:

while generator.num_remaining_jobs() > 0:
    results = generator.iterate()
    for result in results:
        # Process streaming text, check for completion
        print(result.get("text", ""), end="")

This design supports continuous batching: new jobs can be enqueued between iterate() calls, allowing the system to maintain high throughput by filling GPU capacity as existing jobs complete.
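This continuous-batching behavior can be sketched with a minimal pure-Python simulation. The classes below are toy stand-ins, not the real ExLlamaV2 API: each iterate() call fills free batch slots from a pending queue, so a job enqueued between calls joins the batch as soon as an existing job completes.

```python
from collections import deque

class MockJob:
    def __init__(self, identifier, length):
        self.identifier = identifier
        self.remaining = length  # tokens left to generate

class MockGenerator:
    """Toy stand-in for a dynamic generator: one 'forward pass' per iterate()."""
    def __init__(self, max_batch_size=4):
        self.max_batch_size = max_batch_size
        self.pending = deque()
        self.active = []

    def enqueue(self, job):
        self.pending.append(job)

    def num_remaining_jobs(self):
        return len(self.pending) + len(self.active)

    def iterate(self):
        # SCHEDULE: fill free batch slots from the pending queue
        while self.pending and len(self.active) < self.max_batch_size:
            self.active.append(self.pending.popleft())
        results = []
        # FORWARD PASS + SAMPLE: one token per active job per call
        for job in list(self.active):
            job.remaining -= 1
            done = job.remaining == 0
            results.append({"identifier": job.identifier, "eos": done})
            if done:
                self.active.remove(job)  # slot frees up for the next iterate()
        return results

gen = MockGenerator(max_batch_size=2)
gen.enqueue(MockJob("a", 3))
gen.enqueue(MockJob("b", 2))

steps = 0
finished = []
while gen.num_remaining_jobs() > 0:
    if steps == 1:
        gen.enqueue(MockJob("c", 2))  # new work arrives mid-stream
    for r in gen.iterate():
        if r["eos"]:
            finished.append(r["identifier"])
    steps += 1
```

Job "c" cannot start until "b" finishes and frees a batch slot, yet the loop never idles: every step processes a full batch whenever enough work is queued.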

Usage

Use the iterate loop whenever you process jobs through the dynamic generator; it is the mechanism for receiving streaming output from all active jobs, tracking completion, and collecting timing statistics.

Theoretical Basis

Iterate Processing Pipeline (one call):

1. SCHEDULE:
   - Select active jobs for this batch step
   - Identify jobs needing prefill vs. decode
   - Allocate/manage cache pages

2. FORWARD PASS:
   - Batch input tokens from all active jobs
   - Run model forward pass (attention + MLP layers)
   - Output: logits for next token per job

3. SAMPLE:
   - For each job, apply its gen_settings to logits
   - Sample next token per job independently
   - Append token to each job's sequence

4. CHECK STOP CONDITIONS:
   - For each job, check if new token matches stop_conditions
   - Mark completed jobs for result delivery

5. RETURN RESULTS:
   results = [
     {
       "job": job_ref,
       "stage": "started" | "streaming" | "eos",
       "eos": bool,
       "identifier": user_defined_id,
       "text": incremental_text_chunk,
       "full_completion": complete_text (on EOS only),
       "new_tokens": count,
       "time_enqueued": float,
       "time_prefill": float,
       "time_generate": float
     },
     ...
   ]
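Step 4 of the pipeline above can be sketched as a standalone check. This is illustrative only; hits_stop_condition and its mixed list of token IDs (ints) and stop strings (strs) are assumptions for the sketch, not the library's actual implementation:

```python
def hits_stop_condition(token_id, text_tail, stop_conditions):
    """Return True if the new token or recent text matches any stop condition.

    stop_conditions is assumed to mix token IDs (ints, e.g. an EOS id)
    and stop strings (strs) checked against the recent decoded text.
    """
    for cond in stop_conditions:
        if isinstance(cond, int) and token_id == cond:
            return True
        if isinstance(cond, str) and cond in text_tail:
            return True
    return False
```

A job whose check returns True would be marked complete and have its final result (with full_completion) emitted on the same or the next iterate() call.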

The stage field tracks the job's lifecycle:

  • "started" - Job has begun prefill processing
  • "streaming" - Job is actively generating tokens; text contains the latest chunk
  • "eos" - Job has completed; full_completion contains the entire generated text
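A hypothetical handler dispatching on the stage field might look like this; handle_result and the state dictionary are illustrative names, not part of the API:

```python
def handle_result(result, state):
    """Dispatch one iterate() result on its stage field, updating per-job state."""
    stage = result["stage"]
    ident = result["identifier"]
    if stage == "started":
        state[ident] = ""                              # prefill has begun
    elif stage == "streaming":
        state[ident] += result["text"]                 # latest incremental chunk
    elif stage == "eos":
        state[ident] = result["full_completion"]       # authoritative final text
    return state

state = {}
handle_result({"stage": "started", "identifier": "j"}, state)
handle_result({"stage": "streaming", "identifier": "j", "text": "par"}, state)
handle_result({"stage": "eos", "identifier": "j", "full_completion": "partial done"}, state)
```

Overwriting with full_completion at EOS, rather than trusting the accumulated chunks, guards against any chunk the consumer may have missed.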

The timing fields enable performance monitoring:

  • time_enqueued - How long the job waited before processing began
  • time_prefill - Time spent on the initial prompt processing
  • time_generate - Time spent on token-by-token generation
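These fields make it straightforward to derive throughput. For example, a hypothetical generation_speed helper (assuming time_generate is measured in seconds):

```python
def generation_speed(result):
    """Tokens per second for a finished job, from its timing fields."""
    if result["time_generate"] <= 0:
        return 0.0  # avoid division by zero for degenerate/empty jobs
    return result["new_tokens"] / result["time_generate"]

r = {"new_tokens": 128, "time_enqueued": 0.05, "time_prefill": 0.40, "time_generate": 4.0}
print(generation_speed(r))  # 32.0
```

Comparing time_enqueued across jobs also reveals queueing pressure: consistently large values suggest the generator's batch capacity is saturated.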
