Principle: Turboderp-org ExLlamaV2 Streaming Output
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, Streaming, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The stream_ex() method performs one step of the token-by-token generation loop: each call produces one token and returns the decoded text chunk along with metadata about the generation state.
Description
After begin_stream_ex() initializes the generation context (encoding the prompt, prefilling the KV cache, and configuring sampling), stream_ex() is called repeatedly to generate text incrementally. Each call performs one iteration of the generation loop:
- Forward pass: The model processes the most recent token(s) and produces logits over the vocabulary.
- Sampling: The configured sampling strategy (temperature, top-k, top-p, etc.) selects the next token from the logit distribution.
- Cache update: The new token's key-value projections are stored in the KV cache for future attention.
- Decoding: The token ID is converted to text. Because tokens may represent partial characters (especially with byte-level BPE), the decoder buffers output until complete characters are formed.
- Stop condition checking: The generated token and accumulated text are checked against stop conditions (EOS tokens, stop strings). Stop string matching handles partial matches across chunk boundaries.
- Result construction: A dictionary is returned with the text chunk, EOS flag, token IDs, and optional metadata.
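The six steps above can be sketched as a toy loop. Everything here (`VOCAB`, `SCRIPT`, `stream_step`) is a hypothetical stand-in for illustration, not the ExLlamaV2 API:

```python
VOCAB = ["The", " answer", " is", " 42", ".", "<eos>"]
EOS_ID = 5
SCRIPT = [3, 4, 5]  # token IDs the toy "model" will emit in order

def stream_step(step, kv_cache):
    """One iteration: forward, sample, cache update, decode, stop check, result."""
    logits = [1.0 if i == SCRIPT[step] else 0.0 for i in range(len(VOCAB))]  # 1. forward (scripted)
    token_id = max(range(len(logits)), key=logits.__getitem__)               # 2. greedy sampling
    kv_cache.append(token_id)                                                # 3. cache update (toy)
    chunk = "" if token_id == EOS_ID else VOCAB[token_id]                    # 4. decode
    eos = token_id == EOS_ID                                                 # 5. stop condition
    return {"chunk": chunk, "eos": eos, "chunk_token_ids": [token_id]}       # 6. result dict

cache, text = [], ""
for step in range(len(SCRIPT)):
    r = stream_step(step, cache)
    text += r["chunk"]
    if r["eos"]:
        break
print(text)  # -> " 42."
```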
The two-method pattern (begin_stream_ex + stream_ex loop) provides several advantages:
- Responsive output: Text appears as it is generated, not after completion.
- Application control: The caller controls the loop and can abort, display, or process each chunk.
- Rich metadata: Optional probability, top-token, and logit data support visualization and debugging.
- Stop condition flexibility: Multiple stop conditions can be checked per token, with proper handling of multi-token stop strings.
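Handling partial stop-string matches across chunk boundaries can be sketched as follows; `check_stop_strings` is a hypothetical helper written for this page, not the library's implementation. The idea is to hold back any suffix of the buffer that could still grow into a stop string:

```python
def check_stop_strings(buffer, stop_strings):
    """Return (safe_text, held_back, stopped): text safe to emit now, a
    suffix that might still grow into a stop string, and whether one matched."""
    for s in stop_strings:
        idx = buffer.find(s)
        if idx != -1:
            return buffer[:idx], "", True   # full match: emit text before it, stop
    # Hold back the longest suffix that is a proper prefix of some stop string
    held = 0
    for s in stop_strings:
        for n in range(1, len(s)):
            if buffer.endswith(s[:n]):
                held = max(held, n)
    return buffer[:len(buffer) - held], buffer[len(buffer) - held:], False

safe, held, stopped = check_stop_strings("Hello </s", ["</stop>"])
print(repr(safe), repr(held), stopped)  # -> 'Hello ' '</s' False
```

The held-back text is prepended to the next chunk before checking again, so a stop string split across two chunks is still caught and never leaks to the caller.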
Usage
Use the begin_stream_ex()/stream_ex() pattern when:
- Building interactive chat interfaces with streaming output
- Needing per-token probability or logit data
- Implementing custom stop logic beyond simple string matching
- Requiring fine-grained control over the generation loop (e.g., to update UI, log tokens, or implement timeouts)
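A caller-owned loop with a timeout might look like the sketch below. `StubStreamer` is a stand-in that mimics the two-method pattern; a real application would call `begin_stream_ex()`/`stream_ex()` on an ExLlamaV2 streaming generator instead:

```python
import time

class StubStreamer:
    """Minimal stand-in for a streaming generator exposing the two-method pattern."""
    def __init__(self, chunks):
        self._chunks = list(chunks)
    def begin_stream_ex(self, prompt):
        self._pos = 0  # reset state; the real method would prefill the KV cache
    def stream_ex(self):
        if self._pos >= len(self._chunks):
            return {"chunk": "", "eos": True}
        chunk = self._chunks[self._pos]
        self._pos += 1
        return {"chunk": chunk, "eos": False}

def generate(gen, prompt, max_new_tokens=256, timeout_s=5.0):
    """Caller-owned loop: collect each chunk, abort on timeout or token budget."""
    gen.begin_stream_ex(prompt)
    deadline = time.monotonic() + timeout_s
    pieces = []
    for _ in range(max_new_tokens):
        r = gen.stream_ex()
        if r["eos"] or time.monotonic() > deadline:
            break
        pieces.append(r["chunk"])  # an app might instead stream this to a UI
    return "".join(pieces)

gen = StubStreamer(["Hello", ",", " world", "!"])
print(generate(gen, "Say hi"))  # -> Hello, world!
```

Because the caller owns the loop, the same skeleton accommodates UI updates, per-token logging, or custom abort logic without any change to the generator itself.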
Theoretical Basis
Streaming Generation State Machine
```
# State transitions for streaming generation:
IDLE       -> PREFILLING  (on begin_stream_ex)
PREFILLING -> GENERATING  (after the prompt is processed)
GENERATING -> GENERATING  (on each stream_ex call; one token generated)
GENERATING -> FINISHED    (on EOS, stop condition, or max tokens)
FINISHED   -> IDLE        (ready for the next begin_stream_ex)
```
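The diagram above can be encoded as a small transition table; this is a sketch of the state machine for clarity, not code from the library:

```python
from enum import Enum, auto

class StreamState(Enum):
    IDLE = auto()
    PREFILLING = auto()
    GENERATING = auto()
    FINISHED = auto()

# Legal transitions, mirroring the diagram above
TRANSITIONS = {
    StreamState.IDLE:       {StreamState.PREFILLING},
    StreamState.PREFILLING: {StreamState.GENERATING},
    StreamState.GENERATING: {StreamState.GENERATING, StreamState.FINISHED},
    StreamState.FINISHED:   {StreamState.IDLE},
}

def advance(state, new_state):
    """Move to new_state, raising on an illegal transition."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

s = StreamState.IDLE
s = advance(s, StreamState.PREFILLING)  # begin_stream_ex
s = advance(s, StreamState.GENERATING)  # prompt processed
s = advance(s, StreamState.GENERATING)  # each stream_ex call
s = advance(s, StreamState.FINISHED)    # EOS / stop condition / max tokens
```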
Result Dictionary Structure
```python
result = {
    "chunk": str,               # Decoded text for this step (may be empty if the
                                # token completes no printable characters yet)
    "eos": bool,                # True if generation should stop
    "chunk_token_ids":          # Token IDs generated in this step
        torch.Tensor,           # Shape: (1, num_tokens) - usually 1 token,
                                # more with speculative decoding
    # Optional (when requested in begin_stream_ex):
    "probs": torch.Tensor,      # Probability of the selected token(s)
    "top_tokens": list[tuple],  # Top-k (token_id, probability) pairs
    "logits": torch.Tensor,     # Raw logits over the full vocabulary
}
```
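A caller might consume the optional metadata fields like this. The result dict below is fabricated for illustration (plain Python lists stand in for torch tensors, and the token IDs are made up):

```python
# Fabricated example result; a real one carries torch.Tensor values.
result = {
    "chunk": " 42",
    "eos": False,
    "chunk_token_ids": [[1065]],
    "probs": [0.91],
    "top_tokens": [(1065, 0.91), (718, 0.05), (902, 0.02)],
}

def format_top_tokens(result, k=3):
    """Render the top-k alternatives, e.g. for a debugging overlay."""
    pairs = result.get("top_tokens", [])[:k]
    return ", ".join(f"id={tid} p={p:.2f}" for tid, p in pairs)

print(f"chunk={result['chunk']!r}  {format_top_tokens(result)}")
```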
Token Healing at Boundaries
```python
# When token_healing=True in begin_stream_ex():
# the last token of the prompt is "rewound" (removed from the cache)
# and re-generated alongside the first completion token. This prevents
# tokenization artifacts at the prompt-completion boundary.
#
# Example without healing:
#   prompt tokens:   ["The", " answer", " is"]
#   first generated: [" 42"]              # awkward space handling
#
# Example with healing:
#   prompt tokens:   ["The", " answer", " is"] -> rewind to ["The", " answer"]
#   first generated: [" is", " 42"]       # natural continuation
```
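The rewind step itself can be sketched on plain token lists; `heal_prompt` is a hypothetical helper for illustration, not the library's cache operation:

```python
def heal_prompt(prompt_tokens):
    """Token-healing sketch: split off the last prompt token so it can be
    re-sampled together with the first completion token."""
    if len(prompt_tokens) < 2:
        return prompt_tokens, None  # nothing to rewind
    return prompt_tokens[:-1], prompt_tokens[-1]

prompt = ["The", " answer", " is"]
kept, rewound = heal_prompt(prompt)
print(kept)           # -> ['The', ' answer']
print(repr(rewound))  # -> ' is'
# The sampler is then constrained so the regenerated token's text starts
# with the rewound text (" is"), yielding a natural continuation instead
# of an awkward boundary token.
```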