Principle: Turboderp-org ExLlamaV2 Streaming Output
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, Streaming, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The stream_ex() method performs one step of the token-by-token generation loop: each call produces one token and returns the decoded text chunk along with metadata about the generation state.
Description
After begin_stream_ex() initializes the generation context (encoding the prompt, prefilling the KV cache, and configuring sampling), stream_ex() is called repeatedly to generate text incrementally. Each call performs one iteration of the generation loop:
- Forward pass: The model processes the most recent token(s) and produces logits over the vocabulary.
- Sampling: The configured sampling strategy (temperature, top-k, top-p, etc.) selects the next token from the logit distribution.
- Cache update: The new token's key-value projections are stored in the KV cache for future attention.
- Decoding: The token ID is converted to text. Because tokens may represent partial characters (especially with byte-level BPE), the decoder buffers output until complete characters are formed.
- Stop condition checking: The generated token and accumulated text are checked against stop conditions (EOS tokens, stop strings). Stop string matching handles partial matches across chunk boundaries.
- Result construction: A dictionary is returned with the text chunk, EOS flag, token IDs, and optional metadata.
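The six steps above can be sketched as a toy loop. Everything here (`VOCAB`, `SCRIPT`, `stream_step`) is a hypothetical stand-in for illustration, not the ExLlamaV2 API:

```python
VOCAB = ["The", " answer", " is", " 42", ".", "<eos>"]
EOS_ID = 5
SCRIPT = [3, 4, 5]  # token IDs the toy "model" will emit in order

def stream_step(step, kv_cache):
    """One iteration: forward, sample, cache update, decode, stop check, result."""
    logits = [1.0 if i == SCRIPT[step] else 0.0 for i in range(len(VOCAB))]  # 1. forward (scripted)
    token_id = max(range(len(logits)), key=logits.__getitem__)               # 2. greedy sampling
    kv_cache.append(token_id)                                                # 3. cache update (toy)
    chunk = "" if token_id == EOS_ID else VOCAB[token_id]                    # 4. decode
    eos = token_id == EOS_ID                                                 # 5. stop condition
    return {"chunk": chunk, "eos": eos, "chunk_token_ids": [token_id]}       # 6. result dict

cache, text = [], ""
for step in range(len(SCRIPT)):
    r = stream_step(step, cache)
    text += r["chunk"]
    if r["eos"]:
        break
print(text)  # -> " 42."
```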
The two-method pattern (begin_stream_ex + stream_ex loop) provides several advantages:
- Responsive output: Text appears as it is generated, not after completion.
- Application control: The caller controls the loop and can abort, display, or process each chunk.
- Rich metadata: Optional probability, top-token, and logit data support visualization and debugging.
- Stop condition flexibility: Multiple stop conditions can be checked per token, with proper handling of multi-token stop strings.
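Handling partial stop-string matches across chunk boundaries can be sketched as follows; `check_stop_strings` is a hypothetical helper written for this page, not the library's implementation. The idea is to hold back any suffix of the buffer that could still grow into a stop string:

```python
def check_stop_strings(buffer, stop_strings):
    """Return (safe_text, held_back, stopped): text safe to emit now, a
    suffix that might still grow into a stop string, and whether one matched."""
    for s in stop_strings:
        idx = buffer.find(s)
        if idx != -1:
            return buffer[:idx], "", True   # full match: emit text before it, stop
    # Hold back the longest suffix that is a proper prefix of some stop string
    held = 0
    for s in stop_strings:
        for n in range(1, len(s)):
            if buffer.endswith(s[:n]):
                held = max(held, n)
    return buffer[:len(buffer) - held], buffer[len(buffer) - held:], False

safe, held, stopped = check_stop_strings("Hello </s", ["</stop>"])
print(repr(safe), repr(held), stopped)  # -> 'Hello ' '</s' False
```

The held-back text is prepended to the next chunk before checking again, so a stop string split across two chunks is still caught and never leaks to the caller.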
Usage
Use the begin_stream_ex()/stream_ex() pattern when:
- Building interactive chat interfaces with streaming output
- Needing per-token probability or logit data
- Implementing custom stop logic beyond simple string matching
- Requiring fine-grained control over the generation loop (e.g., to update UI, log tokens, or implement timeouts)
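A caller-owned loop with a timeout might look like the sketch below. `StubStreamer` is a stand-in that mimics the two-method pattern; a real application would call `begin_stream_ex()`/`stream_ex()` on an ExLlamaV2 streaming generator instead:

```python
import time

class StubStreamer:
    """Minimal stand-in for a streaming generator exposing the two-method pattern."""
    def __init__(self, chunks):
        self._chunks = list(chunks)
    def begin_stream_ex(self, prompt):
        self._pos = 0  # reset state; the real method would prefill the KV cache
    def stream_ex(self):
        if self._pos >= len(self._chunks):
            return {"chunk": "", "eos": True}
        chunk = self._chunks[self._pos]
        self._pos += 1
        return {"chunk": chunk, "eos": False}

def generate(gen, prompt, max_new_tokens=256, timeout_s=5.0):
    """Caller-owned loop: collect each chunk, abort on timeout or token budget."""
    gen.begin_stream_ex(prompt)
    deadline = time.monotonic() + timeout_s
    pieces = []
    for _ in range(max_new_tokens):
        r = gen.stream_ex()
        if r["eos"] or time.monotonic() > deadline:
            break
        pieces.append(r["chunk"])  # an app might instead stream this to a UI
    return "".join(pieces)

gen = StubStreamer(["Hello", ",", " world", "!"])
print(generate(gen, "Say hi"))  # -> Hello, world!
```

Because the caller owns the loop, the same skeleton accommodates UI updates, per-token logging, or custom abort logic without any change to the generator itself.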
Theoretical Basis
Streaming Generation State Machine
```
# State transitions for streaming generation:
IDLE       -> PREFILLING  (on begin_stream_ex)
PREFILLING -> GENERATING  (after the prompt is processed)
GENERATING -> GENERATING  (on each stream_ex call; one token generated)
GENERATING -> FINISHED    (on EOS, stop condition, or max tokens)
FINISHED   -> IDLE        (ready for the next begin_stream_ex)
```
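The diagram above can be encoded as a small transition table; this is a sketch of the state machine for clarity, not code from the library:

```python
from enum import Enum, auto

class StreamState(Enum):
    IDLE = auto()
    PREFILLING = auto()
    GENERATING = auto()
    FINISHED = auto()

# Legal transitions, mirroring the diagram above
TRANSITIONS = {
    StreamState.IDLE:       {StreamState.PREFILLING},
    StreamState.PREFILLING: {StreamState.GENERATING},
    StreamState.GENERATING: {StreamState.GENERATING, StreamState.FINISHED},
    StreamState.FINISHED:   {StreamState.IDLE},
}

def advance(state, new_state):
    """Move to new_state, raising on an illegal transition."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

s = StreamState.IDLE
s = advance(s, StreamState.PREFILLING)  # begin_stream_ex
s = advance(s, StreamState.GENERATING)  # prompt processed
s = advance(s, StreamState.GENERATING)  # each stream_ex call
s = advance(s, StreamState.FINISHED)    # EOS / stop condition / max tokens
```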
Result Dictionary Structure
```python
result = {
    "chunk": str,               # Decoded text for this step (may be empty if the
                                # token completes no printable characters yet)
    "eos": bool,                # True if generation should stop
    "chunk_token_ids":          # Token IDs generated in this step
        torch.Tensor,           # Shape: (1, num_tokens) - usually 1 token,
                                # more with speculative decoding
    # Optional (when requested in begin_stream_ex):
    "probs": torch.Tensor,      # Probability of the selected token(s)
    "top_tokens": list[tuple],  # Top-k (token_id, probability) pairs
    "logits": torch.Tensor,     # Raw logits over the full vocabulary
}
```
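A caller might consume the optional metadata fields like this. The result dict below is fabricated for illustration (plain Python lists stand in for torch tensors, and the token IDs are made up):

```python
# Fabricated example result; a real one carries torch.Tensor values.
result = {
    "chunk": " 42",
    "eos": False,
    "chunk_token_ids": [[1065]],
    "probs": [0.91],
    "top_tokens": [(1065, 0.91), (718, 0.05), (902, 0.02)],
}

def format_top_tokens(result, k=3):
    """Render the top-k alternatives, e.g. for a debugging overlay."""
    pairs = result.get("top_tokens", [])[:k]
    return ", ".join(f"id={tid} p={p:.2f}" for tid, p in pairs)

print(f"chunk={result['chunk']!r}  {format_top_tokens(result)}")
```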
Token Healing at Boundaries
```python
# When token_healing=True in begin_stream_ex():
# the last token of the prompt is "rewound" (removed from the cache)
# and re-generated alongside the first completion token. This prevents
# tokenization artifacts at the prompt-completion boundary.
#
# Example without healing:
#   prompt tokens:   ["The", " answer", " is"]
#   first generated: [" 42"]              # awkward space handling
#
# Example with healing:
#   prompt tokens:   ["The", " answer", " is"] -> rewind to ["The", " answer"]
#   first generated: [" is", " 42"]       # natural continuation
```
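The rewind step itself can be sketched on plain token lists; `heal_prompt` is a hypothetical helper for illustration, not the library's cache operation:

```python
def heal_prompt(prompt_tokens):
    """Token-healing sketch: split off the last prompt token so it can be
    re-sampled together with the first completion token."""
    if len(prompt_tokens) < 2:
        return prompt_tokens, None  # nothing to rewind
    return prompt_tokens[:-1], prompt_tokens[-1]

prompt = ["The", " answer", " is"]
kept, rewound = heal_prompt(prompt)
print(kept)           # -> ['The', ' answer']
print(repr(rewound))  # -> ' is'
# The sampler is then constrained so the regenerated token's text starts
# with the rewound text (" is"), yielding a natural continuation instead
# of an awkward boundary token.
```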