Principle: turboderp-org/exllamav2 Streaming Generation
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, Streaming, User_Interface |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Streaming generation produces text token-by-token, yielding partial results as they are generated rather than waiting for the full completion, enabling responsive user interfaces.
Description
In many interactive applications (chatbots, code assistants, writing tools), users benefit from seeing text appear incrementally as the model generates it, rather than waiting for the entire response to complete. Streaming generation achieves this by:
- Token-by-token output: After each forward pass, the generated token is immediately decoded and yielded to the caller. This allows the application to display text as it appears, providing a responsive user experience.
- Two-phase operation: The streaming generator uses a two-method pattern:
  - begin_stream_ex(): Initializes the generation context by encoding the prompt, setting up the KV cache, configuring sampling parameters, and performing the initial prefill forward pass.
  - stream_ex(): Called repeatedly in a loop; each call generates one token (or a small batch with speculative decoding), decodes it, checks stop conditions, and returns a result dict.
- Speculative decoding support: The streaming generator can optionally use a smaller draft model to speculatively generate multiple candidate tokens, which are then verified by the main model in a single forward pass. When candidates are accepted, multiple tokens are produced per iteration, improving throughput.
- Single-sequence focus: Unlike the Dynamic Generator, which handles concurrent batching, the Streaming Generator is optimized for single-sequence interactive use. This makes it simpler and more predictable for applications that process one conversation at a time.
- Rich result metadata: Each streaming step can optionally return probability information, top-k token candidates, and raw logits alongside the generated text, useful for debugging, visualization, or custom sampling logic.
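The two-phase pattern and the optional metadata fields can be illustrated with a minimal, runnable sketch. The stub class below stands in for the real streaming generator; the attribute `return_probabilities` and the result keys (`chunk`, `eos`, `probs`) follow the pattern described above but are assumptions for illustration, not the exact library API.

```python
# Minimal sketch of the two-phase streaming pattern with optional
# per-token metadata. StubStreamingGenerator is a stand-in for a real
# model-backed generator; field names are illustrative assumptions.

class StubStreamingGenerator:
    def __init__(self, chunks):
        self._chunks = list(chunks)
        self.return_probabilities = False  # opt-in metadata flag

    def begin_stream_ex(self, input_ids, settings):
        # Phase 1: reset generation state (a real generator would also
        # prefill the KV cache from the prompt here)
        self._pos = 0

    def stream_ex(self):
        # Phase 2: emit one decoded chunk per call, plus a stop flag
        chunk = self._chunks[self._pos]
        self._pos += 1
        result = {"chunk": chunk, "eos": self._pos == len(self._chunks)}
        if self.return_probabilities:
            result["probs"] = [0.9]  # placeholder probability for the token
        return result

gen = StubStreamingGenerator(["Hel", "lo", "!"])
gen.return_probabilities = True
gen.begin_stream_ex(input_ids=[1, 2, 3], settings=None)

full_text = ""
while True:
    r = gen.stream_ex()
    full_text += r["chunk"]   # accumulate, and display incrementally
    if r["eos"]:
        break

print(full_text)  # -> Hello!
```

The stub keeps the caller-side loop identical to what a real integration would look like: only the generator object changes.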
Usage
Use the Streaming Generator when:
- Building interactive chat interfaces where text should appear incrementally
- Implementing single-user CLI chat applications
- Needing fine-grained control over the generation loop
- Requiring per-token metadata (probabilities, logits)
- Using speculative decoding for single-sequence throughput improvement
For multi-user concurrent serving, use the Dynamic Generator instead.
Theoretical Basis
Streaming Generation Loop
```python
# Phase 1: Initialize stream
generator.begin_stream_ex(input_ids, gen_settings)

# Phase 2: Token-by-token generation
full_text = ""
while True:
    result = generator.stream_ex()

    # result["chunk"] contains the new text fragment
    full_text += result["chunk"]
    display(result["chunk"])  # Show to user immediately

    # result["eos"] signals end of generation
    if result["eos"]:
        break
```
Speculative Decoding in Streaming
```python
# With speculative decoding, each stream_ex() call may produce multiple
# tokens if the draft model's predictions are accepted
# (pseudocode -- draft_model and main_model stand in for the real models):

# Step 1: Draft model generates k candidate tokens
draft_tokens = draft_model.forward(context, k=5)

# Step 2: Main model verifies all k tokens in a single forward pass
logits = main_model.forward(context + draft_tokens)

# Step 3: Accept matching tokens; reject from the first mismatch onward.
# Accepted tokens appear together in result["chunk"] as one decoded string.
# If 4 of 5 tokens are accepted, that step emits 4 tokens at once,
# roughly a 4x throughput improvement for that step.
```
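The accept/reject rule in step 3 can be made concrete with a toy, runnable sketch. This uses greedy verification (accept a drafted token only while it matches the main model's prediction at the same position); the token ids and the `accept_draft` helper are illustrative, not library code.

```python
# Toy sketch of greedy speculative verification: accept drafted tokens
# while they agree with the main model, stop at the first mismatch.

def accept_draft(draft_tokens, main_predictions):
    """Return the prefix of draft_tokens confirmed by the main model."""
    accepted = []
    for drafted, verified in zip(draft_tokens, main_predictions):
        if drafted != verified:  # first mismatch ends acceptance
            break
        accepted.append(drafted)
    return accepted

draft = [11, 22, 33, 44, 55]      # k = 5 candidates from the draft model
main  = [11, 22, 33, 44, 99]      # main model agrees on the first 4
print(accept_draft(draft, main))  # -> [11, 22, 33, 44]
```

All accepted tokens would then be decoded together and emitted as a single multi-token chunk, which is why one streaming step can produce more than one token.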
Comparison with Dynamic Generator
Streaming Generator:
- Single sequence at a time
- Simple begin/stream loop
- Immediate per-token output
- Best for interactive single-user applications

Dynamic Generator:
- Multiple concurrent sequences
- Paged attention with page tables
- Job-based API
- Best for server/batch workloads
Related Pages
Implemented By
Uses Heuristic