Implementation:Turboderp org Exllamav2 Stream Ex
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, Streaming, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete tool from exllamav2 for token-by-token streaming text generation, built on the two-method begin/stream pattern.
Description
begin_stream_ex() and stream_ex() are the core methods of ExLlamaV2StreamingGenerator that implement the streaming generation loop.
begin_stream_ex() initializes a generation session:
- Encodes or accepts pre-encoded input token IDs
- Configures sampling settings, stop conditions, and optional features
- Performs the prefill forward pass to populate the KV cache with prompt context
- Sets up token healing, filters, banned strings, and probability tracking if requested
stream_ex() generates one step of output:
- Runs a forward pass for the current token position
- Samples the next token according to configured settings
- Updates the KV cache with new key-value projections
- Decodes the token to text, handling partial character buffering
- Checks stop conditions (EOS token, stop strings, max tokens)
- Returns a result dictionary with the text chunk and metadata
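The two-method contract above can be illustrated with a minimal stand-in that needs no model or GPU. `EchoStreamer` is a hypothetical class (not part of exllamav2) that mimics only the result-dict shape: `begin_stream_ex()` stores session state, and each `stream_ex()` call emits one chunk until a stop condition fires.

```python
# Hypothetical stand-in for the begin/stream contract. The real generator
# returns chunk_token_ids as a torch.Tensor; a list is used here for clarity.
class EchoStreamer:
    def begin_stream_ex(self, tokens, max_new_tokens=3):
        self.tokens = list(tokens)       # "prompt" tokens to echo back
        self.remaining = max_new_tokens  # stop condition: max tokens
        self.pos = 0

    def stream_ex(self) -> dict:
        done = self.remaining == 0 or self.pos >= len(self.tokens)
        if done:
            return {"chunk": "", "eos": True, "chunk_token_ids": []}
        tok = self.tokens[self.pos]
        self.pos += 1
        self.remaining -= 1
        return {"chunk": tok, "eos": False, "chunk_token_ids": [self.pos - 1]}

gen = EchoStreamer()
gen.begin_stream_ex(["Hello", ", ", "world", "!"], max_new_tokens=3)
out = ""
while True:
    result = gen.stream_ex()
    out += result["chunk"]
    if result["eos"]:
        break
print(out)  # Hello, world
```

The consuming loop is identical in shape to the real usage shown below: the caller never inspects generator internals, only the per-step dictionary.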
Usage
Use this two-method pattern for any streaming generation scenario. Call begin_stream_ex() once, then call stream_ex() in a loop until result["eos"] is True.
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/generator/streaming.py
- Lines: L241-396 (begin_stream_ex), L421-481 (stream_ex)
Signature
```python
def begin_stream_ex(
    self,
    input_ids: torch.Tensor,
    gen_settings: ExLlamaV2Sampler.Settings,
    token_healing: bool = False,
    loras: list | None = None,
    input_mask: torch.Tensor | None = None,
    position_offsets: torch.Tensor | None = None,
    return_probabilities: bool = False,
    return_top_tokens: int = 0,
    return_logits: bool = False,
    abort_event: threading.Event | None = None,
    input_embeddings: torch.Tensor | None = None,
    decode_special_tokens: bool = False,
    banned_strings: list[str] | None = None,
    filters: list | None = None,
    filter_prefer_eos: bool = False,
    **kwargs,
):
    ...

def stream_ex(
    self,
    ban_tokens: torch.Tensor | None = None,
    **kwargs,
) -> dict:
    ...
```
Import
```python
from exllamav2.generator import ExLlamaV2StreamingGenerator

# begin_stream_ex() and stream_ex() are methods on the generator instance
```
I/O Contract
begin_stream_ex Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.Tensor | Yes | Token IDs for the prompt, shape (1, seq_len) |
| gen_settings | ExLlamaV2Sampler.Settings | Yes | Sampling configuration (temperature, top-k, top-p, etc.) |
| token_healing | bool | No (default False) | Rewind last prompt token to avoid tokenization artifacts |
| loras | list or None | No (default None) | LoRA adapters to apply during generation |
| input_mask | torch.Tensor or None | No (default None) | Custom attention mask for the input |
| position_offsets | torch.Tensor or None | No (default None) | Custom position offsets for the input |
| return_probabilities | bool | No (default False) | Include token probabilities in stream_ex results |
| return_top_tokens | int | No (default 0) | Number of top tokens (with probabilities) to include in results; 0 = disabled |
| return_logits | bool | No (default False) | Include raw logits in stream_ex results |
| abort_event | threading.Event or None | No (default None) | Event to signal early termination from another thread |
| input_embeddings | torch.Tensor or None | No (default None) | Custom embeddings to prepend to input (for multimodal) |
| decode_special_tokens | bool | No (default False) | Include special tokens in decoded output text |
| banned_strings | list[str] or None | No (default None) | Strings that must not appear in generated output |
| filters | list or None | No (default None) | Constrained decoding filters (grammar, JSON schema, regex) |
| filter_prefer_eos | bool | No (default False) | Prefer EOS when filters allow it |
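Among these parameters, abort_event is the hook for cooperative cancellation: set the event (typically from another thread) and the generator ends the stream at the next step. A stand-in sketch of that contract (`_AbortableStub` is hypothetical, not exllamav2 code):

```python
import threading

# Hypothetical stand-in: the generator checks the event between steps
# and reports EOS once it has been set.
class _AbortableStub:
    def begin_stream_ex(self, abort_event=None, **kwargs):
        self.abort_event = abort_event
        self.step = 0

    def stream_ex(self):
        if self.abort_event is not None and self.abort_event.is_set():
            return {"chunk": "", "eos": True}
        self.step += 1
        return {"chunk": f"tok{self.step} ", "eos": self.step >= 100}

abort = threading.Event()
gen = _AbortableStub()
gen.begin_stream_ex(abort_event=abort)

chunks = []
while True:
    result = gen.stream_ex()
    chunks.append(result["chunk"])
    if len(chunks) == 3:
        abort.set()  # in practice, set from another thread on user cancel
    if result["eos"]:
        break

print(len(chunks))  # 4: three text chunks, then the empty abort chunk
```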
stream_ex Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ban_tokens | torch.Tensor or None | No (default None) | Token IDs to ban for this step only |
stream_ex Outputs
| Name | Type | Description |
|---|---|---|
| result | dict | Dictionary containing generation results for this step |
| result["chunk"] | str | Decoded text fragment generated in this step (may be empty for partial characters) |
| result["eos"] | bool | True if generation has ended (EOS token, stop condition, or max tokens) |
| result["chunk_token_ids"] | torch.Tensor | Token ID(s) generated in this step, shape (1, n) |
| result["probs"] | torch.Tensor | (Optional) Probability of selected token(s); only if return_probabilities=True |
| result["top_tokens"] | list[tuple] | (Optional) Top-k (token_id, probability) pairs; only if return_top_tokens > 0 |
| result["logits"] | torch.Tensor | (Optional) Raw logits over vocabulary; only if return_logits=True |
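The "may be empty for partial characters" behavior in result["chunk"] arises because a multi-byte UTF-8 character can span two tokens: the first step has nothing printable yet. A sketch of the buffering idea (my own illustration using Python's incremental decoder, not the library's code):

```python
import codecs

# The bytes of "é" (b'\xc3\xa9') split across two tokens: the first step
# yields an empty chunk because the character is incomplete; the second
# step completes and emits it.
token_bytes = [b"\xc3", b"\xa9"]
decoder = codecs.getincrementaldecoder("utf-8")()
chunks = [decoder.decode(b) for b in token_bytes]
print(chunks)  # ['', 'é']
```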
Usage Examples
Basic Streaming Loop
```python
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

input_ids = tokenizer.encode("Once upon a time", add_bos=True)
generator.begin_stream_ex(input_ids, settings)

output = ""
while True:
    result = generator.stream_ex()
    output += result["chunk"]
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()
```
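The loop above can also be wrapped in a Python generator function so callers simply iterate over chunks. A sketch of that wrapper; the `_Stub` class below is a hypothetical stand-in used only to demonstrate the iteration shape without a model:

```python
def stream_text(generator, *begin_args, **begin_kwargs):
    """Yield text chunks until the stream signals EOS."""
    generator.begin_stream_ex(*begin_args, **begin_kwargs)
    while True:
        result = generator.stream_ex()
        if result["chunk"]:
            yield result["chunk"]
        if result["eos"]:
            break

# With the real generator this would be:
#   for chunk in stream_text(generator, input_ids, settings):
#       print(chunk, end="", flush=True)

# Hypothetical stand-in to show the iteration shape:
class _Stub:
    def begin_stream_ex(self, *a, **k):
        self._chunks = ["Once ", "upon ", "a ", "time"]

    def stream_ex(self):
        if self._chunks:
            return {"chunk": self._chunks.pop(0), "eos": not self._chunks}
        return {"chunk": "", "eos": True}

text = "".join(stream_text(_Stub(), None, None))
print(text)  # Once upon a time
```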
Streaming with Probabilities
```python
import torch  # needed below for tokenizer.decode()

input_ids = tokenizer.encode("The capital of France is", add_bos=True)
generator.begin_stream_ex(
    input_ids,
    settings,
    return_probabilities=True,
    return_top_tokens=5,
)

while True:
    result = generator.stream_ex()
    if result["chunk"]:
        prob = result["probs"][0].item() if "probs" in result else 0.0
        print(f"{result['chunk']} (p={prob:.3f})", end="", flush=True)
        if "top_tokens" in result:
            for token_id, prob in result["top_tokens"][:3]:
                token_str = tokenizer.decode(torch.tensor([[token_id]]))
                print(f" [{token_str}: {prob:.3f}]", end="")
    if result["eos"]:
        break
print()
```
Streaming with Token Healing and Stop Conditions
```python
generator.set_stop_conditions([
    tokenizer.eos_token_id,
    "User:",
    "\n\n",
])

input_ids = tokenizer.encode(
    "User: What is Python?\nAssistant:",
    add_bos=True,
)
generator.begin_stream_ex(
    input_ids,
    settings,
    token_healing=True,
)

response = ""
while True:
    result = generator.stream_ex()
    response += result["chunk"]
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()
```
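A string stop condition like "User:" can straddle chunk boundaries, so text that might be the start of a stop string has to be held back rather than emitted immediately. A simplified sketch of that hold-back idea (my own illustration of the technique, not exllamav2's implementation):

```python
def emit_with_stop(chunks, stop="User:"):
    """Emit text until `stop` appears, holding back any suffix of the
    buffer that could be the beginning of `stop`."""
    held, out = "", []
    for chunk in chunks:
        held += chunk
        idx = held.find(stop)
        if idx != -1:
            out.append(held[:idx])  # emit only the text before the stop string
            return "".join(out)
        # keep the longest suffix of `held` that is a prefix of `stop`
        keep = 0
        for n in range(min(len(stop) - 1, len(held)), 0, -1):
            if stop.startswith(held[-n:]):
                keep = n
                break
        if keep:
            out.append(held[:-keep])
            held = held[-keep:]
        else:
            out.append(held)
            held = ""
    out.append(held)
    return "".join(out)

# "User:" arrives split across two chunks; neither half leaks to the caller.
text = emit_with_stop(["Python is fun.\nUs", "er: next"])
print(text)  # prints "Python is fun." followed by the newline
```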
Streaming with Banned Strings
```python
input_ids = tokenizer.encode("Write a story about a dog:", add_bos=True)
generator.begin_stream_ex(
    input_ids,
    settings,
    banned_strings=["cat", "feline", "meow"],
)

while True:
    result = generator.stream_ex()
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()
```
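Conceptually, honoring banned_strings requires the generator to reject a candidate continuation when it would complete a banned string and pick a different one instead. A toy sketch of that reject-and-resample idea (illustration only; the library's internal rewind logic differs):

```python
def sample_avoiding(candidates_per_step, banned):
    """At each step, take the first candidate token whose addition does
    not complete a banned string; a toy stand-in for rewind-and-resample."""
    text = ""
    for candidates in candidates_per_step:
        for token in candidates:  # candidates in preference order
            trial = text + token
            if not any(b in trial for b in banned):
                text = trial
                break
        else:
            break  # every candidate banned: stop generating
    return text

# "cat" is the preferred continuation but is banned, so "dog" is chosen.
text = sample_avoiding(
    [["The "], ["cat", "dog"], [" barked", " meowed"]],
    banned=["cat", "meow"],
)
print(text)  # The dog barked
```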