Implementation:Turboderp org Exllamav2 ExLlamaV2StreamingGenerator
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, Streaming, User_Interface |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Concrete tool, provided by exllamav2, for initializing a streaming text generator optimized for single-sequence interactive use.
Description
ExLlamaV2StreamingGenerator is the streaming generation class designed for interactive, single-sequence applications. Its constructor initializes the generator with a model, cache, and tokenizer, and optionally configures speculative decoding with a draft model.
The generator provides three core methods:
- set_stop_conditions(): Configures stop strings and token IDs that terminate generation.
- begin_stream_ex(): Initializes a generation context with the prompt, sampling settings, and optional features (token healing, filters, banned strings).
- stream_ex(): Generates and returns one token (or speculative batch) per call.
The streaming generator maintains internal state between calls, tracking the current sequence position, cached context, and stop condition matching state.
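The stop condition matching state exists because a streaming generator cannot safely emit text that might turn out to be the beginning of a stop string; it must hold such text back until the match resolves one way or the other. The sketch below illustrates that idea in isolation (it is a conceptual illustration, not exllamav2 source; `split_safe_chunk` is a hypothetical helper):

```python
# Illustrative sketch (not exllamav2 source): how a streaming generator can
# hold back buffered text that might be the start of a stop string, emitting
# only the portion that is safe to show the user.
def split_safe_chunk(held: str, stop_strings: list[str]) -> tuple[str, str]:
    """Return (emit_now, keep_held) given buffered text and stop strings."""
    # Find the earliest position whose suffix is a prefix of any stop string;
    # everything before it can be emitted, the suffix stays buffered.
    for start in range(len(held)):
        suffix = held[start:]
        if any(s.startswith(suffix) for s in stop_strings):
            return held[:start], suffix
    return held, ""

# "Hello " is safe to emit; "<|" could begin "<|im_end|>" so it is held back.
emit, held = split_safe_chunk("Hello <|", ["<|im_end|>"])
```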
Usage
Initialize ExLlamaV2StreamingGenerator after loading the model, cache, and tokenizer. Use it for:
- CLI chat applications
- Interactive notebooks
- Single-user web chat backends
- Any application requiring token-by-token output
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/generator/streaming.py
- Lines: L124-192
Signature
```python
class ExLlamaV2StreamingGenerator:

    def __init__(
        self,
        model: ExLlamaV2,
        cache: ExLlamaV2CacheBase,
        tokenizer: ExLlamaV2Tokenizer,
        draft_model: ExLlamaV2 | None = None,
        draft_cache: ExLlamaV2CacheBase | None = None,
        num_speculative_tokens: int = 5,
    ):
        ...
```
Import
```python
from exllamav2.generator import ExLlamaV2StreamingGenerator
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | Loaded model instance with weights on GPU(s) |
| cache | ExLlamaV2CacheBase | Yes | Allocated KV cache (FP16, Q4, Q6, or Q8) |
| tokenizer | ExLlamaV2Tokenizer | Yes | Initialized tokenizer for encoding/decoding |
| draft_model | ExLlamaV2 or None | No (default None) | Smaller model for speculative decoding |
| draft_cache | ExLlamaV2CacheBase or None | No (default None) | Cache for the draft model |
| num_speculative_tokens | int | No (default 5) | Number of tokens to speculate per step when using draft model |
Outputs
| Name | Type | Description |
|---|---|---|
| generator instance | ExLlamaV2StreamingGenerator | Streaming generator ready for begin_stream_ex()/stream_ex() calls |
Key Methods
set_stop_conditions
```python
def set_stop_conditions(self, stop_conditions: list):
    """
    Set stop conditions for generation.
    Items can be token IDs (int) or stop strings (str).
    """
    ...
```
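As a sketch of this contract, a caller can mix token IDs and stop strings freely in one list. The concrete values below are illustrative only (token ID 2 is a common eos_token_id but varies by model; in real code use tokenizer.eos_token_id):

```python
# Illustrative stop-conditions list: ints are token IDs matched exactly,
# strs are matched against the decoded output text. The ID 2 is only an
# example value; use tokenizer.eos_token_id in real code.
stop_conditions = [2, "</s>", "\nUser:"]

# Splitting by type mirrors how such a mixed list can be consumed:
stop_token_ids = {c for c in stop_conditions if isinstance(c, int)}
stop_strings = {c for c in stop_conditions if isinstance(c, str)}
```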
begin_stream_ex / stream_ex
See Implementation:Turboderp_org_Exllamav2_Stream_Ex for detailed documentation of these methods.
Usage Examples
Basic Initialization
```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

# Load the model config and weights
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache is allocated during autosplit load
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
```
With Speculative Decoding
```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

# Main model
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

# Draft model
draft_config = ExLlamaV2Config("/path/to/draft_model")
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, lazy=True)
draft_model.load_autosplit(draft_cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model=draft_model,
    draft_cache=draft_cache,
    num_speculative_tokens=5,
)
```
Complete Streaming Chat Loop
```python
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])

# Sampling settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

prompt = "User: Hello! How are you?\nAssistant:"
input_ids = tokenizer.encode(prompt, add_bos=True)

generator.begin_stream_ex(input_ids, settings)

# Emit one chunk per stream_ex() call until a stop condition fires
while True:
    result = generator.stream_ex()
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()
```
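The loop above runs until a stop condition fires; in practice it is prudent to also cap the number of iterations. The sketch below replaces the real generator with a stub (`fake_stream` and `collect` are hypothetical helpers, not exllamav2 API) to show the shape of a bounded loop that also accumulates the reply, assuming only the documented result fields "chunk" and "eos":

```python
def fake_stream():
    # Stub standing in for repeated generator.stream_ex() calls; each result
    # carries at least a "chunk" (str) and an "eos" (bool), per the contract above.
    for chunk, eos in [("Hel", False), ("lo", False), ("", True)]:
        yield {"chunk": chunk, "eos": eos}

def collect(results, max_chunks=256):
    # Accumulate chunks until eos or a safety cap, mirroring the chat loop.
    text = ""
    for _, result in zip(range(max_chunks), results):
        text += result["chunk"]
        if result["eos"]:
            break
    return text
```

With a real generator, `results` would be successive `generator.stream_ex()` returns; the cap guards against a model that never emits a stop condition.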