
Implementation:Turboderp org Exllamav2 ExLlamaV2StreamingGenerator

From Leeroopedia
Knowledge Sources
Domains Text_Generation, Streaming, User_Interface
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool, provided by exllamav2, for initializing a streaming text generator optimized for single-sequence interactive use.

Description

ExLlamaV2StreamingGenerator is the streaming generation class designed for interactive, single-sequence applications. Its constructor initializes the generator with a model, cache, and tokenizer, and optionally configures speculative decoding with a draft model.

The generator provides three core methods:

  • set_stop_conditions(): Configures stop strings and token IDs that terminate generation.
  • begin_stream_ex(): Initializes a generation context with the prompt, sampling settings, and optional features (token healing, filters, banned strings).
  • stream_ex(): Generates and returns one token (or speculative batch) per call.

The streaming generator maintains internal state between calls, tracking the current sequence position, cached context, and stop condition matching state.
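The stop-condition matching state matters specifically for streaming: when the generated text ends with a partial match of a stop string, the generator must hold that text back rather than emit it, because the next token might complete the match. A minimal self-contained sketch of this hold-back logic (illustrative only, not exllamav2's actual implementation):

def safe_chunk(pending: str, stop_strings: list[str]) -> tuple[str, str]:
    """Split pending text into (emit_now, held_back).

    Text is held back when it ends with a proper prefix of any stop
    string, because the next token might complete the match. (A full
    match would instead end generation entirely.)
    """
    hold = 0
    for stop in stop_strings:
        # Check prefixes of the stop string against the tail, longest first.
        for k in range(min(len(stop), len(pending)), 0, -1):
            if pending.endswith(stop[:k]):
                hold = max(hold, k)
                break
    if hold == 0:
        return pending, ""
    return pending[:-hold], pending[-hold:]

For example, with stop string "\nUser:", the text "Hello\nUs" yields ("Hello", "\nUs"): only "Hello" is streamed to the user until the ambiguity resolves.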

Usage

Initialize ExLlamaV2StreamingGenerator after loading the model. Use it for:

  • CLI chat applications
  • Interactive notebooks
  • Single-user web chat backends
  • Any application requiring token-by-token output

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/generator/streaming.py
  • Lines: L124-192

Signature

class ExLlamaV2StreamingGenerator:

    def __init__(
        self,
        model: ExLlamaV2,
        cache: ExLlamaV2CacheBase,
        tokenizer: ExLlamaV2Tokenizer,
        draft_model: ExLlamaV2 | None = None,
        draft_cache: ExLlamaV2CacheBase | None = None,
        num_speculative_tokens: int = 5,
    ):
        ...

Import

from exllamav2.generator import ExLlamaV2StreamingGenerator

I/O Contract

Inputs

  • model (ExLlamaV2, required): Loaded model instance with weights on GPU(s)
  • cache (ExLlamaV2CacheBase, required): Allocated KV cache (FP16, Q4, Q6, or Q8)
  • tokenizer (ExLlamaV2Tokenizer, required): Initialized tokenizer for encoding/decoding
  • draft_model (ExLlamaV2 or None, default None): Smaller model for speculative decoding
  • draft_cache (ExLlamaV2CacheBase or None, default None): KV cache for the draft model
  • num_speculative_tokens (int, default 5): Number of tokens to speculate per step when using a draft model

Outputs

  • generator instance (ExLlamaV2StreamingGenerator): Streaming generator ready for begin_stream_ex()/stream_ex() calls

Key Methods

set_stop_conditions

def set_stop_conditions(self, stop_conditions: list):
    """
    Set stop conditions for generation.
    Items can be token IDs (int) or stop strings (str).
    """
    ...

begin_stream_ex / stream_ex

See Implementation:Turboderp_org_Exllamav2_Stream_Ex for detailed documentation of these methods.

Usage Examples

Basic Initialization

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

config = ExLlamaV2Config("/path/to/model")
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

With Speculative Decoding

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

# Main model
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

# Draft model
draft_config = ExLlamaV2Config("/path/to/draft_model")
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, lazy=True)
draft_model.load_autosplit(draft_cache)

tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model=draft_model,
    draft_cache=draft_cache,
    num_speculative_tokens=5,
)

Complete Streaming Chat Loop

from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

prompt = "User: Hello! How are you?\nAssistant:"
input_ids = tokenizer.encode(prompt, add_bos=True)

generator.begin_stream_ex(input_ids, settings)

# Cap output length so the loop terminates even if no stop condition fires.
max_new_tokens = 512
generated = 0
while generated < max_new_tokens:
    result = generator.stream_ex()
    print(result["chunk"], end="", flush=True)
    generated += 1
    if result["eos"]:
        break
print()

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
