
Implementation:Turboderp org Exllamav2 ExLlamaV2StreamingGenerator

From Leeroopedia
Knowledge Sources
Domains Text_Generation, Streaming, User_Interface
Last Updated 2026-02-15 00:00 GMT

Overview

Concrete tool, provided by exllamav2, for initializing a streaming text generator optimized for single-sequence interactive use.

Description

ExLlamaV2StreamingGenerator is the streaming generation class designed for interactive, single-sequence applications. Its constructor initializes the generator with a model, cache, and tokenizer, and optionally configures speculative decoding with a draft model.

The generator provides three core methods:

  • set_stop_conditions(): Configures stop strings and token IDs that terminate generation.
  • begin_stream_ex(): Initializes a generation context with the prompt, sampling settings, and optional features (token healing, filters, banned strings).
  • stream_ex(): Generates and returns one token (or speculative batch) per call.

The streaming generator maintains internal state between calls, tracking the current sequence position, cached context, and stop condition matching state.
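The stop-condition matching state matters specifically for streaming: when the generated text ends with a partial match of a stop string, the generator must hold that text back rather than emit it, because the next token might complete the match. A minimal self-contained sketch of this hold-back logic (illustrative only, not exllamav2's actual implementation):

def safe_chunk(pending: str, stop_strings: list[str]) -> tuple[str, str]:
    """Split pending text into (emit_now, held_back).

    Text is held back when it ends with a proper prefix of any stop
    string, because the next token might complete the match. (A full
    match would instead end generation entirely.)
    """
    hold = 0
    for stop in stop_strings:
        # Check prefixes of the stop string against the tail, longest first.
        for k in range(min(len(stop), len(pending)), 0, -1):
            if pending.endswith(stop[:k]):
                hold = max(hold, k)
                break
    if hold == 0:
        return pending, ""
    return pending[:-hold], pending[-hold:]

For example, with stop string "\nUser:", the text "Hello\nUs" yields ("Hello", "\nUs"): only "Hello" is streamed to the user until the ambiguity resolves.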

Usage

Initialize ExLlamaV2StreamingGenerator after loading the model. Use it for:

  • CLI chat applications
  • Interactive notebooks
  • Single-user web chat backends
  • Any application requiring token-by-token output

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/generator/streaming.py
  • Lines: L124-192

Signature

class ExLlamaV2StreamingGenerator:

    def __init__(
        self,
        model: ExLlamaV2,
        cache: ExLlamaV2CacheBase,
        tokenizer: ExLlamaV2Tokenizer,
        draft_model: ExLlamaV2 | None = None,
        draft_cache: ExLlamaV2CacheBase | None = None,
        num_speculative_tokens: int = 5,
    ):
        ...

Import

from exllamav2.generator import ExLlamaV2StreamingGenerator

I/O Contract

Inputs

  • model (ExLlamaV2, required): Loaded model instance with weights on GPU(s)
  • cache (ExLlamaV2CacheBase, required): Allocated KV cache (FP16, Q4, Q6, or Q8)
  • tokenizer (ExLlamaV2Tokenizer, required): Initialized tokenizer for encoding/decoding
  • draft_model (ExLlamaV2 or None, default None): Smaller model for speculative decoding
  • draft_cache (ExLlamaV2CacheBase or None, default None): KV cache for the draft model
  • num_speculative_tokens (int, default 5): Number of tokens to speculate per step when using a draft model

Outputs

  • generator instance (ExLlamaV2StreamingGenerator): Streaming generator ready for begin_stream_ex()/stream_ex() calls

Key Methods

set_stop_conditions

def set_stop_conditions(self, stop_conditions: list):
    """
    Set stop conditions for generation.
    Items can be token IDs (int) or stop strings (str).
    """
    ...

begin_stream_ex / stream_ex

See Implementation:Turboderp_org_Exllamav2_Stream_Ex for detailed documentation of these methods.

Usage Examples

Basic Initialization

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

config = ExLlamaV2Config("/path/to/model")
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)

With Speculative Decoding

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator

# Main model
config = ExLlamaV2Config("/path/to/model")
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

# Draft model
draft_config = ExLlamaV2Config("/path/to/draft_model")
draft_config.prepare()
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, lazy=True)
draft_model.load_autosplit(draft_cache)

tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2StreamingGenerator(
    model, cache, tokenizer,
    draft_model=draft_model,
    draft_cache=draft_cache,
    num_speculative_tokens=5,
)

Complete Streaming Chat Loop

from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

prompt = "User: Hello! How are you?\nAssistant:"
input_ids = tokenizer.encode(prompt, add_bos=True)

generator.begin_stream_ex(input_ids, settings)

# Cap output length so the loop terminates even if no stop condition fires.
max_new_tokens = 512
generated = 0
while generated < max_new_tokens:
    result = generator.stream_ex()
    print(result["chunk"], end="", flush=True)
    generated += 1
    if result["eos"]:
        break
print()

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
