
Implementation:Turboderp org Exllamav2 Stream Ex

From Leeroopedia
Knowledge Sources
Domains Text_Generation, Streaming, NLP
Last Updated 2026-02-15 00:00 GMT

Overview

A concrete tool for token-by-token streaming text generation in exllamav2, built on the two-method begin/stream pattern.

Description

begin_stream_ex() and stream_ex() are the core methods of ExLlamaV2StreamingGenerator that implement the streaming generation loop.

begin_stream_ex() initializes a generation session:

  • Encodes or accepts pre-encoded input token IDs
  • Configures sampling settings, stop conditions, and optional features
  • Performs the prefill forward pass to populate the KV cache with prompt context
  • Sets up token healing, filters, banned strings, and probability tracking if requested

stream_ex() generates one step of output:

  • Runs a forward pass for the current token position
  • Samples the next token according to configured settings
  • Updates the KV cache with new key-value projections
  • Decodes the token to text, handling partial character buffering
  • Checks stop conditions (EOS token, stop strings, max tokens)
  • Returns a result dictionary with the text chunk and metadata
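The stop-string check in this loop has a subtlety: a stop string can span several decoded chunks, so the generator must hold back any tail of text that could still turn into a match. A minimal stdlib sketch of that idea (the logic and function name are illustrative only, not exllamav2's actual implementation):

```python
def check_stop_strings(held: str, chunk: str, stop_strings: list[str]):
    """Return (emit, held, stopped): text safe to emit now, text to hold
    back for the next step, and whether a stop string was matched."""
    text = held + chunk
    for s in stop_strings:
        idx = text.find(s)
        if idx != -1:
            # Stop string found: emit everything before it and end the stream
            return text[:idx], "", True
    # Hold back the longest tail that could still begin a stop string
    max_hold = max((len(s) - 1 for s in stop_strings), default=0)
    hold = 0
    for k in range(1, min(max_hold, len(text)) + 1):
        if any(s.startswith(text[-k:]) for s in stop_strings):
            hold = k
    if hold:
        return text[:-hold], text[-hold:], False
    return text, "", False
```

For example, with stop string "User:", a chunk ending in "Us" is held back until the next step decides whether it completes the stop string or should be emitted.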

Usage

Use this two-method pattern for any streaming generation scenario. Call begin_stream_ex() once, then call stream_ex() in a loop until result["eos"] is True.
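In Python it is natural to wrap this pattern in a generator function so callers can iterate over chunks directly. A sketch of one way to do this (the stream_text name and the max_new_tokens safety cap are our own additions, not part of exllamav2):

```python
def stream_text(generator, input_ids, settings, max_new_tokens=512, **kwargs):
    """Yield decoded text chunks from an ExLlamaV2StreamingGenerator
    until EOS or the safety cap is reached."""
    generator.begin_stream_ex(input_ids, settings, **kwargs)
    for _ in range(max_new_tokens):  # hard cap as a safety net
        result = generator.stream_ex()
        if result["chunk"]:
            yield result["chunk"]
        if result["eos"]:
            break
```

Callers then consume it as `for chunk in stream_text(generator, input_ids, settings): print(chunk, end="", flush=True)`.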

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/generator/streaming.py
  • Lines: L241-396 (begin_stream_ex), L421-481 (stream_ex)

Signature

def begin_stream_ex(
    self,
    input_ids: torch.Tensor,
    gen_settings: ExLlamaV2Sampler.Settings,
    token_healing: bool = False,
    loras: list | None = None,
    input_mask: torch.Tensor | None = None,
    position_offsets: torch.Tensor | None = None,
    return_probabilities: bool = False,
    return_top_tokens: int = 0,
    return_logits: bool = False,
    abort_event: threading.Event | None = None,
    input_embeddings: torch.Tensor | None = None,
    decode_special_tokens: bool = False,
    banned_strings: list[str] | None = None,
    filters: list | None = None,
    filter_prefer_eos: bool = False,
    **kwargs,
):
    ...

def stream_ex(
    self,
    ban_tokens: torch.Tensor | None = None,
    **kwargs,
) -> dict:
    ...

Import

from exllamav2.generator import ExLlamaV2StreamingGenerator

# begin_stream_ex() and stream_ex() are methods on the generator instance

I/O Contract

begin_stream_ex Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| input_ids | torch.Tensor | Yes | Token IDs for the prompt, shape (1, seq_len) |
| gen_settings | ExLlamaV2Sampler.Settings | Yes | Sampling configuration (temperature, top-k, top-p, etc.) |
| token_healing | bool | No (default False) | Rewind last prompt token to avoid tokenization artifacts |
| loras | list or None | No (default None) | LoRA adapters to apply during generation |
| input_mask | torch.Tensor or None | No (default None) | Custom attention mask for the input |
| position_offsets | torch.Tensor or None | No (default None) | Custom position offsets for the input |
| return_probabilities | bool | No (default False) | Include token probabilities in stream_ex results |
| return_top_tokens | int | No (default 0) | Number of top tokens (with probabilities) to include in results; 0 = disabled |
| return_logits | bool | No (default False) | Include raw logits in stream_ex results |
| abort_event | threading.Event or None | No (default None) | Event to signal early termination from another thread |
| input_embeddings | torch.Tensor or None | No (default None) | Custom embeddings to prepend to input (for multimodal) |
| decode_special_tokens | bool | No (default False) | Include special tokens in decoded output text |
| banned_strings | list[str] or None | No (default None) | Strings that must not appear in generated output |
| filters | list or None | No (default None) | Constrained decoding filters (grammar, JSON schema, regex) |
| filter_prefer_eos | bool | No (default False) | Prefer EOS when filters allow it |
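The abort_event parameter enables cooperative cancellation from another thread. A hedged sketch of one way to use it (the helper name and the explicit between-step check are our own additions; per the parameter description, exllamav2 itself uses the event to interrupt the prefill pass):

```python
import threading

def stream_until_aborted(generator, input_ids, settings, abort_event,
                         max_new_tokens=256):
    """Stream chunks, stopping early if abort_event is set elsewhere."""
    # Pass the event through so the prefill pass can also be interrupted
    generator.begin_stream_ex(input_ids, settings, abort_event=abort_event)
    chunks = []
    for _ in range(max_new_tokens):
        if abort_event.is_set():  # cooperative cancellation between steps
            break
        result = generator.stream_ex()
        chunks.append(result["chunk"])
        if result["eos"]:
            break
    return "".join(chunks)
```

Another thread (e.g. handling a client disconnect) calls `abort_event.set()` to end the stream at the next step boundary.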

stream_ex Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| ban_tokens | torch.Tensor or None | No (default None) | Token IDs to ban for this step only |

stream_ex Outputs

| Name | Type | Description |
|------|------|-------------|
| result | dict | Dictionary containing generation results for this step |
| result["chunk"] | str | Decoded text fragment generated in this step (may be empty for partial characters) |
| result["eos"] | bool | True if generation has ended (EOS token, stop condition, or max tokens) |
| result["chunk_token_ids"] | torch.Tensor | Token ID(s) generated in this step, shape (1, n) |
| result["probs"] | torch.Tensor | (Optional) Probability of selected token(s); only if return_probabilities=True |
| result["top_tokens"] | list[tuple] | (Optional) Top-k (token_id, probability) pairs; only if return_top_tokens > 0 |
| result["logits"] | torch.Tensor | (Optional) Raw logits over vocabulary; only if return_logits=True |
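The empty-chunk case for result["chunk"] arises because a single token can carry only part of a multi-byte character, so the decoder must buffer bytes until the character completes. Python's incremental UTF-8 decoder exhibits the same behavior:

```python
import codecs

# "é" is two UTF-8 bytes; when they arrive in separate pieces, the
# incomplete middle piece decodes to an empty string, just as a
# mid-character token yields an empty chunk.
dec = codecs.getincrementaldecoder("utf-8")()
pieces = [b"caf", b"\xc3", b"\xa9"]
chunks = [dec.decode(p) for p in pieces]
print(chunks)  # ['caf', '', 'é']
```

This is why a streaming loop should append every chunk rather than treat an empty chunk as the end of generation; only result["eos"] signals completion.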

Usage Examples

Basic Streaming Loop

from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

input_ids = tokenizer.encode("Once upon a time", add_bos=True)

generator.begin_stream_ex(input_ids, settings)

output = ""
while True:
    result = generator.stream_ex()
    output += result["chunk"]
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()

Streaming with Probabilities

input_ids = tokenizer.encode("The capital of France is", add_bos=True)

generator.begin_stream_ex(
    input_ids,
    settings,
    return_probabilities=True,
    return_top_tokens=5,
)

while True:
    result = generator.stream_ex()

    if result["chunk"]:
        prob = result["probs"][0].item() if "probs" in result else 0
        print(f"{result['chunk']} (p={prob:.3f})", end="", flush=True)

    if "top_tokens" in result:
        for token_id, prob in result["top_tokens"][:3]:
            token_str = tokenizer.decode(torch.tensor([[token_id]]))
            print(f"  [{token_str}: {prob:.3f}]", end="")

    if result["eos"]:
        break
print()

Streaming with Token Healing and Stop Conditions

generator.set_stop_conditions([
    tokenizer.eos_token_id,
    "User:",
    "\n\n",
])

input_ids = tokenizer.encode(
    "User: What is Python?\nAssistant:",
    add_bos=True,
)

generator.begin_stream_ex(
    input_ids,
    settings,
    token_healing=True,
)

response = ""
while True:
    result = generator.stream_ex()
    response += result["chunk"]
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()

Streaming with Banned Strings

input_ids = tokenizer.encode("Write a story about a dog:", add_bos=True)

generator.begin_stream_ex(
    input_ids,
    settings,
    banned_strings=["cat", "feline", "meow"],
)

while True:
    result = generator.stream_ex()
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()
