
Implementation:Turboderp org Exllamav2 Stream Ex

From Leeroopedia
Knowledge Sources
Domains Text_Generation, Streaming, NLP
Last Updated 2026-02-15 00:00 GMT

Overview

A concrete tool for token-by-token streaming text generation in exllamav2, built on the two-method begin/stream pattern.

Description

begin_stream_ex() and stream_ex() are the core methods of ExLlamaV2StreamingGenerator that implement the streaming generation loop.

begin_stream_ex() initializes a generation session:

  • Encodes or accepts pre-encoded input token IDs
  • Configures sampling settings, stop conditions, and optional features
  • Performs the prefill forward pass to populate the KV cache with prompt context
  • Sets up token healing, filters, banned strings, and probability tracking if requested

stream_ex() generates one step of output:

  • Runs a forward pass for the current token position
  • Samples the next token according to configured settings
  • Updates the KV cache with new key-value projections
  • Decodes the token to text, handling partial character buffering
  • Checks stop conditions (EOS token, stop strings, max tokens)
  • Returns a result dictionary with the text chunk and metadata
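The stop-string check in this loop has a subtlety: a stop string can span several decoded chunks, so the generator must hold back any tail of text that could still turn into a match. A minimal stdlib sketch of that idea (the logic and function name are illustrative only, not exllamav2's actual implementation):

```python
def check_stop_strings(held: str, chunk: str, stop_strings: list[str]):
    """Return (emit, held, stopped): text safe to emit now, text to hold
    back for the next step, and whether a stop string was matched."""
    text = held + chunk
    for s in stop_strings:
        idx = text.find(s)
        if idx != -1:
            # Stop string found: emit everything before it and end the stream
            return text[:idx], "", True
    # Hold back the longest tail that could still begin a stop string
    max_hold = max((len(s) - 1 for s in stop_strings), default=0)
    hold = 0
    for k in range(1, min(max_hold, len(text)) + 1):
        if any(s.startswith(text[-k:]) for s in stop_strings):
            hold = k
    if hold:
        return text[:-hold], text[-hold:], False
    return text, "", False
```

For example, with stop string "User:", a chunk ending in "Us" is held back until the next step decides whether it completes the stop string or should be emitted.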

Usage

Use this two-method pattern for any streaming generation scenario. Call begin_stream_ex() once, then call stream_ex() in a loop until result["eos"] is True.
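In Python it is natural to wrap this pattern in a generator function so callers can iterate over chunks directly. A sketch of one way to do this (the stream_text name and the max_new_tokens safety cap are our own additions, not part of exllamav2):

```python
def stream_text(generator, input_ids, settings, max_new_tokens=512, **kwargs):
    """Yield decoded text chunks from an ExLlamaV2StreamingGenerator
    until EOS or the safety cap is reached."""
    generator.begin_stream_ex(input_ids, settings, **kwargs)
    for _ in range(max_new_tokens):  # hard cap as a safety net
        result = generator.stream_ex()
        if result["chunk"]:
            yield result["chunk"]
        if result["eos"]:
            break
```

Callers then consume it as `for chunk in stream_text(generator, input_ids, settings): print(chunk, end="", flush=True)`.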

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/generator/streaming.py
  • Lines: L241-396 (begin_stream_ex), L421-481 (stream_ex)

Signature

def begin_stream_ex(
    self,
    input_ids: torch.Tensor,
    gen_settings: ExLlamaV2Sampler.Settings,
    token_healing: bool = False,
    loras: list | None = None,
    input_mask: torch.Tensor | None = None,
    position_offsets: torch.Tensor | None = None,
    return_probabilities: bool = False,
    return_top_tokens: int = 0,
    return_logits: bool = False,
    abort_event: threading.Event | None = None,
    input_embeddings: torch.Tensor | None = None,
    decode_special_tokens: bool = False,
    banned_strings: list[str] | None = None,
    filters: list | None = None,
    filter_prefer_eos: bool = False,
    **kwargs,
):
    ...

def stream_ex(
    self,
    ban_tokens: torch.Tensor | None = None,
    **kwargs,
) -> dict:
    ...

Import

from exllamav2.generator import ExLlamaV2StreamingGenerator

# begin_stream_ex() and stream_ex() are methods on the generator instance

I/O Contract

begin_stream_ex Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| input_ids | torch.Tensor | Yes | Token IDs for the prompt, shape (1, seq_len) |
| gen_settings | ExLlamaV2Sampler.Settings | Yes | Sampling configuration (temperature, top-k, top-p, etc.) |
| token_healing | bool | No (default False) | Rewind last prompt token to avoid tokenization artifacts |
| loras | list or None | No (default None) | LoRA adapters to apply during generation |
| input_mask | torch.Tensor or None | No (default None) | Custom attention mask for the input |
| position_offsets | torch.Tensor or None | No (default None) | Custom position offsets for the input |
| return_probabilities | bool | No (default False) | Include token probabilities in stream_ex results |
| return_top_tokens | int | No (default 0) | Number of top tokens (with probabilities) to include in results; 0 = disabled |
| return_logits | bool | No (default False) | Include raw logits in stream_ex results |
| abort_event | threading.Event or None | No (default None) | Event to signal early termination from another thread |
| input_embeddings | torch.Tensor or None | No (default None) | Custom embeddings to prepend to input (for multimodal) |
| decode_special_tokens | bool | No (default False) | Include special tokens in decoded output text |
| banned_strings | list[str] or None | No (default None) | Strings that must not appear in generated output |
| filters | list or None | No (default None) | Constrained decoding filters (grammar, JSON schema, regex) |
| filter_prefer_eos | bool | No (default False) | Prefer EOS when filters allow it |
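The abort_event parameter enables cooperative cancellation from another thread. A hedged sketch of one way to use it (the helper name and the explicit between-step check are our own additions; per the parameter description, exllamav2 itself uses the event to interrupt the prefill pass):

```python
import threading

def stream_until_aborted(generator, input_ids, settings, abort_event,
                         max_new_tokens=256):
    """Stream chunks, stopping early if abort_event is set elsewhere."""
    # Pass the event through so the prefill pass can also be interrupted
    generator.begin_stream_ex(input_ids, settings, abort_event=abort_event)
    chunks = []
    for _ in range(max_new_tokens):
        if abort_event.is_set():  # cooperative cancellation between steps
            break
        result = generator.stream_ex()
        chunks.append(result["chunk"])
        if result["eos"]:
            break
    return "".join(chunks)
```

Another thread (e.g. handling a client disconnect) calls `abort_event.set()` to end the stream at the next step boundary.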

stream_ex Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| ban_tokens | torch.Tensor or None | No (default None) | Token IDs to ban for this step only |

stream_ex Outputs

| Name | Type | Description |
|------|------|-------------|
| result | dict | Dictionary containing generation results for this step |
| result["chunk"] | str | Decoded text fragment generated in this step (may be empty for partial characters) |
| result["eos"] | bool | True if generation has ended (EOS token, stop condition, or max tokens) |
| result["chunk_token_ids"] | torch.Tensor | Token ID(s) generated in this step, shape (1, n) |
| result["probs"] | torch.Tensor | (Optional) Probability of selected token(s); only if return_probabilities=True |
| result["top_tokens"] | list[tuple] | (Optional) Top-k (token_id, probability) pairs; only if return_top_tokens > 0 |
| result["logits"] | torch.Tensor | (Optional) Raw logits over vocabulary; only if return_logits=True |
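The empty-chunk case for result["chunk"] arises because a single token can carry only part of a multi-byte character, so the decoder must buffer bytes until the character completes. Python's incremental UTF-8 decoder exhibits the same behavior:

```python
import codecs

# "é" is two UTF-8 bytes; when they arrive in separate pieces, the
# incomplete middle piece decodes to an empty string, just as a
# mid-character token yields an empty chunk.
dec = codecs.getincrementaldecoder("utf-8")()
pieces = [b"caf", b"\xc3", b"\xa9"]
chunks = [dec.decode(p) for p in pieces]
print(chunks)  # ['caf', '', 'é']
```

This is why a streaming loop should append every chunk rather than treat an empty chunk as the end of generation; only result["eos"] signals completion.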

Usage Examples

Basic Streaming Loop

from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

generator = ExLlamaV2StreamingGenerator(model, cache, tokenizer)
generator.set_stop_conditions([tokenizer.eos_token_id])

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

input_ids = tokenizer.encode("Once upon a time", add_bos=True)

generator.begin_stream_ex(input_ids, settings)

output = ""
while True:
    result = generator.stream_ex()
    output += result["chunk"]
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()

Streaming with Probabilities

input_ids = tokenizer.encode("The capital of France is", add_bos=True)

generator.begin_stream_ex(
    input_ids,
    settings,
    return_probabilities=True,
    return_top_tokens=5,
)

while True:
    result = generator.stream_ex()

    if result["chunk"]:
        prob = result["probs"][0].item() if "probs" in result else 0
        print(f"{result['chunk']} (p={prob:.3f})", end="", flush=True)

    if "top_tokens" in result:
        for token_id, prob in result["top_tokens"][:3]:
            token_str = tokenizer.decode(torch.tensor([[token_id]]))
            print(f"  [{token_str}: {prob:.3f}]", end="")

    if result["eos"]:
        break
print()

Streaming with Token Healing and Stop Conditions

generator.set_stop_conditions([
    tokenizer.eos_token_id,
    "User:",
    "\n\n",
])

input_ids = tokenizer.encode(
    "User: What is Python?\nAssistant:",
    add_bos=True,
)

generator.begin_stream_ex(
    input_ids,
    settings,
    token_healing=True,
)

response = ""
while True:
    result = generator.stream_ex()
    response += result["chunk"]
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()

Streaming with Banned Strings

input_ids = tokenizer.encode("Write a story about a dog:", add_bos=True)

generator.begin_stream_ex(
    input_ids,
    settings,
    banned_strings=["cat", "feline", "meow"],
)

while True:
    result = generator.stream_ex()
    print(result["chunk"], end="", flush=True)
    if result["eos"]:
        break
print()
