Implementation:Turboderp org Exllamav2 ExLlamaV2DynamicGenerator Generate
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, NLP, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete tool for high-level, blocking text generation using exllamav2's dynamic-batching generator.
Description
generate() is the primary text generation method on ExLlamaV2DynamicGenerator. It provides a simple blocking API that handles prompt encoding, job scheduling, iterative decoding, and result collection. The method accepts one or more text prompts and returns the generated completions as strings.
Internally, generate() creates generation jobs, enqueues them in the dynamic generator's paged-attention system, and iterates until all jobs complete. For batched prompts, all generations run concurrently within the same iteration loop, maximizing GPU utilization.
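The enqueue-and-iterate pattern described above can be sketched in plain Python. `MiniGenerator` below is a hypothetical toy, not the exllamav2 implementation (the real scheduler batches jobs through paged attention on the GPU); it only illustrates how all queued jobs advance one token per iteration within the same loop until each exhausts its budget.

```python
from collections import deque

# Toy sketch of the enqueue/iterate pattern generate() uses internally.
# MiniGenerator is hypothetical; real decoding samples tokens on the GPU.
class MiniGenerator:
    def __init__(self):
        self.jobs = deque()

    def enqueue(self, job_id, tokens, max_new_tokens):
        self.jobs.append({"id": job_id, "tokens": list(tokens),
                          "budget": max_new_tokens})

    def iterate(self):
        # One decoding step: every active job emits one token "concurrently".
        finished = []
        for job in list(self.jobs):
            job["tokens"].append(f"tok{len(job['tokens'])}")  # fake sampling
            job["budget"] -= 1
            if job["budget"] == 0:
                finished.append(job)
                self.jobs.remove(job)
        return finished

    def generate_all(self):
        # Blocking loop: iterate until every queued job completes.
        results = {}
        while self.jobs:
            for job in self.iterate():
                results[job["id"]] = job["tokens"]
        return results

gen = MiniGenerator()
gen.enqueue("a", ["A"], 2)
gen.enqueue("b", ["B"], 3)
print(gen.generate_all())  # both jobs advance in the same loop
```

Because every job shares one iteration loop, a batch of prompts finishes in roughly the time of the longest job rather than the sum of all jobs.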
The method supports the full range of generation controls: sampling settings, stop conditions, token healing, banned strings, constrained decoding filters, and classifier-free guidance.
Usage
Use generate() for any blocking text generation task:
- Single prompt completion
- Batch generation of multiple prompts
- Constrained generation with grammar/JSON filters
- Generation with custom sampling parameters
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/generator/dynamic.py
- Lines: L541-627
Signature
def generate(
self,
prompt: str | list[str],
max_new_tokens: int,
min_new_tokens: int = 0,
seed: int | None = None,
gen_settings: ExLlamaV2Sampler.Settings | None = None,
token_healing: bool = False,
encode_special_tokens: bool = False,
decode_special_tokens: bool = False,
stop_conditions: list | None = None,
add_bos: bool = False,
abort_event: threading.Event | None = None,
completion_only: bool = False,
filters: list | None = None,
filter_prefer_eos: bool = False,
return_last_results: bool = False,
embeddings: list | None = None,
banned_strings: list[str] | None = None,
**kwargs,
) -> str | list[str]:
...
Import
from exllamav2.generator import ExLlamaV2DynamicGenerator
# generate() is a method on the ExLlamaV2DynamicGenerator instance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str or list[str] | Yes | Text prompt(s) to generate from. List input produces batched generation. |
| max_new_tokens | int | Yes | Maximum number of tokens to generate per prompt |
| min_new_tokens | int | No (default 0) | Minimum tokens to generate before allowing stop conditions |
| seed | int or None | No (default None) | Random seed for reproducible sampling |
| gen_settings | ExLlamaV2Sampler.Settings | No (default None) | Sampling configuration (temperature, top-k, top-p, etc.); None uses defaults |
| token_healing | bool | No (default False) | Enable token healing at prompt-completion boundary |
| encode_special_tokens | bool | No (default False) | Encode special tokens appearing in the prompt text (e.g. <\|im_start\|>) as token IDs rather than literal text |
| decode_special_tokens | bool | No (default False) | Include special tokens in output text |
| stop_conditions | list or None | No (default None) | List of stop strings or token IDs that terminate generation |
| add_bos | bool | No (default False) | Prepend beginning-of-sequence token to prompt |
| abort_event | threading.Event or None | No (default None) | Event to signal early termination from another thread |
| completion_only | bool | No (default False) | Return only the generated text, excluding the prompt |
| filters | list or None | No (default None) | Constrained decoding filters (grammar, JSON schema, regex) |
| filter_prefer_eos | bool | No (default False) | Prefer EOS token when filters allow it |
| return_last_results | bool | No (default False) | Return detailed result objects instead of plain strings |
| banned_strings | list[str] or None | No (default None) | Strings that must not appear in the generated output |
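The interplay of `stop_conditions` and `completion_only` can be illustrated with a small pure-Python helper. `apply_stops` below is a hypothetical stand-in for post-processing, not the exllamav2 implementation (which matches stop conditions incrementally during decoding): output is truncated at the earliest stop string (the stop string itself is excluded), and `completion_only` strips the prompt prefix.

```python
# Toy illustration of stop-string and completion_only semantics
# (hypothetical helper, not the exllamav2 implementation).
def apply_stops(prompt: str, full_text: str,
                stop_strings: list[str], completion_only: bool) -> str:
    completion = full_text[len(prompt):]
    # Truncate at the earliest stop string; the stop string is excluded.
    for stop in stop_strings:
        idx = completion.find(stop)
        if idx != -1:
            completion = completion[:idx]
    return completion if completion_only else prompt + completion

full = "Q: Hi\nA: Hello there.\nQ: Bye"
print(apply_stops("Q: Hi\nA:", full, ["\nQ:"], completion_only=True))
# → " Hello there."
```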
Outputs
| Name | Type | Description |
|---|---|---|
| result | str | Generated text (when prompt is a single string) |
| results | list[str] | List of generated texts (when prompt is a list of strings) |
| results (detailed) | list[dict] | Detailed result objects when return_last_results=True, containing text, token counts, stop reasons, etc. |
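The str-in/str-out versus list-in/list-out contract means downstream code may want a uniform shape. The sketch below uses `fake_generate`, a hypothetical stand-in that mimics only the return-shape behavior of `generate()` (no actual model is involved):

```python
# fake_generate is a toy stand-in mimicking generate()'s return-shape
# contract: a single string in gives a string out, a list gives a list.
def fake_generate(prompt):
    if isinstance(prompt, str):
        return prompt.upper()       # stand-in for one generated completion
    return [p.upper() for p in prompt]

def as_list(result):
    # Normalize to a list for uniform downstream handling.
    return [result] if isinstance(result, str) else result

print(as_list(fake_generate("hi")))        # → ['HI']
print(as_list(fake_generate(["a", "b"])))  # → ['A', 'B']
```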
Usage Examples
Simple Generation
from exllamav2.generator import ExLlamaV2DynamicGenerator
# Assuming generator is already initialized
output = generator.generate(
prompt="Once upon a time",
max_new_tokens=200,
add_bos=True,
)
print(output)
Generation with Sampling Settings
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler
# Assuming generator and tokenizer are already initialized
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9
settings.top_k = 50
settings.token_repetition_penalty = 1.05
output = generator.generate(
prompt="Explain quantum computing:",
max_new_tokens=500,
gen_settings=settings,
stop_conditions=[tokenizer.eos_token_id],
add_bos=True,
)
print(output)
Batched Generation
from exllamav2.generator import ExLlamaV2Sampler
# Assuming generator is already initialized
prompts = [
"Write a haiku about mountains:",
"Write a haiku about the ocean:",
"Write a haiku about the stars:",
]
outputs = generator.generate(
prompt=prompts,
max_new_tokens=50,
gen_settings=ExLlamaV2Sampler.Settings.greedy(),
add_bos=True,
)
for prompt, output in zip(prompts, outputs):
print(f"{prompt}\n{output}\n")
Generation with Stop Conditions
# Assuming generator and tokenizer are already initialized
output = generator.generate(
prompt="User: What is Python?\nAssistant:",
max_new_tokens=300,
stop_conditions=[
tokenizer.eos_token_id,
"User:",
"\n\n",
],
token_healing=True,
add_bos=True,
completion_only=True,
)
print(output)
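Cooperative Abort with abort_event

The `abort_event` parameter lets another thread cancel a blocking call. The sketch below is a hypothetical stand-in: `blocking_generate` is not the exllamav2 method, it only mimics the documented behavior of checking a `threading.Event` each iteration and returning early once it is set.

```python
import threading
import time

# Toy sketch of the abort_event pattern: the blocking loop checks the
# event each step and terminates early when it is set.
# blocking_generate is a stand-in, not the exllamav2 implementation.
def blocking_generate(max_steps: int, abort_event: threading.Event) -> list[str]:
    tokens = []
    for i in range(max_steps):
        if abort_event.is_set():
            break                    # early termination, like generate()
        tokens.append(f"tok{i}")
        time.sleep(0.01)             # simulate one decoding step
    return tokens

abort = threading.Event()
# A watchdog thread signals the abort after ~50 ms.
threading.Timer(0.05, abort.set).start()
out = blocking_generate(1000, abort)
print(len(out) < 1000)  # aborted well before the token budget
```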
Related Pages
Implements Principle
Requires Environment
- Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
- Environment:Turboderp_org_Exllamav2_Flash_Attention_Backend