
Implementation:Turboderp org Exllamav2 ExLlamaV2DynamicGenerator Generate

From Leeroopedia
Knowledge Sources
Domains Text_Generation, NLP, Deep_Learning
Last Updated 2026-02-15 00:00 GMT

Overview

A concrete tool for high-level, blocking text generation using the dynamic batching generator provided by exllamav2.

Description

generate() is the primary text generation method on ExLlamaV2DynamicGenerator. It provides a simple blocking API that handles prompt encoding, job scheduling, iterative decoding, and result collection. The method accepts one or more text prompts and returns the generated completions as strings.

Internally, generate() creates generation jobs, enqueues them in the dynamic generator's paged-attention system, and iterates until all jobs complete. For batched prompts, all generations run concurrently within the same iteration loop, maximizing GPU utilization.
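The blocking loop described above can be pictured with a minimal, self-contained sketch. This is a toy model of the control flow only (the names Job, step, and generate_blocking are illustrative, not the real exllamav2 internals, which use paged attention and real token decoding):

```python
# Toy sketch of a blocking generate() loop: every job advances together on
# each iteration, and the call returns only once all jobs are finished.
# Names here are illustrative, not the real exllamav2 API.

class Job:
    def __init__(self, prompt, max_new_tokens):
        self.prompt = prompt
        self.remaining = max_new_tokens
        self.output = ""

    def step(self):
        self.output += "x"        # stand-in for one decoded token
        self.remaining -= 1

    @property
    def done(self):
        return self.remaining == 0

def generate_blocking(prompts, max_new_tokens):
    jobs = [Job(p, max_new_tokens) for p in prompts]   # enqueue all jobs
    while not all(j.done for j in jobs):               # iterate until complete
        for j in jobs:                                 # batched: one step per job per pass
            if not j.done:
                j.step()
    return [j.output for j in jobs]                    # collect results in order

print(generate_blocking(["a", "b"], 3))  # → ['xxx', 'xxx']
```

The point of the concurrent inner loop is that a short job finishing early does not block longer jobs; in the real generator this is what keeps the GPU saturated across a batch.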

The method supports the full range of generation controls: sampling settings, stop conditions, token healing, banned strings, constrained decoding filters, and classifier-free guidance.

Usage

Use generate() for any blocking text generation task:

  • Single prompt completion
  • Batch generation of multiple prompts
  • Constrained generation with grammar/JSON filters
  • Generation with custom sampling parameters
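To make the constrained-generation idea concrete, here is a self-contained toy in the spirit of a prefix filter: at each step, candidate characters are restricted to those that keep the output a prefix of an allowed string. This is only loosely analogous to exllamav2's token-level filter mechanism; the function names are hypothetical:

```python
# Toy prefix-style constraint: only allow next characters that keep the
# output a prefix of one of the allowed strings. Loosely analogous in
# spirit to exllamav2's constrained-decoding filters; names are hypothetical.

def allowed_next(output, allowed_strings):
    # Set of characters that legally extend the current output
    return {s[len(output)] for s in allowed_strings
            if s.startswith(output) and len(s) > len(output)}

def constrained_generate(allowed_strings):
    output = ""
    while True:
        choices = allowed_next(output, allowed_strings)
        if not choices:
            break                       # no legal continuation: stop
        output += sorted(choices)[0]    # greedy: pick the first legal char
    return output

print(constrained_generate(["yes", "yellow"]))  # → yellow
```

A real filter works the same way at the token level: it masks the sampler's logits so only continuations accepted by the grammar, schema, or regex remain sampleable.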

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/generator/dynamic.py
  • Lines: L541-627

Signature

def generate(
    self,
    prompt: str | list[str],
    max_new_tokens: int,
    min_new_tokens: int = 0,
    seed: int | None = None,
    gen_settings: ExLlamaV2Sampler.Settings | None = None,
    token_healing: bool = False,
    encode_special_tokens: bool = False,
    decode_special_tokens: bool = False,
    stop_conditions: list | None = None,
    add_bos: bool = False,
    abort_event: threading.Event | None = None,
    completion_only: bool = False,
    filters: list | None = None,
    filter_prefer_eos: bool = False,
    return_last_results: bool = False,
    embeddings: list | None = None,
    banned_strings: list[str] | None = None,
    **kwargs,
) -> str | list[str]:
    ...

Import

from exllamav2.generator import ExLlamaV2DynamicGenerator

# generate() is a method on the ExLlamaV2DynamicGenerator instance

I/O Contract

Inputs

Name Type Required Description
prompt str or list[str] Yes Text prompt(s) to generate from. List input produces batched generation.
max_new_tokens int Yes Maximum number of tokens to generate per prompt
min_new_tokens int No (default 0) Minimum tokens to generate before allowing stop conditions
seed int or None No (default None) Random seed for reproducible sampling
gen_settings ExLlamaV2Sampler.Settings No (default None) Sampling configuration (temperature, top-k, top-p, etc.); None uses defaults
token_healing bool No (default False) Enable token healing at prompt-completion boundary
encode_special_tokens bool No (default False) Encode special tokens appearing in the prompt text (e.g. <|im_start|>)
decode_special_tokens bool No (default False) Include special tokens in output text
stop_conditions list or None No (default None) List of stop strings or token IDs that terminate generation
add_bos bool No (default False) Prepend beginning-of-sequence token to prompt
abort_event threading.Event or None No (default None) Event to signal early termination from another thread
completion_only bool No (default False) Return only the generated text, excluding the prompt
filters list or None No (default None) Constrained decoding filters (grammar, JSON schema, regex)
filter_prefer_eos bool No (default False) Prefer EOS token when filters allow it
return_last_results bool No (default False) Return detailed result objects instead of plain strings
banned_strings list[str] or None No (default None) Strings that must not appear in the generated output

Outputs

Name Type Description
result str Generated text (when prompt is a single string)
results list[str] List of generated texts (when prompt is a list of strings)
results (detailed) list[dict] Detailed result objects when return_last_results=True, containing text, token counts, stop reasons, etc.
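The single-versus-batched return shape in the table above can be sketched with a toy wrapper (fake_generate is hypothetical and only mimics the assumed shape contract, not real generation):

```python
# Toy illustration of generate()'s return-shape contract:
# str in -> str out, list[str] in -> list[str] out.

def fake_generate(prompt, max_new_tokens):
    def one(p):
        return p + " ..."                 # stand-in for a real completion
    if isinstance(prompt, str):
        return one(prompt)                # single prompt -> single string
    return [one(p) for p in prompt]       # list of prompts -> list of strings

print(type(fake_generate("hi", 8)).__name__)          # → str
print(type(fake_generate(["hi", "yo"], 8)).__name__)  # → list
```

Callers that sometimes pass one prompt and sometimes a batch should branch on the input type rather than on the result, so the handling stays explicit.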

Usage Examples

Simple Generation

from exllamav2.generator import ExLlamaV2DynamicGenerator

# Assuming generator is already initialized
output = generator.generate(
    prompt="Once upon a time",
    max_new_tokens=200,
    add_bos=True,
)
print(output)

Generation with Sampling Settings

from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9
settings.top_k = 50
settings.token_repetition_penalty = 1.05

output = generator.generate(
    prompt="Explain quantum computing:",
    max_new_tokens=500,
    gen_settings=settings,
    stop_conditions=[tokenizer.eos_token_id],
    add_bos=True,
)
print(output)

Batched Generation

prompts = [
    "Write a haiku about mountains:",
    "Write a haiku about the ocean:",
    "Write a haiku about the stars:",
]

outputs = generator.generate(
    prompt=prompts,
    max_new_tokens=50,
    gen_settings=ExLlamaV2Sampler.Settings.greedy(),
    add_bos=True,
)

for prompt, output in zip(prompts, outputs):
    print(f"{prompt}\n{output}\n")

Generation with Stop Conditions

output = generator.generate(
    prompt="User: What is Python?\nAssistant:",
    max_new_tokens=300,
    stop_conditions=[
        tokenizer.eos_token_id,
        "User:",
        "\n\n",
    ],
    token_healing=True,
    add_bos=True,
    completion_only=True,
)
print(output)
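Early Termination with abort_event

The abort_event parameter can be illustrated with a self-contained sketch of its assumed semantics: the generation loop checks a threading.Event between decoding steps and returns early once another thread sets it. This toy (slow_generate is hypothetical) mimics, but is not, the real generator's behavior:

```python
import threading
import time

# Toy sketch of abort_event semantics: the loop checks the event between
# steps and stops early when another thread sets it.

def slow_generate(max_new_tokens, abort_event):
    output = []
    for _ in range(max_new_tokens):
        if abort_event.is_set():      # cooperative early termination
            break
        output.append("tok")
        time.sleep(0.01)              # stand-in for one decoding step
    return " ".join(output)

abort = threading.Event()
timer = threading.Timer(0.05, abort.set)   # signal abort from another thread
timer.start()
result = slow_generate(1000, abort)
timer.cancel()
print(len(result.split()) < 1000)  # → True (stopped well before the limit)
```

Because the check is cooperative, the abort takes effect at the next iteration boundary rather than instantly; any tokens already generated are still returned.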

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
