Implementation:Turboderp org Exllamav2 ExLlamaV2DynamicGenerator Generate
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, NLP, Deep_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete tool for high-level, blocking text generation using exllamav2's dynamic-batching generator.
Description
generate() is the primary text generation method on ExLlamaV2DynamicGenerator. It provides a simple blocking API that handles prompt encoding, job scheduling, iterative decoding, and result collection. The method accepts one or more text prompts and returns the generated completions as strings.
Internally, generate() creates generation jobs, enqueues them in the dynamic generator's paged-attention system, and iterates until all jobs complete. For batched prompts, all generations run concurrently within the same iteration loop, maximizing GPU utilization.
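The enqueue-and-iterate pattern described above can be sketched in plain Python. `MiniGenerator` below is a hypothetical toy, not the exllamav2 implementation (the real scheduler batches jobs through paged attention on the GPU); it only illustrates how all queued jobs advance one token per iteration within the same loop until each exhausts its budget.

```python
from collections import deque

# Toy sketch of the enqueue/iterate pattern generate() uses internally.
# MiniGenerator is hypothetical; real decoding samples tokens on the GPU.
class MiniGenerator:
    def __init__(self):
        self.jobs = deque()

    def enqueue(self, job_id, tokens, max_new_tokens):
        self.jobs.append({"id": job_id, "tokens": list(tokens),
                          "budget": max_new_tokens})

    def iterate(self):
        # One decoding step: every active job emits one token "concurrently".
        finished = []
        for job in list(self.jobs):
            job["tokens"].append(f"tok{len(job['tokens'])}")  # fake sampling
            job["budget"] -= 1
            if job["budget"] == 0:
                finished.append(job)
                self.jobs.remove(job)
        return finished

    def generate_all(self):
        # Blocking loop: iterate until every queued job completes.
        results = {}
        while self.jobs:
            for job in self.iterate():
                results[job["id"]] = job["tokens"]
        return results

gen = MiniGenerator()
gen.enqueue("a", ["A"], 2)
gen.enqueue("b", ["B"], 3)
print(gen.generate_all())  # both jobs advance in the same loop
```

Because every job shares one iteration loop, a batch of prompts finishes in roughly the time of the longest job rather than the sum of all jobs.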
The method supports the full range of generation controls: sampling settings, stop conditions, token healing, banned strings, constrained decoding filters, and classifier-free guidance.
Usage
Use generate() for any blocking text generation task:
- Single prompt completion
- Batch generation of multiple prompts
- Constrained generation with grammar/JSON filters
- Generation with custom sampling parameters
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/generator/dynamic.py
- Lines: L541-627
Signature
def generate(
self,
prompt: str | list[str],
max_new_tokens: int,
min_new_tokens: int = 0,
seed: int | None = None,
gen_settings: ExLlamaV2Sampler.Settings | None = None,
token_healing: bool = False,
encode_special_tokens: bool = False,
decode_special_tokens: bool = False,
stop_conditions: list | None = None,
add_bos: bool = False,
abort_event: threading.Event | None = None,
completion_only: bool = False,
filters: list | None = None,
filter_prefer_eos: bool = False,
return_last_results: bool = False,
embeddings: list | None = None,
banned_strings: list[str] | None = None,
**kwargs,
) -> str | list[str]:
...
Import
from exllamav2.generator import ExLlamaV2DynamicGenerator
# generate() is a method on the ExLlamaV2DynamicGenerator instance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str or list[str] | Yes | Text prompt(s) to generate from. List input produces batched generation. |
| max_new_tokens | int | Yes | Maximum number of tokens to generate per prompt |
| min_new_tokens | int | No (default 0) | Minimum tokens to generate before allowing stop conditions |
| seed | int or None | No (default None) | Random seed for reproducible sampling |
| gen_settings | ExLlamaV2Sampler.Settings | No (default None) | Sampling configuration (temperature, top-k, top-p, etc.); None uses defaults |
| token_healing | bool | No (default False) | Enable token healing at prompt-completion boundary |
| encode_special_tokens | bool | No (default False) | Encode special tokens appearing in the prompt text (e.g. <\|im_start\|>) as token IDs rather than literal text |
| decode_special_tokens | bool | No (default False) | Include special tokens in output text |
| stop_conditions | list or None | No (default None) | List of stop strings or token IDs that terminate generation |
| add_bos | bool | No (default False) | Prepend beginning-of-sequence token to prompt |
| abort_event | threading.Event or None | No (default None) | Event to signal early termination from another thread |
| completion_only | bool | No (default False) | Return only the generated text, excluding the prompt |
| filters | list or None | No (default None) | Constrained decoding filters (grammar, JSON schema, regex) |
| filter_prefer_eos | bool | No (default False) | Prefer EOS token when filters allow it |
| return_last_results | bool | No (default False) | Return detailed result objects instead of plain strings |
| banned_strings | list[str] or None | No (default None) | Strings that must not appear in the generated output |
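The interplay of `stop_conditions` and `completion_only` can be illustrated with a small pure-Python helper. `apply_stops` below is a hypothetical stand-in for post-processing, not the exllamav2 implementation (which matches stop conditions incrementally during decoding): output is truncated at the earliest stop string (the stop string itself is excluded), and `completion_only` strips the prompt prefix.

```python
# Toy illustration of stop-string and completion_only semantics
# (hypothetical helper, not the exllamav2 implementation).
def apply_stops(prompt: str, full_text: str,
                stop_strings: list[str], completion_only: bool) -> str:
    completion = full_text[len(prompt):]
    # Truncate at the earliest stop string; the stop string is excluded.
    for stop in stop_strings:
        idx = completion.find(stop)
        if idx != -1:
            completion = completion[:idx]
    return completion if completion_only else prompt + completion

full = "Q: Hi\nA: Hello there.\nQ: Bye"
print(apply_stops("Q: Hi\nA:", full, ["\nQ:"], completion_only=True))
# → " Hello there."
```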
Outputs
| Name | Type | Description |
|---|---|---|
| result | str | Generated text (when prompt is a single string) |
| results | list[str] | List of generated texts (when prompt is a list of strings) |
| results (detailed) | list[dict] | Detailed result objects when return_last_results=True, containing text, token counts, stop reasons, etc. |
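The str-in/str-out versus list-in/list-out contract means downstream code may want a uniform shape. The sketch below uses `fake_generate`, a hypothetical stand-in that mimics only the return-shape behavior of `generate()` (no actual model is involved):

```python
# fake_generate is a toy stand-in mimicking generate()'s return-shape
# contract: a single string in gives a string out, a list gives a list.
def fake_generate(prompt):
    if isinstance(prompt, str):
        return prompt.upper()       # stand-in for one generated completion
    return [p.upper() for p in prompt]

def as_list(result):
    # Normalize to a list for uniform downstream handling.
    return [result] if isinstance(result, str) else result

print(as_list(fake_generate("hi")))        # → ['HI']
print(as_list(fake_generate(["a", "b"])))  # → ['A', 'B']
```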
Usage Examples
Simple Generation
from exllamav2.generator import ExLlamaV2DynamicGenerator
# Assuming generator is already initialized
output = generator.generate(
prompt="Once upon a time",
max_new_tokens=200,
add_bos=True,
)
print(output)
Generation with Sampling Settings
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler
# Assuming generator and tokenizer are already initialized
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9
settings.top_k = 50
settings.token_repetition_penalty = 1.05
output = generator.generate(
prompt="Explain quantum computing:",
max_new_tokens=500,
gen_settings=settings,
stop_conditions=[tokenizer.eos_token_id],
add_bos=True,
)
print(output)
Batched Generation
from exllamav2.generator import ExLlamaV2Sampler
# Assuming generator is already initialized
prompts = [
"Write a haiku about mountains:",
"Write a haiku about the ocean:",
"Write a haiku about the stars:",
]
outputs = generator.generate(
prompt=prompts,
max_new_tokens=50,
gen_settings=ExLlamaV2Sampler.Settings.greedy(),
add_bos=True,
)
for prompt, output in zip(prompts, outputs):
print(f"{prompt}\n{output}\n")
Generation with Stop Conditions
# Assuming generator and tokenizer are already initialized
output = generator.generate(
prompt="User: What is Python?\nAssistant:",
max_new_tokens=300,
stop_conditions=[
tokenizer.eos_token_id,
"User:",
"\n\n",
],
token_healing=True,
add_bos=True,
completion_only=True,
)
print(output)
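Cooperative Abort with abort_event

The `abort_event` parameter lets another thread cancel a blocking call. The sketch below is a hypothetical stand-in: `blocking_generate` is not the exllamav2 method, it only mimics the documented behavior of checking a `threading.Event` each iteration and returning early once it is set.

```python
import threading
import time

# Toy sketch of the abort_event pattern: the blocking loop checks the
# event each step and terminates early when it is set.
# blocking_generate is a stand-in, not the exllamav2 implementation.
def blocking_generate(max_steps: int, abort_event: threading.Event) -> list[str]:
    tokens = []
    for i in range(max_steps):
        if abort_event.is_set():
            break                    # early termination, like generate()
        tokens.append(f"tok{i}")
        time.sleep(0.01)             # simulate one decoding step
    return tokens

abort = threading.Event()
# A watchdog thread signals the abort after ~50 ms.
threading.Timer(0.05, abort.set).start()
out = blocking_generate(1000, abort)
print(len(out) < 1000)  # aborted well before the token budget
```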
Related Pages
Implements Principle
Requires Environment
- Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
- Environment:Turboderp_org_Exllamav2_Flash_Attention_Backend