Implementation:Turboderp org Exllamav2 ExLlamaV2BaseGenerator
| Knowledge Sources | |
|---|---|
| Domains | Text Generation, Inference |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
ExLlamaV2BaseGenerator provides a synchronous, batch-capable text generation interface built on top of an ExLlamaV2 model, a KV cache, and a tokenizer.
Description
ExLlamaV2BaseGenerator is the simplest generation entry point in the ExLlamaV2 library. It wraps the model's forward pass, token sampling, and decoding loop into the single generate_simple() method. The class holds references to an ExLlamaV2 model, an ExLlamaV2CacheBase cache, and an ExLlamaV2Tokenizer, along with internal state for tracking the running sequence (sequence_ids) and an optional abort_event for cancellation.
The generate_simple() method accepts either a single prompt string or a list of prompt strings for batched inference. It handles:
- Token healing: Removes the last token of the prompt and regenerates it with a prefix constraint to avoid tokenization-boundary artifacts.
- Input embeddings: Supports injecting external embeddings (e.g. from a vision encoder) at a placeholder position marked by {{EMBED_HERE}} in the prompt string.
- LoRA application: Accepts one or more ExLlamaV2Lora adapters that are applied during both prefill and generation.
- Filters: Applies ExLlamaV2Filter objects for constrained decoding (e.g. JSON schema, grammar-guided generation).
- Post-sampling hooks: Invokes ExLlamaV2PostSamplingHook callbacks after each token is sampled, allowing external logic to inspect or modify the selected token.
- Abort support: An optional threading.Event can be passed to cancel long-running generation mid-stream.
- Prompt truncation: Automatically truncates the prompt from the left if the total sequence (prompt + num_tokens) would exceed max_seq_len.
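The token-healing item above can be illustrated with a toy vocabulary. This is a minimal sketch, not the library's code: the vocabulary, the token IDs, and the helper `heal_candidates` are hypothetical, and it assumes the prefix constraint means that the first regenerated token must have text beginning with the text of the removed token.

```python
# Toy illustration of token healing; vocabulary and helper are hypothetical.
TOY_VOCAB = {
    0: "http",
    1: ":",
    2: "://",
    3: ":)",
    4: "//",
    5: " the",
}


def heal_candidates(prompt_ids, vocab):
    """Drop the last prompt token and return (kept_ids, allowed_ids).

    allowed_ids are the tokens whose text begins with the removed token's
    text, i.e. the prefix constraint for the first regenerated position.
    """
    if not prompt_ids:
        raise ValueError("prompt must contain at least one token")
    *kept, last = prompt_ids
    prefix = vocab[last]
    allowed = sorted(t for t, text in vocab.items() if text.startswith(prefix))
    return kept, allowed


# The prompt "http:" is tokenized as ["http", ":"]. After healing, the model
# may choose "://" as a single token instead of being forced to continue ":".
kept, allowed = heal_candidates([0, 1], TOY_VOCAB)
print(kept, allowed)
```

In this toy setup the sampler would then choose only among the allowed IDs for the first generated position and proceed without the constraint afterwards.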
The internal _gen_begin_base() method performs the prefill pass (all prompt tokens except the last) to populate the cache before the autoregressive loop begins.
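The split between prefill and the autoregressive loop, together with the left-truncation rule described above, can be sketched with plain Python lists. This is a simplified stand-in under stated assumptions: `toy_forward` and `generate_ids` are hypothetical names, the "cache" is just a list of token IDs, and sampling is replaced by a deterministic rule.

```python
def toy_forward(ids):
    """Stand-in for a model forward pass: next token = (last token + 1) % 10."""
    return (ids[-1] + 1) % 10


def generate_ids(prompt_ids, num_tokens, max_seq_len):
    """Return (possibly truncated prompt, generated token IDs)."""
    # Left-truncate so that prompt + num_tokens fits within max_seq_len.
    overflow = len(prompt_ids) + num_tokens - max_seq_len
    if overflow > 0:
        prompt_ids = prompt_ids[overflow:]
    if not prompt_ids:
        raise ValueError("no prompt tokens left after truncation")

    # Prefill: every prompt token except the last populates the "cache".
    cache = list(prompt_ids[:-1])
    # The last prompt token is the input to the first step of the loop.
    next_input = prompt_ids[-1]

    generated = []
    for _ in range(num_tokens):
        cache.append(next_input)
        token = toy_forward(cache)
        generated.append(token)
        next_input = token
    return prompt_ids, generated


# Five prompt tokens, three new tokens, but room for only six in total:
# the two oldest prompt tokens are dropped before prefill.
print(generate_ids([1, 2, 3, 4, 5], num_tokens=3, max_seq_len=6))
```

Holding back the last prompt token means the first loop iteration produces the logits for the first new token, so prefill and generation share the same per-step code path.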
Usage
Use ExLlamaV2BaseGenerator for simple, synchronous generation tasks such as benchmarking, single-turn completions, or batch inference. For streaming output, use ExLlamaV2StreamingGenerator; for dynamic batching with paged attention, use ExLlamaV2DynamicGenerator.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/generator/base.py
- Lines: 20-355
Signature
class ExLlamaV2BaseGenerator:
    model: ExLlamaV2
    cache: ExLlamaV2CacheBase
    tokenizer: ExLlamaV2Tokenizer
    sequence_ids: torch.Tensor | None
    abort_event: threading.Event | None
    def __init__(
        self,
        model: ExLlamaV2,
        cache: ExLlamaV2CacheBase,
        tokenizer: ExLlamaV2Tokenizer
    ): ...
    def warmup(self): ...
    def full(self) -> bool: ...
    def generate_simple(
        self,
        prompt: str | list,
        gen_settings: ExLlamaV2Sampler.Settings,
        num_tokens: int,
        seed: int | None = None,
        token_healing: bool = False,
        encode_special_tokens: bool = False,
        decode_special_tokens: bool = False,
        loras: ExLlamaV2Lora | list[ExLlamaV2Lora] | None = None,
        stop_token: int | None = -1,
        add_bos: bool = False,
        abort_event: threading.Event | None = None,
        input_embeddings: torch.Tensor | None = None,
        completion_only: bool = False,
        filters: list[ExLlamaV2Filter] | None = None,
        filter_prefer_eos: bool = False,
    ) -> str | list[str]: ...
Import
from exllamav2.generator import ExLlamaV2BaseGenerator
I/O Contract
Inputs (generate_simple)
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str or list[str] | Yes | Input prompt(s). A list triggers batched generation; batch_size equals len(prompt). |
| gen_settings | ExLlamaV2Sampler.Settings | Yes | Sampling settings (temperature, top-k, top-p, repetition penalty, post_sampling_hooks, etc.) |
| num_tokens | int | Yes | Maximum number of tokens to generate |
| seed | int or None | No | Random seed for sampling RNG (does not guarantee full determinism) |
| token_healing | bool | No (default False) | Regenerate the last prompt token with a prefix constraint to heal tokenization boundaries |
| encode_special_tokens | bool | No (default False) | If True, special tokens represented as text in the prompt are encoded as actual special tokens |
| decode_special_tokens | bool | No (default False) | If True, special tokens in the output are decoded to their text representation |
| loras | ExLlamaV2Lora or list or None | No | LoRA adapter(s) to apply during prefill and generation |
| stop_token | int or None | No (default -1) | Token ID that terminates generation; -1 means the tokenizer's EOS token; None disables stop token |
| add_bos | bool | No (default False) | Prepend the tokenizer's BOS token to the encoded prompt |
| abort_event | threading.Event or None | No | Event to signal generation cancellation from another thread |
| input_embeddings | torch.Tensor or None | No | Tensor of shape (batch_size, n, hidden_size) inserted at the {{EMBED_HERE}} placeholder |
| completion_only | bool | No (default False) | If True, only the generated completion is returned (prompt text excluded) |
| filters | list[ExLlamaV2Filter] or None | No | Constrained-decoding filters applied during sampling |
| filter_prefer_eos | bool | No (default False) | If True, always sample EOS as soon as it is allowed by the active filters |
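The three cases documented for stop_token in the table above (-1, an explicit ID, or None) can be captured in a few lines. This is an illustrative sketch only: `resolve_stop_token` is a hypothetical helper, not a function in the library, and `eos_token_id` stands for whatever EOS ID the tokenizer reports.

```python
def resolve_stop_token(stop_token, eos_token_id):
    """Map the stop_token argument to a concrete token ID, or None.

    -1   -> use the tokenizer's EOS token
    int  -> use that token ID as given
    None -> no stop token; generation runs for the full num_tokens
    """
    if stop_token == -1:
        return eos_token_id
    return stop_token


# With a tokenizer whose EOS ID is 2:
print(resolve_stop_token(-1, eos_token_id=2))
print(resolve_stop_token(13, eos_token_id=2))
print(resolve_stop_token(None, eos_token_id=2))
```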
Outputs
| Name | Type | Description |
|---|---|---|
| result | str or list[str] | Generated text. Returns a single string when prompt is a string; returns a list of strings when prompt is a list. |
Usage Examples
Basic Usage
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler
config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model, max_seq_len=4096)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9
output = generator.generate_simple(
    prompt="Explain quantum computing in simple terms:",
    gen_settings=settings,
    num_tokens=200,
    token_healing=True,
    completion_only=True
)
print(output)
Batched Generation
prompts = [
    "Write a haiku about mountains:",
    "Write a haiku about the ocean:",
    "Write a haiku about the stars:",
]
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
outputs = generator.generate_simple(
    prompt=prompts,
    gen_settings=settings,
    num_tokens=50,
    seed=42
)
for prompt, output in zip(prompts, outputs):
    print(f"{prompt}\n{output}\n")
Generation with LoRA and Abort
import threading
from exllamav2 import ExLlamaV2Lora
lora = ExLlamaV2Lora.from_directory(model, "/path/to/lora")
abort = threading.Event()
# In another thread: abort.set() to cancel generation
output = generator.generate_simple(
    prompt="Tell me a story:",
    gen_settings=settings,
    num_tokens=500,
    loras=lora,
    abort_event=abort,
    completion_only=True
)
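Because generate_simple() is synchronous, the abort event has to be set from a different thread than the one running generation. The pattern can be tried without a model. In the sketch below, `slow_generate` is a hypothetical stand-in for a long decoding loop (it assumes the loop checks the event between tokens) and `run_with_timeout` is an illustrative helper that cancels it from a threading.Timer.

```python
import threading
import time


def slow_generate(num_tokens, abort_event=None, step_seconds=0.005):
    """Hypothetical stand-in for a long-running, synchronous decoding loop."""
    tokens = []
    for i in range(num_tokens):
        # Cooperative cancellation: check the event between tokens.
        if abort_event is not None and abort_event.is_set():
            break
        time.sleep(step_seconds)  # pretend to run one forward pass
        tokens.append(i)
    return tokens


def run_with_timeout(num_tokens, timeout_seconds):
    """Run slow_generate, cancelling it from a timer thread after a timeout."""
    abort = threading.Event()
    timer = threading.Timer(timeout_seconds, abort.set)
    timer.start()
    try:
        return slow_generate(num_tokens, abort_event=abort)
    finally:
        timer.cancel()  # harmless if the timer has already fired


# 1000 steps would take about 5 seconds; the timer cuts the run short.
partial = run_with_timeout(1000, timeout_seconds=0.05)
print(f"stopped after {len(partial)} tokens")
```

The same shape applies to the real call: create the event, start a timer or worker thread that calls abort.set(), and pass the event as abort_event.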
Key Methods
| Method | Description |
|---|---|
| __init__(model, cache, tokenizer) | Stores references to the model, cache, and tokenizer; initialises sequence_ids to None |
| warmup() | Runs a single forward pass with dummy input to ensure CUDA is fully initialised |
| full() | Returns True if the current sequence has reached max_seq_len |
| generate_simple(...) | Main generation method: tokenizes prompt, prefills cache, runs autoregressive sampling loop, decodes and returns text |
| _gen_begin_base(input_ids, mask, loras, ...) | Internal method that resets the cache and runs the prefill forward pass on all prompt tokens except the last |