Implementation:Turboderp org Exllamav2 ExLlamaV2BaseGenerator

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Text Generation, Inference
Last Updated	2026-02-15 00:00 GMT

Overview

ExLlamaV2BaseGenerator provides a synchronous, batch-capable text generation interface built on top of an ExLlamaV2 model, a KV cache, and a tokenizer.

Description

ExLlamaV2BaseGenerator is the simplest generation entry point in the ExLlamaV2 library. It wraps the model's forward pass, token sampling, and decoding loop into the single generate_simple() method. The class holds references to an ExLlamaV2 model, an ExLlamaV2CacheBase cache, and an ExLlamaV2Tokenizer, along with internal state for tracking the running sequence (sequence_ids) and an optional abort_event for cancellation.

The generate_simple() method accepts either a single prompt string or a list of prompt strings for batched inference. It handles:

Token healing: Removes the last token of the prompt and regenerates it with a prefix constraint to avoid tokenization-boundary artifacts.
Input embeddings: Supports injecting external embeddings (e.g. from a vision encoder) at a placeholder position marked by {{EMBED_HERE}} in the prompt string.
LoRA application: Accepts one or more ExLlamaV2Lora adapters that are applied during both prefill and generation.
Filters: Applies ExLlamaV2Filter objects for constrained decoding (e.g. JSON schema, grammar-guided generation).
Post-sampling hooks: Invokes ExLlamaV2PostSamplingHook callbacks after each token is sampled, allowing external logic to inspect or modify the selected token.
Abort support: An optional threading.Event can be passed to cancel long-running generation mid-stream.
Prompt truncation: Automatically truncates the prompt from the left if the total sequence (prompt + num_tokens) would exceed max_seq_len.

The internal _gen_begin_base() method performs the prefill pass (all prompt tokens except the last) to populate the cache before the autoregressive loop begins.
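
The control flow described above (left truncation, prefill on all but the last prompt token, then an autoregressive loop with stop-token and abort checks) can be sketched for a single sequence in plain Python. This is a simplified model under stated assumptions, not the library's code: generate_sketch and the next_token callable are hypothetical stand-ins for the forward pass plus sampling, the cache is a plain list, and whether the real method keeps the stop token in its output is not shown here.

```python
import threading
from typing import Callable, List, Optional


def generate_sketch(
    prompt_ids: List[int],
    next_token: Callable[[List[int]], int],  # hypothetical: forward pass + sampling
    num_tokens: int,
    max_seq_len: int,
    stop_token: Optional[int] = None,
    abort_event: Optional[threading.Event] = None,
) -> List[int]:
    # 1. Truncate from the left so prompt + num_tokens fits in max_seq_len
    overflow = len(prompt_ids) + num_tokens - max_seq_len
    if overflow > 0:
        prompt_ids = prompt_ids[overflow:]
    if not prompt_ids:
        raise ValueError("no prompt tokens left after truncation")

    # 2. Prefill: every prompt token except the last goes into the toy cache
    cache = list(prompt_ids[:-1])
    pending = prompt_ids[-1]

    # 3. Autoregressive loop: feed one pending token, sample the next
    generated: List[int] = []
    for _ in range(num_tokens):
        if abort_event is not None and abort_event.is_set():
            break
        cache.append(pending)
        token = next_token(cache)
        generated.append(token)
        if stop_token is not None and token == stop_token:
            break
        pending = token

    return prompt_ids + generated


# Toy "model" that always predicts last token + 1
result = generate_sketch([1, 2, 3, 4, 5], lambda seq: seq[-1] + 1, num_tokens=3, max_seq_len=6)
print(result)  # [3, 4, 5, 6, 7, 8]
```

In the example, two tokens are dropped from the left because 5 prompt tokens plus 3 new tokens would exceed a max_seq_len of 6.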

Usage

Use ExLlamaV2BaseGenerator for simple, synchronous generation tasks such as benchmarking, single-turn completions, or batch inference. For streaming output, use ExLlamaV2StreamingGenerator; for dynamic batching with paged attention, use ExLlamaV2DynamicGenerator.

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: exllamav2/generator/base.py
Lines: 20-355

Signature

class ExLlamaV2BaseGenerator:

    model: ExLlamaV2
    cache: ExLlamaV2CacheBase
    tokenizer: ExLlamaV2Tokenizer
    sequence_ids: torch.Tensor | None
    abort_event: threading.Event | None

    def __init__(
        self,
        model: ExLlamaV2,
        cache: ExLlamaV2CacheBase,
        tokenizer: ExLlamaV2Tokenizer
    ): ...

    def warmup(self): ...
    def full(self) -> bool: ...

    def generate_simple(
        self,
        prompt: str | list,
        gen_settings: ExLlamaV2Sampler.Settings,
        num_tokens: int,
        seed: int | None = None,
        token_healing: bool = False,
        encode_special_tokens: bool = False,
        decode_special_tokens: bool = False,
        loras: ExLlamaV2Lora | list[ExLlamaV2Lora] | None = None,
        stop_token: int | None = -1,
        add_bos: bool = False,
        abort_event: threading.Event | None = None,
        input_embeddings: torch.Tensor | None = None,
        completion_only: bool = False,
        filters: list[ExLlamaV2Filter] | None = None,
        filter_prefer_eos: bool = False,
    ) -> str | list[str]: ...

Import

from exllamav2.generator import ExLlamaV2BaseGenerator

I/O Contract

Inputs (generate_simple)

Name	Type	Required	Description
prompt	str or list[str]	Yes	Input prompt(s). A list triggers batched generation; batch_size equals len(prompt).
gen_settings	ExLlamaV2Sampler.Settings	Yes	Sampling settings (temperature, top-k, top-p, repetition penalty, post_sampling_hooks, etc.)
num_tokens	int	Yes	Maximum number of tokens to generate
seed	int or None	No	Random seed for sampling RNG (does not guarantee full determinism)
token_healing	bool	No (default False)	Regenerate the last prompt token with a prefix constraint to heal tokenization boundaries
encode_special_tokens	bool	No (default False)	If True, special tokens represented as text in the prompt are encoded as actual special tokens
decode_special_tokens	bool	No (default False)	If True, special tokens in the output are decoded to their text representation
loras	ExLlamaV2Lora or list or None	No	LoRA adapter(s) to apply during prefill and generation
stop_token	int or None	No (default -1)	Token ID that terminates generation; -1 selects the tokenizer's EOS token; None disables the stop token entirely
add_bos	bool	No (default False)	Prepend the tokenizer's BOS token to the encoded prompt
abort_event	threading.Event or None	No	Event to signal generation cancellation from another thread
input_embeddings	torch.Tensor or None	No	Tensor of shape (batch_size, n, hidden_size) inserted at the `{{EMBED_HERE}}` placeholder in the prompt
completion_only	bool	No (default False)	If True, only the generated completion is returned (prompt text excluded)
filters	list[ExLlamaV2Filter] or None	No	Constrained-decoding filters applied during sampling
filter_prefer_eos	bool	No (default False)	If True, always sample EOS as soon as it is allowed by the active filters
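
The three-way meaning of stop_token in the table above can be made explicit with a small helper. This is only a sketch of the documented convention; resolve_stop_token is a hypothetical name, and the EOS ID of 2 in the example is an arbitrary value chosen for illustration.

```python
from typing import Optional


def resolve_stop_token(stop_token: Optional[int], eos_token_id: int) -> Optional[int]:
    """Map the stop_token argument to the token ID that actually ends generation.

    -1   -> use the tokenizer's EOS token
    None -> no stop token (generation runs for the full num_tokens)
    else -> use the given token ID as-is
    """
    if stop_token == -1:
        return eos_token_id
    return stop_token


print(resolve_stop_token(-1, eos_token_id=2))    # 2
print(resolve_stop_token(None, eos_token_id=2))  # None
print(resolve_stop_token(13, eos_token_id=2))    # 13
```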

Outputs

Name	Type	Description
result	str or list[str]	Generated text. Returns a single string when prompt is a string; returns a list of strings when prompt is a list.

Usage Examples

Basic Usage

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
model.load()

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model, max_seq_len=4096)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

output = generator.generate_simple(
    prompt="Explain quantum computing in simple terms:",
    gen_settings=settings,
    num_tokens=200,
    token_healing=True,
    completion_only=True
)
print(output)

Batched Generation

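prompts = [
    "Write a haiku about mountains:",
    "Write a haiku about the ocean:",
    "Write a haiku about the stars:",
]

# The cache must have room for every sequence in the batch
# (ExLlamaV2Cache allocates batch_size=1 by default).
batch_cache = ExLlamaV2Cache(model, max_seq_len=4096, batch_size=len(prompts))
batch_generator = ExLlamaV2BaseGenerator(model, batch_cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

outputs = batch_generator.generate_simple(
    prompt=prompts,
    gen_settings=settings,
    num_tokens=50,
    seed=42
)

# completion_only defaults to False, so each output already starts with its prompt
for output in outputs:
    print(f"{output}\n")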

Generation with LoRA and Abort

import threading
from exllamav2 import ExLlamaV2Lora

lora = ExLlamaV2Lora.from_directory(model, "/path/to/lora")
abort = threading.Event()

# In another thread: abort.set() to cancel generation

output = generator.generate_simple(
    prompt="Tell me a story:",
    gen_settings=settings,
    num_tokens=500,
    loras=lora,
    abort_event=abort,
    completion_only=True
)
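
One way to trigger the abort from another thread is a timer that sets the event after a deadline. The helper below is a hedged sketch: run_with_timeout is a hypothetical name, and generate_fn stands for any callable that accepts an abort_event keyword argument, such as generator.generate_simple. What the call returns after an abort is not specified on this page, so callers should not assume a complete result.

```python
import threading
from typing import Any, Callable


def run_with_timeout(generate_fn: Callable[..., Any], timeout_s: float, **kwargs: Any) -> Any:
    """Call generate_fn with an abort_event that is set after timeout_s seconds."""
    abort = threading.Event()
    timer = threading.Timer(timeout_s, abort.set)  # runs abort.set() on a helper thread
    timer.daemon = True
    timer.start()
    try:
        return generate_fn(abort_event=abort, **kwargs)
    finally:
        # Stop the timer if generation finished before the deadline
        timer.cancel()
```

With the objects from the example above, a call could look like run_with_timeout(generator.generate_simple, 10.0, prompt="Tell me a story:", gen_settings=settings, num_tokens=500).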

Key Methods

Method	Description
__init__(model, cache, tokenizer)	Stores references to the model, cache, and tokenizer; initialises sequence_ids to None
warmup()	Runs a single forward pass with dummy input to ensure CUDA is fully initialised
full()	Returns True if the current sequence has reached max_seq_len
generate_simple(...)	Main generation method: tokenizes prompt, prefills cache, runs autoregressive sampling loop, decodes and returns text
_gen_begin_base(input_ids, mask, loras, ...)	Internal method that resets the cache and runs the prefill forward pass on all prompt tokens except the last

Related Pages

Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
