Implementation:Turboderp org Exllamav2 ExLlamaV2BaseGenerator

From Leeroopedia
Knowledge Sources
Domains Text Generation, Inference
Last Updated 2026-02-15 00:00 GMT

Overview

ExLlamaV2BaseGenerator provides a synchronous, batch-capable text generation interface built on top of an ExLlamaV2 model, a KV cache, and a tokenizer.

Description

ExLlamaV2BaseGenerator is the simplest generation entry point in the ExLlamaV2 library. It wraps the model's forward pass, token sampling, and decoding loop into the single generate_simple() method. The class holds references to an ExLlamaV2 model, an ExLlamaV2CacheBase cache, and an ExLlamaV2Tokenizer, along with internal state for tracking the running sequence (sequence_ids) and an optional abort_event for cancellation.

The generate_simple() method accepts either a single prompt string or a list of prompt strings for batched inference. It handles:

  • Token healing: Removes the last token of the prompt and regenerates it with a prefix constraint to avoid tokenization-boundary artifacts.
  • Input embeddings: Supports injecting external embeddings (e.g. from a vision encoder) at a placeholder position marked by {{EMBED_HERE}} in the prompt string.
  • LoRA application: Accepts one or more ExLlamaV2Lora adapters that are applied during both prefill and generation.
  • Filters: Applies ExLlamaV2Filter objects for constrained decoding (e.g. JSON schema, grammar-guided generation).
  • Post-sampling hooks: Invokes ExLlamaV2PostSamplingHook callbacks after each token is sampled, allowing external logic to inspect or modify the selected token.
  • Abort support: An optional threading.Event can be passed to cancel long-running generation mid-stream.
  • Prompt truncation: Automatically truncates the prompt from the left if the total sequence (prompt + num_tokens) would exceed max_seq_len.
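
Two of the steps above, prompt truncation and the prefix constraint behind token healing, are easy to picture on plain token lists. The following is a minimal sketch under stated assumptions, not the library's code: truncate_left, healing_candidates, and the toy vocabulary are hypothetical names invented for illustration, and tokens are plain Python integers rather than tensors.

```python
def truncate_left(prompt_ids, num_tokens, max_seq_len):
    """Drop the oldest prompt tokens so prompt + num_tokens fits in max_seq_len."""
    overflow = len(prompt_ids) + num_tokens - max_seq_len
    return prompt_ids[overflow:] if overflow > 0 else prompt_ids


def healing_candidates(prompt_ids, vocab):
    """Remove the last prompt token and list the tokens allowed to replace it.

    A replacement is allowed if its text starts with the removed token's text,
    which is the prefix constraint that token healing relies on.
    """
    *kept, last = prompt_ids
    prefix = vocab[last]
    allowed = sorted(t for t, text in vocab.items() if text.startswith(prefix))
    return kept, allowed


# Toy vocabulary (assumption for illustration only): token id -> text
vocab = {0: "Hello", 1: " wor", 2: " world", 3: " word", 4: "!"}

truncated = truncate_left([10, 11, 12, 13, 14], num_tokens=3, max_seq_len=6)
kept, allowed = healing_candidates([0, 1], vocab)
print(truncated)      # the two oldest tokens are dropped
print(kept, allowed)  # " wor" may be regenerated as " wor", " world" or " word"
```

In the sketch, the healed token is then sampled only from the allowed set, so a prompt ending in a partial word can be completed with the token the model would naturally have used.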

The internal _gen_begin_base() method performs the prefill pass (all prompt tokens except the last) to populate the cache before the autoregressive loop begins.
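
The split between prefill and the autoregressive loop can be sketched with a toy stand-in for the model. Everything here is an assumption made for illustration: ToyModel, its prefill() and step() methods, and the rule that the next token is the previous one plus one are hypothetical and do not mirror the library's internals. The point is only that the last prompt token is held back from prefill and fed as the first step of the loop.

```python
class ToyModel:
    """Hypothetical stand-in for a model: 'predicts' (token + 1) % vocab_size."""

    def __init__(self, vocab_size=10):
        self.vocab_size = vocab_size
        self.cache = []  # tokens whose keys/values a real cache would hold

    def prefill(self, tokens):
        # Populate the cache without producing any output token
        self.cache.extend(tokens)

    def step(self, token):
        # Process one token, cache it, and return the next token
        self.cache.append(token)
        return (token + 1) % self.vocab_size


def generate(model, prompt_ids, num_tokens):
    model.cache.clear()
    model.prefill(prompt_ids[:-1])        # all prompt tokens except the last
    sequence = list(prompt_ids)
    for _ in range(num_tokens):
        next_token = model.step(sequence[-1])  # the first step feeds the last prompt token
        sequence.append(next_token)
    return sequence


model = ToyModel()
sequence = generate(model, [3, 4, 5], num_tokens=3)
print(sequence)
print(model.cache)  # the most recently sampled token has not been fed back yet
```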

Usage

Use ExLlamaV2BaseGenerator for simple, synchronous generation tasks such as benchmarking, single-turn completions, or batch inference. For streaming output, use ExLlamaV2StreamingGenerator; for dynamic batching with paged attention, use ExLlamaV2DynamicGenerator.

Code Reference

Source Location

Signature

class ExLlamaV2BaseGenerator:

    model: ExLlamaV2
    cache: ExLlamaV2CacheBase
    tokenizer: ExLlamaV2Tokenizer
    sequence_ids: torch.Tensor | None
    abort_event: threading.Event | None

    def __init__(
        self,
        model: ExLlamaV2,
        cache: ExLlamaV2CacheBase,
        tokenizer: ExLlamaV2Tokenizer
    ): ...

    def warmup(self): ...
    def full(self) -> bool: ...

    def generate_simple(
        self,
        prompt: str | list,
        gen_settings: ExLlamaV2Sampler.Settings,
        num_tokens: int,
        seed: int | None = None,
        token_healing: bool = False,
        encode_special_tokens: bool = False,
        decode_special_tokens: bool = False,
        loras: ExLlamaV2Lora | list[ExLlamaV2Lora] | None = None,
        stop_token: int | None = -1,
        add_bos: bool = False,
        abort_event: threading.Event | None = None,
        input_embeddings: torch.Tensor | None = None,
        completion_only: bool = False,
        filters: list[ExLlamaV2Filter] | None = None,
        filter_prefer_eos: bool = False,
    ) -> str | list[str]: ...

Import

from exllamav2.generator import ExLlamaV2BaseGenerator

I/O Contract

Inputs (generate_simple)

| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str or list[str] | Yes | Input prompt(s). A list triggers batched generation; batch_size equals len(prompt). |
| gen_settings | ExLlamaV2Sampler.Settings | Yes | Sampling settings (temperature, top-k, top-p, repetition penalty, post_sampling_hooks, etc.) |
| num_tokens | int | Yes | Maximum number of tokens to generate |
| seed | int or None | No | Random seed for sampling RNG (does not guarantee full determinism) |
| token_healing | bool | No (default False) | Regenerate the last prompt token with a prefix constraint to heal tokenization boundaries |
| encode_special_tokens | bool | No (default False) | If True, special tokens represented as text in the prompt are encoded as actual special tokens |
| decode_special_tokens | bool | No (default False) | If True, special tokens in the output are decoded to their text representation |
| loras | ExLlamaV2Lora or list or None | No | LoRA adapter(s) to apply during prefill and generation |
| stop_token | int or None | No (default -1) | Token ID that terminates generation; -1 means the tokenizer's EOS token; None disables the stop token |
| add_bos | bool | No (default False) | Prepend the tokenizer's BOS token to the encoded prompt |
| abort_event | threading.Event or None | No | Event to signal generation cancellation from another thread |
| input_embeddings | torch.Tensor or None | No | Tensor of shape (batch_size, n, hidden_size) inserted at the {{EMBED_HERE}} placeholder |
| completion_only | bool | No (default False) | If True, only the generated completion is returned (prompt text excluded) |
| filters | list[ExLlamaV2Filter] or None | No | Constrained-decoding filters applied during sampling |
| filter_prefer_eos | bool | No (default False) | If True, always sample EOS as soon as it is allowed by the active filters |
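
Several of these inputs accept more than one form. The sketch below shows one way the documented conventions could be normalised before generation starts. The helper names (resolve_stop_token, normalize_loras, batch_size_of) and the EOS id of 2 are assumptions for illustration and are not part of the library's API.

```python
def resolve_stop_token(stop_token, eos_token_id):
    """-1 selects the tokenizer's EOS token; None disables the stop token."""
    return eos_token_id if stop_token == -1 else stop_token


def normalize_loras(loras):
    """Accept a single adapter, a list of adapters, or None."""
    if loras is None or isinstance(loras, list):
        return loras
    return [loras]


def batch_size_of(prompt):
    """A single string is a batch of one; a list sets the batch size."""
    return 1 if isinstance(prompt, str) else len(prompt)


EOS_TOKEN_ID = 2  # assumed value; a real tokenizer reports its own EOS id

print(resolve_stop_token(-1, EOS_TOKEN_ID))    # default: stop at EOS
print(resolve_stop_token(None, EOS_TOKEN_ID))  # stop token disabled
print(normalize_loras("adapter"))              # a single adapter becomes a one-item list
print(batch_size_of(["a", "b", "c"]))
```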

Outputs

| Name | Type | Description |
|---|---|---|
| result | str or list[str] | Generated text. Returns a single string when prompt is a string; returns a list of strings when prompt is a list. |

Usage Examples

Basic Usage

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Load the model configuration and weights from a local model directory
config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)
model.load()

# The tokenizer and KV cache are created separately and handed to the generator
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model, max_seq_len=4096)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()  # dummy forward pass so CUDA is fully initialised

# Sampling settings
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7
settings.top_p = 0.9

# Generate up to 200 new tokens and return only the completion
output = generator.generate_simple(
    prompt="Explain quantum computing in simple terms:",
    gen_settings=settings,
    num_tokens=200,
    token_healing=True,
    completion_only=True
)
print(output)

Batched Generation

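prompts = [
    "Write a haiku about mountains:",
    "Write a haiku about the ocean:",
    "Write a haiku about the stars:",
]

# The cache must have room for every sequence in the batch,
# so allocate one sized to the number of prompts
batch_cache = ExLlamaV2Cache(model, batch_size=len(prompts), max_seq_len=4096)
batch_generator = ExLlamaV2BaseGenerator(model, batch_cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

# Passing a list returns a list with one output string per prompt
outputs = batch_generator.generate_simple(
    prompt=prompts,
    gen_settings=settings,
    num_tokens=50,
    seed=42
)

for prompt, output in zip(prompts, outputs):
    print(f"{prompt}\n{output}\n")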

Generation with LoRA and Abort

import threading
from exllamav2 import ExLlamaV2Lora

lora = ExLlamaV2Lora.from_directory(model, "/path/to/lora")
abort = threading.Event()

# In another thread: abort.set() to cancel generation

output = generator.generate_simple(
    prompt="Tell me a story:",
    gen_settings=settings,
    num_tokens=500,
    loras=lora,
    abort_event=abort,
    completion_only=True
)
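
Because generate_simple() blocks the calling thread, the abort signal has to come from somewhere else. The following is a minimal sketch of one way to do that with threading.Timer from the standard library. slow_generate is a hypothetical stand-in for a generation loop that checks the event on each step; it is used here so the pattern can run without a model.

```python
import threading
import time


def slow_generate(num_tokens, abort_event=None, step_seconds=0.01):
    """Hypothetical stand-in for a generation loop that polls abort_event."""
    produced = []
    for i in range(num_tokens):
        if abort_event is not None and abort_event.is_set():
            break  # cancelled from another thread
        time.sleep(step_seconds)  # simulates the cost of one decoding step
        produced.append(i)
    return produced


abort = threading.Event()

# Set the event from a background thread after a short timeout
timer = threading.Timer(0.05, abort.set)
timer.start()

tokens = slow_generate(1000, abort_event=abort)
timer.cancel()  # no effect if the timer already fired

print(f"stopped after {len(tokens)} of 1000 steps")
```

With the real generator, the same Event object would be passed as abort_event, and the timer could be replaced by any other thread, such as a request handler reacting to a client disconnect.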

Key Methods

| Method | Description |
|---|---|
| __init__(model, cache, tokenizer) | Stores references to the model, cache, and tokenizer; initialises sequence_ids to None |
| warmup() | Runs a single forward pass with dummy input to ensure CUDA is fully initialised |
| full() | Returns True if the current sequence has reached max_seq_len |
| generate_simple(...) | Main generation method: tokenizes prompt, prefills cache, runs autoregressive sampling loop, decodes and returns text |
| _gen_begin_base(input_ids, mask, loras, ...) | Internal method that resets the cache and runs the prefill forward pass on all prompt tokens except the last |
