Implementation:Turboderp_org_Exllamav2_ExLlamaV2Filter
| Knowledge Sources | |
|---|---|
| Domains | Filtering, Constrained_Generation |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Abstract base class for all token filters in ExLlamaV2's generation pipeline, defining the interface for constraining token selection during autoregressive decoding.
Description
ExLlamaV2Filter establishes the contract that every token filter must implement. Filters restrict which tokens the sampler may select at each generation step, enabling constrained generation patterns such as grammar enforcement, prefix matching, and multiple-choice selection.
Key methods:
- __init__(model, tokenizer) -- Stores references to the ExLlamaV2 model and ExLlamaV2Tokenizer, initialises sequence_str to an empty string and background_result to None.
- clone(c=None) -- Creates a shallow copy of the filter, preserving model, tokenizer, and sequence state. Subclasses override this to copy their own state.
- begin(prefix_str) -- Abstract method called at the start of each generation to reset filter state. Must be overridden by subclasses.
- feed(token) -- Abstract method called after each token is selected, allowing the filter to update its internal state. Must be overridden by subclasses.
- next() -- Abstract method that returns a tuple of (allowed_token_ids, end_token_ids) for the current state. Must be overridden by subclasses.
- use_background_worker() -- Returns False by default. Subclasses return True to indicate that next() should be scheduled asynchronously to overlap with CUDA forward passes. Recommended for CPU-intensive filters like grammar constraints.
- can_mask_logits() -- Returns False by default. Subclasses return True if they can directly manipulate the logit tensor rather than returning token ID sets.
- prepare_logit_mask() -- Called in place of next() to precompute a logit mask when can_mask_logits() is True.
- mask_logits(logits) -- Directly sets excluded logits to negative infinity in the provided tensor.
- background_next(pool) -- Submits next() to a ThreadPoolExecutor for asynchronous execution.
- background_drop() -- Clears any pending asynchronous result, used when a filter forces EOS selection.
- get_next(mask=False) -- Returns the result of next() (or prepare_logit_mask()) either directly or from a pending background computation.
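The contract above can be sketched as a minimal abstract base plus a toy subclass. This is a hypothetical simplification for illustration only: the real class in base.py also carries the logit-masking and background-worker plumbing, and `DigitsOnlyFilter` is not part of ExLlamaV2.

```python
# Hypothetical sketch of the filter contract; not the actual ExLlamaV2 source.
class FilterBase:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.sequence_str = ""          # text accepted so far
        self.background_result = None   # pending async result, if any

    def begin(self, prefix_str):
        raise NotImplementedError()     # reset state at generation start

    def feed(self, token):
        raise NotImplementedError()     # advance state after each token

    def next(self):
        raise NotImplementedError()     # -> (allowed_token_ids, end_token_ids)

    def use_background_worker(self):
        return False                    # subclasses opt in for CPU-heavy work

    def can_mask_logits(self):
        return False                    # subclasses opt in for direct masking


class DigitsOnlyFilter(FilterBase):
    """Toy subclass: only token IDs 0-9 are allowed, no end tokens."""

    def begin(self, prefix_str):
        self.sequence_str = prefix_str

    def feed(self, token):
        self.sequence_str += str(token)

    def next(self):
        return set(range(10)), set()


f = DigitsOnlyFilter(model=None, tokenizer=None)
f.begin("")
allowed, end = f.next()
```

The key design point is that `next()` is pull-based: the sampler asks the filter for the currently legal token set at each step, rather than the filter pushing constraints ahead of time.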
Usage
Use ExLlamaV2Filter as the base class when implementing custom token filters. All built-in filters (ExLlamaV2PrefixFilter, ExLlamaV2SelectFilter, ExLlamaV2GrammarFilter, etc.) extend this class. Instances are passed to ExLlamaV2DynamicJob or sampler settings via the filters parameter.
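The background-worker path (`use_background_worker()`, `background_next()`, `get_next()`) can be illustrated in isolation with the standard library. This is a hedged sketch of the scheduling pattern, not the generator's actual code; `SlowFilter` is invented for the example, and the real `get_next()` also accepts a `mask` argument.

```python
from concurrent.futures import ThreadPoolExecutor

class SlowFilter:
    """Stand-in for a CPU-heavy filter such as a grammar constraint."""

    def __init__(self):
        self.background_result = None

    def next(self):
        # Imagine an expensive grammar walk here.
        return {1, 2, 3}, set()

    def use_background_worker(self):
        return True

    def background_next(self, pool):
        # Schedule next() so it can overlap with the model's forward pass.
        self.background_result = pool.submit(self.next)

    def get_next(self):
        # Collect the background result if one is pending, else compute now.
        if self.background_result is not None:
            result = self.background_result.result()
            self.background_result = None
            return result
        return self.next()


pool = ThreadPoolExecutor(max_workers=1)
filt = SlowFilter()
filt.background_next(pool)      # kicked off before the forward pass
# ... the CUDA forward pass would run here, hiding the filter's latency ...
allowed, end = filt.get_next()  # collected once logits are ready
pool.shutdown()
```

This overlap is why `use_background_worker()` is recommended for grammar-style filters: their per-step cost can be hidden behind the GPU work instead of serialising with it.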
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/generator/filters/base.py
- Lines: L1-119
Signature
class ExLlamaV2Filter:

    model: ExLlamaV2
    tokenizer: ExLlamaV2Tokenizer
    sequence_str: str
    background_result: Future | None
    allow_return_type_list: bool = True

    def __init__(self,
                 model: ExLlamaV2,
                 tokenizer: ExLlamaV2Tokenizer):
        ...

    def clone(self, c=None) -> ExLlamaV2Filter:
        ...

    def begin(self, prefix_str: str) -> None:
        ...

    def feed(self, token) -> None:
        ...

    def next(self) -> tuple[set | list, set | list]:
        ...

    def use_background_worker(self) -> bool:
        ...

    def can_mask_logits(self) -> bool:
        ...

    def prepare_logit_mask(self) -> bool:
        ...

    def mask_logits(self, logits: torch.Tensor) -> torch.Tensor:
        ...

    def background_next(self, pool: ThreadPoolExecutor) -> None:
        ...

    def get_next(self, mask: bool = False) -> tuple:
        ...
Import
from exllamav2.generator.filters import ExLlamaV2Filter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | The loaded ExLlamaV2 model instance |
| tokenizer | ExLlamaV2Tokenizer | Yes | The tokenizer associated with the model |
| prefix_str | str | Yes (begin) | String prefix already generated before the filter begins constraining |
| token | int or tensor | Yes (feed) | The token ID selected by the sampler at each step |
| logits | torch.Tensor | Yes (mask_logits) | The raw logit tensor to mask in-place |
| pool | ThreadPoolExecutor | Yes (background_next) | Thread pool for scheduling asynchronous filter computation |
| mask | bool | No (get_next, default False) | If True, calls prepare_logit_mask() instead of next() |
Outputs
| Name | Type | Description |
|---|---|---|
| allowed_tokens, end_tokens | tuple[set or list, set or list] | From next(): the set of allowed token IDs and the set of tokens that would complete the constraint |
| use_background | bool | From use_background_worker(): whether to run next() asynchronously |
| can_mask | bool | From can_mask_logits(): whether the filter supports direct logit masking |
| masked_logits | torch.Tensor | From mask_logits(): logit tensor with disallowed positions set to -inf |
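The masking contract in the last row can be shown concretely. The sketch below is torch-free for portability (the real `mask_logits()` operates in place on a `torch.Tensor`), and `mask_logits` here is a hypothetical stand-in, not the library function:

```python
import math

def mask_logits(logits, allowed_ids):
    # Set every disallowed position to -inf so that softmax assigns it
    # zero probability; allowed positions keep their raw scores.
    for i in range(len(logits)):
        if i not in allowed_ids:
            logits[i] = -math.inf
    return logits

logits = [0.5, 2.0, -1.0, 3.2]
masked = mask_logits(logits, allowed_ids={1, 3})
# positions 0 and 2 are now -inf; positions 1 and 3 are untouched
```

Masking at `-inf` (rather than, say, a large negative constant) guarantees the excluded tokens receive exactly zero probability regardless of temperature or other sampler settings.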
Usage Examples
Implementing a Custom Filter
from exllamav2.generator.filters import ExLlamaV2Filter

class MyCustomFilter(ExLlamaV2Filter):

    def __init__(self, model, tokenizer, allowed_words):
        super().__init__(model, tokenizer)
        # Pre-encode allowed words to token IDs
        self.allowed_ids = set()
        for word in allowed_words:
            ids = tokenizer.encode(word, add_bos=False)
            for token_id in ids[0].tolist():
                self.allowed_ids.add(token_id)

    def begin(self, prefix_str):
        pass  # No state to reset

    def feed(self, token):
        pass  # No state to update

    def next(self):
        return self.allowed_ids, set()
Using a Filter with a DynamicJob
from exllamav2.generator import ExLlamaV2DynamicJob

my_filter = MyCustomFilter(model, tokenizer, ["yes", "no", "maybe"])

job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    gen_settings=gen_settings,
    max_new_tokens=10,
    filters=[my_filter],
)
generator.enqueue(job)
Related Pages
Extended By
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2PrefixFilter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2SelectFilter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2TokenEnforcerFilter