Implementation:Turboderp org Exllamav2 ExLlamaV2Filter

From Leeroopedia
Knowledge Sources
Domains Filtering, Constrained_Generation
Last Updated 2026-02-15 00:00 GMT

Overview

Abstract base class for all token filters in ExLlamaV2's generation pipeline, defining the interface for constraining token selection during autoregressive decoding.

Description

ExLlamaV2Filter establishes the contract that every token filter must implement. Filters restrict which tokens the sampler may select at each generation step, enabling constrained generation patterns such as grammar enforcement, prefix matching, and multiple-choice selection.

Key methods:

  • __init__(model, tokenizer) -- Stores references to the ExLlamaV2 model and ExLlamaV2Tokenizer, initialises sequence_str to an empty string and background_result to None.
  • clone(c=None) -- Creates a shallow copy of the filter, preserving model, tokenizer, and sequence state. Subclasses override this to copy their own state.
  • begin(prefix_str) -- Abstract method called at the start of each generation to reset filter state. Must be overridden by subclasses.
  • feed(token) -- Abstract method called after each token is selected, allowing the filter to update its internal state. Must be overridden by subclasses.
  • next() -- Abstract method that returns a tuple of (allowed_token_ids, end_token_ids) for the current state. Must be overridden by subclasses.
  • use_background_worker() -- Returns False by default. Subclasses return True to indicate that next() should be scheduled asynchronously to overlap with CUDA forward passes. Recommended for CPU-intensive filters like grammar constraints.
  • can_mask_logits() -- Returns False by default. Subclasses return True if they can directly manipulate the logit tensor rather than returning token ID sets.
  • prepare_logit_mask() -- Called in place of next() to precompute a logit mask when can_mask_logits() is True.
  • mask_logits(logits) -- Directly sets excluded logits to negative infinity in the provided tensor.
  • background_next(pool) -- Submits next() to a ThreadPoolExecutor for asynchronous execution.
  • background_drop() -- Clears any pending asynchronous result, used when a filter forces EOS selection.
  • get_next(mask=False) -- Returns the result of next() (or prepare_logit_mask()) either directly or from a pending background computation.

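The call order implied by the list above (begin() once, then next() and feed() at every step) can be sketched without ExLlamaV2 itself. The following is a minimal, self-contained illustration. ToyFilter, TOY_VOCAB, and toy_generate are hypothetical stand-ins rather than library code, and the loop treats an end token as "stop generating", which is a simplification of what the real generator does.

```python
# Hypothetical stand-ins: a toy vocabulary and a filter that mirrors the
# begin/feed/next contract described above. Not ExLlamaV2 code.
TOY_VOCAB = {0: "<eos>", 1: "yes", 2: "no", 3: "maybe", 4: "banana"}


class ToyFilter:
    """Allows exactly one answer word, then only the end-of-sequence token."""

    def __init__(self, allowed_words, eos_id=0):
        self.allowed_ids = {i for i, s in TOY_VOCAB.items() if s in allowed_words}
        self.eos_id = eos_id
        self.sequence_str = ""

    def begin(self, prefix_str):
        # Reset per-generation state.
        self.sequence_str = ""

    def feed(self, token):
        # Record the token the sampler actually picked.
        self.sequence_str += TOY_VOCAB[token]

    def next(self):
        # Returns (allowed_token_ids, end_token_ids).
        if self.sequence_str == "":
            return self.allowed_ids, set()
        return {self.eos_id}, {self.eos_id}


def toy_generate(flt, scores, max_new_tokens=4):
    """Greedy loop: pick the highest-scoring token among those allowed."""
    flt.begin("")
    out = []
    for _ in range(max_new_tokens):
        allowed, end_tokens = flt.next()
        token = max(allowed, key=lambda t: scores[t])
        flt.feed(token)
        if token in end_tokens:
            break
        out.append(TOY_VOCAB[token])
    return out


# "banana" has the highest raw score but is never allowed by the filter.
scores = {0: 0.1, 1: 0.3, 2: 0.2, 3: 0.05, 4: 0.9}
result = toy_generate(ToyFilter({"yes", "no", "maybe"}), scores)
print(result)
```

The filter never sees the scores. It only narrows the candidate set, and the sampler (here a plain max) chooses within it.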
Usage

Use ExLlamaV2Filter as the base class when implementing custom token filters. All built-in filters (ExLlamaV2PrefixFilter, ExLlamaV2SelectFilter, ExLlamaV2GrammarFilter, etc.) extend this class. Instances are passed to ExLlamaV2DynamicJob or sampler settings via the filters parameter.

Code Reference

Source Location

Signature

class ExLlamaV2Filter:

    model: ExLlamaV2
    tokenizer: ExLlamaV2Tokenizer
    sequence_str: str
    background_result: Future | None
    allow_return_type_list: bool = True

    def __init__(self,
                 model: ExLlamaV2,
                 tokenizer: ExLlamaV2Tokenizer):
        ...

    def clone(self, c=None) -> ExLlamaV2Filter:
        ...

    def begin(self, prefix_str: str) -> None:
        ...

    def feed(self, token) -> None:
        ...

    def next(self) -> tuple[set | list, set | list]:
        ...

    def use_background_worker(self) -> bool:
        ...

    def can_mask_logits(self) -> bool:
        ...

    def prepare_logit_mask(self) -> bool:
        ...

    def mask_logits(self, logits: torch.Tensor) -> torch.Tensor:
        ...

    def background_next(self, pool: ThreadPoolExecutor) -> None:
        ...

    def background_drop(self) -> None:
        ...

    def get_next(self, mask: bool = False) -> tuple:
        ...

Import

from exllamav2.generator.filters import ExLlamaV2Filter

I/O Contract

Inputs

Name | Type | Required | Description
model | ExLlamaV2 | Yes | The loaded ExLlamaV2 model instance
tokenizer | ExLlamaV2Tokenizer | Yes | The tokenizer associated with the model
prefix_str | str | Yes (begin) | String prefix already generated before the filter begins constraining
token | int or tensor | Yes (feed) | The token ID selected by the sampler at each step
logits | torch.Tensor | Yes (mask_logits) | The raw logit tensor to mask in-place
pool | ThreadPoolExecutor | Yes (background_next) | Thread pool for scheduling asynchronous filter computation
mask | bool | No (get_next, default False) | If True, calls prepare_logit_mask() instead of next()
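The pool input exists so that filter work can overlap with the model's forward pass. Here is a rough sketch of that pattern using only the standard library. SlowToyFilter is a hypothetical stand-in that mirrors the background_next(), get_next(), and background_drop() roles described on this page. The mask flag is omitted, and waiting for the worker inside background_drop() is a choice made for this sketch, not a statement about the library.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class SlowToyFilter:
    """Hypothetical stand-in whose next() is expensive enough to run off-thread."""

    def __init__(self):
        self.background_result = None

    def next(self):
        time.sleep(0.05)  # Pretend this is CPU-heavy grammar work.
        return {1, 2, 3}, set()

    def background_next(self, pool):
        # Schedule next() on the pool and remember the pending Future.
        assert self.background_result is None
        self.background_result = pool.submit(self.next)

    def background_drop(self):
        # Discard a pending result that is no longer needed.
        if self.background_result is not None:
            self.background_result.result()
            self.background_result = None

    def get_next(self):
        # Use the pending result if one was scheduled, else compute inline.
        if self.background_result is None:
            return self.next()
        result = self.background_result.result()
        self.background_result = None
        return result


flt = SlowToyFilter()
with ThreadPoolExecutor(max_workers=1) as pool:
    flt.background_next(pool)        # Start the filter work...
    forward_pass = sum(range(1000))  # ...while the "model" does its own work.
    allowed, end_tokens = flt.get_next()

    flt.background_next(pool)
    flt.background_drop()            # Pending result is discarded.
    dropped = flt.background_result is None
```

Because get_next() falls back to an inline call when nothing is pending, callers can use the same code path for filters that do and do not opt into the background worker.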

Outputs

Name | Type | Description
allowed_tokens, end_tokens | tuple[set or list, set or list] | From next(): the set of allowed token IDs and the set of tokens that would complete the constraint
use_background | bool | From use_background_worker(): whether to run next() asynchronously
can_mask | bool | From can_mask_logits(): whether the filter supports direct logit masking
masked_logits | torch.Tensor | From mask_logits(): logit tensor with disallowed positions set to -inf
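To illustrate why setting disallowed positions to -inf works, the sketch below applies the idea to a plain Python list instead of a torch.Tensor. mask_logits_sketch and softmax are hypothetical helpers written for this page. Unlike the in-place masking described above, this version returns a new list to keep the example easy to inspect.

```python
import math


def mask_logits_sketch(logits, allowed_ids):
    """Return a copy of logits with every disallowed position set to -inf."""
    return [
        value if index in allowed_ids else -math.inf
        for index, value in enumerate(logits)
    ]


def softmax(logits):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, 1.0, 3.0]
masked = mask_logits_sketch(logits, allowed_ids={1, 2})
probs = softmax(masked)
```

After masking, the disallowed tokens receive exactly zero probability and the remaining mass is renormalised over the allowed tokens, so no sampling strategy can select an excluded token.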

Usage Examples

Implementing a Custom Filter

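from exllamav2.generator.filters import ExLlamaV2Filter

class MyCustomFilter(ExLlamaV2Filter):
    """Restricts sampling to the tokens that make up a fixed list of words."""

    def __init__(self, model, tokenizer, allowed_words):
        super().__init__(model, tokenizer)
        # Pre-encode the allowed words. A word may span several tokens, so this
        # collects every constituent token ID; it does not enforce token order.
        self.allowed_ids = set()
        for word in allowed_words:
            ids = tokenizer.encode(word, add_bos=False)
            self.allowed_ids.update(ids[0].tolist())

    def clone(self, c=None):
        # Copy subclass state on top of what the base class copies,
        # following the clone(c=None) signature shown above.
        if c is None:
            c = MyCustomFilter.__new__(MyCustomFilter)
        super().clone(c)
        c.allowed_ids = self.allowed_ids
        return c

    def begin(self, prefix_str):
        pass  # No per-generation state to reset

    def feed(self, token):
        pass  # No state to update after each token

    def next(self):
        # (allowed token IDs, end token IDs). The end set is empty, so this
        # filter never marks the constraint as complete.
        return self.allowed_ids, set()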

Using a Filter with a DynamicJob

from exllamav2.generator import ExLlamaV2DynamicJob

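# Assumes model, tokenizer, generator, input_ids, and gen_settings
# have already been created elsewhere.
my_filter = MyCustomFilter(model, tokenizer, ["yes", "no", "maybe"])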

job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    gen_settings=gen_settings,
    max_new_tokens=10,
    filters=[my_filter],
)
generator.enqueue(job)

Related Pages

Implements Principle

Extended By

Used By
