Implementation:Turboderp_org_Exllamav2_ExLlamaV2Filter
| Knowledge Sources | |
|---|---|
| Domains | Filtering, Constrained_Generation |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Abstract base class for all token filters in ExLlamaV2's generation pipeline, defining the interface for constraining token selection during autoregressive decoding.
Description
ExLlamaV2Filter establishes the contract that every token filter must implement. Filters restrict which tokens the sampler may select at each generation step, enabling constrained generation patterns such as grammar enforcement, prefix matching, and multiple-choice selection.
Key methods:
- __init__(model, tokenizer) -- Stores references to the ExLlamaV2 model and ExLlamaV2Tokenizer, initialises sequence_str to an empty string and background_result to None.
- clone(c=None) -- Creates a shallow copy of the filter, preserving model, tokenizer, and sequence state. Subclasses override this to copy their own state.
- begin(prefix_str) -- Abstract method called at the start of each generation to reset filter state. Must be overridden by subclasses.
- feed(token) -- Abstract method called after each token is selected, allowing the filter to update its internal state. Must be overridden by subclasses.
- next() -- Abstract method that returns a tuple of (allowed_token_ids, end_token_ids) for the current state. Must be overridden by subclasses.
- use_background_worker() -- Returns False by default. Subclasses return True to indicate that next() should be scheduled asynchronously to overlap with CUDA forward passes. Recommended for CPU-intensive filters like grammar constraints.
- can_mask_logits() -- Returns False by default. Subclasses return True if they can directly manipulate the logit tensor rather than returning token ID sets.
- prepare_logit_mask() -- Called in place of next() to precompute a logit mask when can_mask_logits() is True.
- mask_logits(logits) -- Directly sets excluded logits to negative infinity in the provided tensor.
- background_next(pool) -- Submits next() to a ThreadPoolExecutor for asynchronous execution.
- background_drop() -- Clears any pending asynchronous result, used when a filter forces EOS selection.
- get_next(mask=False) -- Returns the result of next() (or prepare_logit_mask()) either directly or from a pending background computation.
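The contract above can be sketched as a minimal abstract base plus a toy subclass. This is a hypothetical simplification for illustration only: the real class in base.py also carries the logit-masking and background-worker plumbing, and `DigitsOnlyFilter` is not part of ExLlamaV2.

```python
# Hypothetical sketch of the filter contract; not the actual ExLlamaV2 source.
class FilterBase:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.sequence_str = ""          # text accepted so far
        self.background_result = None   # pending async result, if any

    def begin(self, prefix_str):
        raise NotImplementedError()     # reset state at generation start

    def feed(self, token):
        raise NotImplementedError()     # advance state after each token

    def next(self):
        raise NotImplementedError()     # -> (allowed_token_ids, end_token_ids)

    def use_background_worker(self):
        return False                    # subclasses opt in for CPU-heavy work

    def can_mask_logits(self):
        return False                    # subclasses opt in for direct masking


class DigitsOnlyFilter(FilterBase):
    """Toy subclass: only token IDs 0-9 are allowed, no end tokens."""

    def begin(self, prefix_str):
        self.sequence_str = prefix_str

    def feed(self, token):
        self.sequence_str += str(token)

    def next(self):
        return set(range(10)), set()


f = DigitsOnlyFilter(model=None, tokenizer=None)
f.begin("")
allowed, end = f.next()
```

The key design point is that `next()` is pull-based: the sampler asks the filter for the currently legal token set at each step, rather than the filter pushing constraints ahead of time.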
Usage
Use ExLlamaV2Filter as the base class when implementing custom token filters. All built-in filters (ExLlamaV2PrefixFilter, ExLlamaV2SelectFilter, ExLlamaV2GrammarFilter, etc.) extend this class. Instances are passed to ExLlamaV2DynamicJob or sampler settings via the filters parameter.
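The background-worker path (`use_background_worker()`, `background_next()`, `get_next()`) can be illustrated in isolation with the standard library. This is a hedged sketch of the scheduling pattern, not the generator's actual code; `SlowFilter` is invented for the example, and the real `get_next()` also accepts a `mask` argument.

```python
from concurrent.futures import ThreadPoolExecutor

class SlowFilter:
    """Stand-in for a CPU-heavy filter such as a grammar constraint."""

    def __init__(self):
        self.background_result = None

    def next(self):
        # Imagine an expensive grammar walk here.
        return {1, 2, 3}, set()

    def use_background_worker(self):
        return True

    def background_next(self, pool):
        # Schedule next() so it can overlap with the model's forward pass.
        self.background_result = pool.submit(self.next)

    def get_next(self):
        # Collect the background result if one is pending, else compute now.
        if self.background_result is not None:
            result = self.background_result.result()
            self.background_result = None
            return result
        return self.next()


pool = ThreadPoolExecutor(max_workers=1)
filt = SlowFilter()
filt.background_next(pool)      # kicked off before the forward pass
# ... the CUDA forward pass would run here, hiding the filter's latency ...
allowed, end = filt.get_next()  # collected once logits are ready
pool.shutdown()
```

This overlap is why `use_background_worker()` is recommended for grammar-style filters: their per-step cost can be hidden behind the GPU work instead of serialising with it.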
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/generator/filters/base.py
- Lines: L1-119
Signature
class ExLlamaV2Filter:

    model: ExLlamaV2
    tokenizer: ExLlamaV2Tokenizer
    sequence_str: str
    background_result: Future | None
    allow_return_type_list: bool = True

    def __init__(self,
                 model: ExLlamaV2,
                 tokenizer: ExLlamaV2Tokenizer):
        ...

    def clone(self, c=None) -> ExLlamaV2Filter:
        ...

    def begin(self, prefix_str: str) -> None:
        ...

    def feed(self, token) -> None:
        ...

    def next(self) -> tuple[set | list, set | list]:
        ...

    def use_background_worker(self) -> bool:
        ...

    def can_mask_logits(self) -> bool:
        ...

    def prepare_logit_mask(self) -> bool:
        ...

    def mask_logits(self, logits: torch.Tensor) -> torch.Tensor:
        ...

    def background_next(self, pool: ThreadPoolExecutor) -> None:
        ...

    def get_next(self, mask: bool = False) -> tuple:
        ...
Import
from exllamav2.generator.filters import ExLlamaV2Filter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | The loaded ExLlamaV2 model instance |
| tokenizer | ExLlamaV2Tokenizer | Yes | The tokenizer associated with the model |
| prefix_str | str | Yes (begin) | String prefix already generated before the filter begins constraining |
| token | int or tensor | Yes (feed) | The token ID selected by the sampler at each step |
| logits | torch.Tensor | Yes (mask_logits) | The raw logit tensor to mask in-place |
| pool | ThreadPoolExecutor | Yes (background_next) | Thread pool for scheduling asynchronous filter computation |
| mask | bool | No (get_next, default False) | If True, calls prepare_logit_mask() instead of next() |
Outputs
| Name | Type | Description |
|---|---|---|
| allowed_tokens, end_tokens | tuple[set or list, set or list] | From next(): the set of allowed token IDs and the set of tokens that would complete the constraint |
| use_background | bool | From use_background_worker(): whether to run next() asynchronously |
| can_mask | bool | From can_mask_logits(): whether the filter supports direct logit masking |
| masked_logits | torch.Tensor | From mask_logits(): logit tensor with disallowed positions set to -inf |
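The masking contract in the last row can be shown concretely. The sketch below is torch-free for portability (the real `mask_logits()` operates in place on a `torch.Tensor`), and `mask_logits` here is a hypothetical stand-in, not the library function:

```python
import math

def mask_logits(logits, allowed_ids):
    # Set every disallowed position to -inf so that softmax assigns it
    # zero probability; allowed positions keep their raw scores.
    for i in range(len(logits)):
        if i not in allowed_ids:
            logits[i] = -math.inf
    return logits

logits = [0.5, 2.0, -1.0, 3.2]
masked = mask_logits(logits, allowed_ids={1, 3})
# positions 0 and 2 are now -inf; positions 1 and 3 are untouched
```

Masking at `-inf` (rather than, say, a large negative constant) guarantees the excluded tokens receive exactly zero probability regardless of temperature or other sampler settings.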
Usage Examples
Implementing a Custom Filter
from exllamav2.generator.filters import ExLlamaV2Filter

class MyCustomFilter(ExLlamaV2Filter):

    def __init__(self, model, tokenizer, allowed_words):
        super().__init__(model, tokenizer)
        # Pre-encode allowed words to token IDs
        self.allowed_ids = set()
        for word in allowed_words:
            ids = tokenizer.encode(word, add_bos=False)
            for token_id in ids[0].tolist():
                self.allowed_ids.add(token_id)

    def begin(self, prefix_str):
        pass  # No state to reset

    def feed(self, token):
        pass  # No state to update

    def next(self):
        return self.allowed_ids, set()
Using a Filter with a DynamicJob
from exllamav2.generator import ExLlamaV2DynamicJob

my_filter = MyCustomFilter(model, tokenizer, ["yes", "no", "maybe"])

job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    gen_settings=gen_settings,
    max_new_tokens=10,
    filters=[my_filter],
)
generator.enqueue(job)
Related Pages
Extended By
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2PrefixFilter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2SelectFilter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2TokenEnforcerFilter