Implementation: ExLlamaV2TokenEnforcerFilter (Turboderp_org_Exllamav2)
| Knowledge Sources | |
|---|---|
| Domains | Constrained_Generation, Filtering |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Compatibility wrapper that integrates the lm-format-enforcer (LMFE) library's TokenEnforcer with ExLlamaV2's filter system, enabling structured-output constraints such as JSON schema enforcement during generation.
Description
ExLlamaV2TokenEnforcerFilter is a subclass of ExLlamaV2Filter that bridges ExLlamaV2's token-filtering interface and LMFE's TokenEnforcer. It maintains an internal token_sequence list of all generated token IDs and delegates allowed-token computation to the LMFE engine.
Key components:
- __init__(model, tokenizer, character_level_parser) -- Accepts an ExLlamaV2 model, ExLlamaV2Tokenizer, and an LMFE CharacterLevelParser (e.g., a JSON schema parser). It builds LMFE tokenizer data via _get_lmfe_tokenizer_data() and creates a TokenEnforcer instance.
- _get_lmfe_tokenizer_data(tokenizer) -- Module-level function decorated with @lru_cache(10) that calls build_token_enforcer_tokenizer_data() from LMFE. The LRU cache ensures tokenizer data is computed only once per tokenizer instance.
- begin(prefix_str) -- Resets the internal token_sequence to an empty list at the start of each generation.
- feed(token) -- Appends the integer value of the generated token to token_sequence.
- next() -- Queries self.token_enforcer.get_allowed_tokens(self.token_sequence) and returns a tuple of (sorted allowed token IDs, empty list).
- use_background_worker() -- Returns True, indicating that the LMFE constraint computation should run asynchronously to overlap with CUDA forward passes.
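The begin/feed/next lifecycle above can be sketched as a minimal, self-contained stand-in. StubEnforcer below is a hypothetical placeholder for LMFE's TokenEnforcer (the real class derives allowed tokens from the parser state); only the filter's bookkeeping mechanics are shown:

```python
from typing import List, Tuple

class StubEnforcer:
    """Hypothetical stand-in for lmformatenforcer's TokenEnforcer.
    Always allows token IDs 5, 1, 3 regardless of the sequence."""
    def get_allowed_tokens(self, token_sequence: List[int]) -> List[int]:
        return [5, 1, 3]

class TokenEnforcerFilterSketch:
    """Mirrors the begin/feed/next lifecycle described above."""
    def __init__(self, enforcer: StubEnforcer):
        self.token_enforcer = enforcer
        self.token_sequence: List[int] = []

    def begin(self, prefix_str: str) -> None:
        # Reset state at the start of each generation
        self.token_sequence = []

    def feed(self, token: int) -> None:
        # Record each generated token ID
        self.token_sequence.append(token)

    def next(self) -> Tuple[List[int], List[int]]:
        # Sorted allowed token IDs, plus an empty end-token list
        allowed = self.token_enforcer.get_allowed_tokens(self.token_sequence)
        return sorted(allowed), []

f = TokenEnforcerFilterSketch(StubEnforcer())
f.begin("")
f.feed(7)
print(f.next())  # → ([1, 3, 5], [])
```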
Usage
Use ExLlamaV2TokenEnforcerFilter when you need to constrain ExLlamaV2 generation output to conform to a grammar, JSON schema, or other structured format defined via an LMFE CharacterLevelParser. Pass an instance as a filter to ExLlamaV2DynamicJob or ExLlamaV2Sampler.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: examples/inference_lmfe_wrapper.py
- Lines: L1-47
Signature
@lru_cache(10)
def _get_lmfe_tokenizer_data(tokenizer: ExLlamaV2Tokenizer):
    ...

class ExLlamaV2TokenEnforcerFilter(ExLlamaV2Filter):

    token_sequence: List[int]

    def __init__(
        self,
        model: ExLlamaV2,
        tokenizer: ExLlamaV2Tokenizer,
        character_level_parser: CharacterLevelParser,
    ):
        ...

    def begin(self, prefix_str: str) -> None:
        ...

    def feed(self, token) -> None:
        ...

    def next(self):
        ...

    def use_background_worker(self):
        ...
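The @lru_cache(10) decorator caches results keyed on the tokenizer argument, so repeatedly constructing filters against the same tokenizer object builds the LMFE tokenizer data only once. A toy illustration of that caching behavior (build_data is a hypothetical stand-in for build_token_enforcer_tokenizer_data):

```python
from functools import lru_cache

calls = []

@lru_cache(10)
def build_data(tokenizer):
    # Stand-in for LMFE's build_token_enforcer_tokenizer_data();
    # the real call is expensive, which is why caching matters.
    calls.append(tokenizer)
    return {"vocab_size": 32000}

class Tok:
    # Hashable by identity, like a real tokenizer object
    pass

t = Tok()
build_data(t)
build_data(t)      # cache hit: the builder body runs only once
print(len(calls))  # → 1
```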
Import
from examples.inference_lmfe_wrapper import ExLlamaV2TokenEnforcerFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | The loaded ExLlamaV2 model instance |
| tokenizer | ExLlamaV2Tokenizer | Yes | The tokenizer associated with the model |
| character_level_parser | CharacterLevelParser | Yes | LMFE parser defining the output constraint (e.g., JsonSchemaParser) |
| prefix_str | str | Yes (begin) | Prefix string passed at the start of generation (used to reset state) |
| token | tensor | Yes (feed) | Single generated token tensor; integer value extracted via token[0][0] |
Outputs
| Name | Type | Description |
|---|---|---|
| allowed_tokens | tuple[list[int], list] | From next(): a tuple of (sorted list of allowed token IDs, empty list of end tokens) |
| use_background | bool | From use_background_worker(): always True |
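Downstream, a sampler typically turns the (allowed, end) tuple returned by next() into a logit mask before sampling. A minimal sketch of that consumption, using a toy 8-token vocabulary and plain Python lists (the real sampler operates on GPU tensors):

```python
import math

def apply_filter_mask(logits, allowed):
    # Disallowed tokens get -inf so softmax assigns them zero probability
    allowed_set = set(allowed)
    return [x if i in allowed_set else -math.inf
            for i, x in enumerate(logits)]

logits = [0.5, 1.2, -0.3, 2.0, 0.0, 0.7, 1.1, -1.0]
allowed, end_tokens = [1, 3, 5], []   # shape of next()'s return value
masked = apply_filter_mask(logits, allowed)

# Greedy pick: highest-logit token among the allowed set
best = max(range(len(masked)), key=masked.__getitem__)
print(best)  # → 3
```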
Usage Examples
JSON Schema Constrained Generation
# Assumes model, tokenizer, gen_settings, and input_ids have been
# initialized elsewhere.
from exllamav2.generator import ExLlamaV2DynamicJob
from examples.inference_lmfe_wrapper import ExLlamaV2TokenEnforcerFilter
from lmformatenforcer import JsonSchemaParser

# Define a JSON schema constraint
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Create the filter from an LMFE JSON schema parser
parser = JsonSchemaParser(schema)
lmfe_filter = ExLlamaV2TokenEnforcerFilter(model, tokenizer, parser)

# Attach the filter to a dynamic-generator job
job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    gen_settings=gen_settings,
    max_new_tokens=256,
    filters=[lmfe_filter],
)