Implementation: ExLlamaV2TokenEnforcerFilter (Turboderp_org_Exllamav2)
| Knowledge Sources | |
|---|---|
| Domains | Constrained_Generation, Filtering |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Compatibility wrapper that integrates the lm-format-enforcer (LMFE) library's TokenEnforcer with ExLlamaV2's filter system, enabling structured-output constraints such as JSON schema enforcement during generation.
Description
ExLlamaV2TokenEnforcerFilter is a subclass of ExLlamaV2Filter that bridges ExLlamaV2's token-filtering interface and LMFE's TokenEnforcer. It maintains an internal token_sequence list of all generated token IDs and delegates allowed-token computation to the LMFE engine.
Key components:
- __init__(model, tokenizer, character_level_parser) -- Accepts an ExLlamaV2 model, ExLlamaV2Tokenizer, and an LMFE CharacterLevelParser (e.g., a JSON schema parser). It builds LMFE tokenizer data via _get_lmfe_tokenizer_data() and creates a TokenEnforcer instance.
- _get_lmfe_tokenizer_data(tokenizer) -- Module-level function decorated with @lru_cache(10) that calls build_token_enforcer_tokenizer_data() from LMFE. The LRU cache ensures tokenizer data is computed only once per tokenizer instance.
- begin(prefix_str) -- Resets the internal token_sequence to an empty list at the start of each generation.
- feed(token) -- Appends the integer value of the generated token to token_sequence.
- next() -- Queries self.token_enforcer.get_allowed_tokens(self.token_sequence) and returns a tuple of (sorted allowed token IDs, empty list).
- use_background_worker() -- Returns True, indicating that the LMFE constraint computation should run asynchronously to overlap with CUDA forward passes.
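The begin/feed/next lifecycle above can be sketched as a minimal, self-contained stand-in. StubEnforcer below is a hypothetical placeholder for LMFE's TokenEnforcer (the real class derives allowed tokens from the parser state); only the filter's bookkeeping mechanics are shown:

```python
from typing import List, Tuple

class StubEnforcer:
    """Hypothetical stand-in for lmformatenforcer's TokenEnforcer.
    Always allows token IDs 5, 1, 3 regardless of the sequence."""
    def get_allowed_tokens(self, token_sequence: List[int]) -> List[int]:
        return [5, 1, 3]

class TokenEnforcerFilterSketch:
    """Mirrors the begin/feed/next lifecycle described above."""
    def __init__(self, enforcer: StubEnforcer):
        self.token_enforcer = enforcer
        self.token_sequence: List[int] = []

    def begin(self, prefix_str: str) -> None:
        # Reset state at the start of each generation
        self.token_sequence = []

    def feed(self, token: int) -> None:
        # Record each generated token ID
        self.token_sequence.append(token)

    def next(self) -> Tuple[List[int], List[int]]:
        # Sorted allowed token IDs, plus an empty end-token list
        allowed = self.token_enforcer.get_allowed_tokens(self.token_sequence)
        return sorted(allowed), []

f = TokenEnforcerFilterSketch(StubEnforcer())
f.begin("")
f.feed(7)
print(f.next())  # → ([1, 3, 5], [])
```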
Usage
Use ExLlamaV2TokenEnforcerFilter when you need to constrain ExLlamaV2 generation output to conform to a grammar, JSON schema, or other structured format defined via an LMFE CharacterLevelParser. Pass an instance as a filter to ExLlamaV2DynamicJob or ExLlamaV2Sampler.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: examples/inference_lmfe_wrapper.py
- Lines: L1-47
Signature
@lru_cache(10)
def _get_lmfe_tokenizer_data(tokenizer: ExLlamaV2Tokenizer):
    ...

class ExLlamaV2TokenEnforcerFilter(ExLlamaV2Filter):

    token_sequence: List[int]

    def __init__(
        self,
        model: ExLlamaV2,
        tokenizer: ExLlamaV2Tokenizer,
        character_level_parser: CharacterLevelParser,
    ):
        ...

    def begin(self, prefix_str: str) -> None:
        ...

    def feed(self, token) -> None:
        ...

    def next(self):
        ...

    def use_background_worker(self):
        ...
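The @lru_cache(10) decorator caches results keyed on the tokenizer argument, so repeatedly constructing filters against the same tokenizer object builds the LMFE tokenizer data only once. A toy illustration of that caching behavior (build_data is a hypothetical stand-in for build_token_enforcer_tokenizer_data):

```python
from functools import lru_cache

calls = []

@lru_cache(10)
def build_data(tokenizer):
    # Stand-in for LMFE's build_token_enforcer_tokenizer_data();
    # the real call is expensive, which is why caching matters.
    calls.append(tokenizer)
    return {"vocab_size": 32000}

class Tok:
    # Hashable by identity, like a real tokenizer object
    pass

t = Tok()
build_data(t)
build_data(t)      # cache hit: the builder body runs only once
print(len(calls))  # → 1
```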
Import
from examples.inference_lmfe_wrapper import ExLlamaV2TokenEnforcerFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | The loaded ExLlamaV2 model instance |
| tokenizer | ExLlamaV2Tokenizer | Yes | The tokenizer associated with the model |
| character_level_parser | CharacterLevelParser | Yes | LMFE parser defining the output constraint (e.g., JsonSchemaParser) |
| prefix_str | str | Yes (begin) | Prefix string passed at the start of generation (used to reset state) |
| token | tensor | Yes (feed) | Single generated token tensor; integer value extracted via token[0][0] |
Outputs
| Name | Type | Description |
|---|---|---|
| allowed_tokens | tuple[list[int], list] | From next(): a tuple of (sorted list of allowed token IDs, empty list of end tokens) |
| use_background | bool | From use_background_worker(): always True |
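Downstream, a sampler typically turns the (allowed, end) tuple returned by next() into a logit mask before sampling. A minimal sketch of that consumption, using a toy 8-token vocabulary and plain Python lists (the real sampler operates on GPU tensors):

```python
import math

def apply_filter_mask(logits, allowed):
    # Disallowed tokens get -inf so softmax assigns them zero probability
    allowed_set = set(allowed)
    return [x if i in allowed_set else -math.inf
            for i, x in enumerate(logits)]

logits = [0.5, 1.2, -0.3, 2.0, 0.0, 0.7, 1.1, -1.0]
allowed, end_tokens = [1, 3, 5], []   # shape of next()'s return value
masked = apply_filter_mask(logits, allowed)

# Greedy pick: highest-logit token among the allowed set
best = max(range(len(masked)), key=masked.__getitem__)
print(best)  # → 3
```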
Usage Examples
JSON Schema Constrained Generation
# Assumes model, tokenizer, gen_settings, and input_ids have been
# initialized elsewhere.
from exllamav2.generator import ExLlamaV2DynamicJob
from examples.inference_lmfe_wrapper import ExLlamaV2TokenEnforcerFilter
from lmformatenforcer import JsonSchemaParser

# Define a JSON schema constraint
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# Create the filter from an LMFE JSON schema parser
parser = JsonSchemaParser(schema)
lmfe_filter = ExLlamaV2TokenEnforcerFilter(model, tokenizer, parser)

# Attach the filter to a dynamic-generator job
job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    gen_settings=gen_settings,
    max_new_tokens=256,
    filters=[lmfe_filter],
)