
Implementation:Turboderp org Exllamav2 ExLlamaV2TokenEnforcerFilter

From Leeroopedia
Knowledge Sources
Domains Constrained_Generation, Filtering
Last Updated 2026-02-15 00:00 GMT

Overview

Compatibility wrapper that integrates the lm-format-enforcer (LMFE) library's TokenEnforcer with ExLlamaV2's filter system, enabling structured-output constraints such as JSON schema enforcement during generation.

Description

ExLlamaV2TokenEnforcerFilter is a subclass of ExLlamaV2Filter that bridges ExLlamaV2's token-filtering interface and LMFE's TokenEnforcer. It maintains an internal token_sequence list tracking all generated token IDs and delegates allowed-token computation to the LMFE engine.

Key components:

  • __init__(model, tokenizer, character_level_parser) -- Accepts an ExLlamaV2 model, ExLlamaV2Tokenizer, and an LMFE CharacterLevelParser (e.g., a JSON schema parser). It builds LMFE tokenizer data via _get_lmfe_tokenizer_data() and creates a TokenEnforcer instance.
  • _get_lmfe_tokenizer_data(tokenizer) -- Module-level function decorated with @lru_cache(10) that calls build_token_enforcer_tokenizer_data() from LMFE. The LRU cache (up to 10 entries) ensures the expensive tokenizer-data build runs only once per tokenizer instance rather than on every filter construction.
  • begin(prefix_str) -- Resets the internal token_sequence to an empty list at the start of each generation.
  • feed(token) -- Appends the integer value of the generated token to token_sequence.
  • next() -- Queries self.token_enforcer.get_allowed_tokens(self.token_sequence) and returns a tuple of (sorted allowed token IDs, empty list).
  • use_background_worker() -- Returns True, indicating that the LMFE constraint computation should run asynchronously to overlap with CUDA forward passes.
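The control flow of these methods can be sketched with a stub in place of LMFE's TokenEnforcer. The names StubEnforcer and SketchTokenEnforcerFilter below are illustrative stand-ins, not the real implementation:

```python
# Hypothetical stand-in for lmformatenforcer's TokenEnforcer; the real
# class derives allowed tokens from a CharacterLevelParser.
class StubEnforcer:
    def get_allowed_tokens(self, token_sequence):
        # Pretend only even token IDs are allowed after any prefix.
        return {8, 2, 6, 0, 4}

# Minimal sketch of the filter's state machine.
class SketchTokenEnforcerFilter:
    def __init__(self, enforcer):
        self.token_enforcer = enforcer
        self.token_sequence = []

    def begin(self, prefix_str):
        # Reset per-generation state at the start of each generation.
        self.token_sequence = []

    def feed(self, token):
        # The real filter receives a (1, 1) tensor and reads token[0][0].
        self.token_sequence.append(int(token))

    def next(self):
        # Return (sorted allowed token IDs, empty list of end tokens).
        allowed = self.token_enforcer.get_allowed_tokens(self.token_sequence)
        return sorted(allowed), []

f = SketchTokenEnforcerFilter(StubEnforcer())
f.begin("")
f.feed(4)
allowed, end = f.next()
```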

Usage

Use ExLlamaV2TokenEnforcerFilter when you need to constrain ExLlamaV2 generation output to conform to a grammar, JSON schema, or other structured format defined via an LMFE CharacterLevelParser. Pass an instance as a filter to ExLlamaV2DynamicJob or ExLlamaV2Sampler.

Code Reference

Source Location

Signature

@lru_cache(10)
def _get_lmfe_tokenizer_data(tokenizer: ExLlamaV2Tokenizer):
    ...

class ExLlamaV2TokenEnforcerFilter(ExLlamaV2Filter):

    token_sequence: List[int]

    def __init__(
        self,
        model: ExLlamaV2,
        tokenizer: ExLlamaV2Tokenizer,
        character_level_parser: CharacterLevelParser,
    ):
        ...

    def begin(self, prefix_str: str) -> None:
        ...

    def feed(self, token) -> None:
        ...

    def next(self):
        ...

    def use_background_worker(self):
        ...

Import

from examples.inference_lmfe_wrapper import ExLlamaV2TokenEnforcerFilter

I/O Contract

Inputs

  • model (ExLlamaV2, required) -- The loaded ExLlamaV2 model instance.
  • tokenizer (ExLlamaV2Tokenizer, required) -- The tokenizer associated with the model.
  • character_level_parser (CharacterLevelParser, required) -- LMFE parser defining the output constraint (e.g., JsonSchemaParser).
  • prefix_str (str, required by begin()) -- Prefix string passed at the start of generation; used to reset internal state.
  • token (tensor, required by feed()) -- Single generated token tensor; the integer value is extracted via token[0][0].

Outputs

  • allowed_tokens (tuple[list[int], list]) -- Returned by next(): a tuple of (sorted list of allowed token IDs, empty list of end tokens).
  • use_background (bool) -- Returned by use_background_worker(): always True.
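The token argument to feed() arrives as a (1, 1) tensor, from which the filter reads the scalar via token[0][0]. A nested list serves as a stand-in to show the extraction:

```python
# Stand-in for the (1, 1) token tensor that feed() receives;
# the filter extracts the scalar token ID with token[0][0].
token = [[1234]]
token_id = int(token[0][0])
```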

Usage Examples

JSON Schema Constrained Generation

from examples.inference_lmfe_wrapper import ExLlamaV2TokenEnforcerFilter
from lmformatenforcer import JsonSchemaParser

# Define a JSON schema constraint
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"}
    },
    "required": ["name", "age"]
}

# Create the filter
parser = JsonSchemaParser(schema)
lmfe_filter = ExLlamaV2TokenEnforcerFilter(model, tokenizer, parser)

# Use with ExLlamaV2DynamicJob
job = ExLlamaV2DynamicJob(
    input_ids=input_ids,
    gen_settings=gen_settings,
    max_new_tokens=256,
    filters=[lmfe_filter],
)
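Because the filter restricts sampling to schema-conforming tokens, the finished generation parses as JSON satisfying the schema. The string below is illustrative sample output, not actual model output:

```python
import json

# Illustrative output of a generation constrained by the schema above;
# the filter ensures the text parses as JSON with the required keys.
sample_output = '{"name": "Ada", "age": 36}'

data = json.loads(sample_output)
assert {"name", "age"} <= set(data)
assert isinstance(data["name"], str) and isinstance(data["age"], int)
```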

Related Pages

Implements Principle

Extends

Depends On
