Principle:Turboderp org Exllamav2 Constrained Generation
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, Constrained_Decoding, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Constrained generation restricts the set of tokens a language model can produce at each decoding step, ensuring output conforms to a desired format, grammar, or set of allowed completions.
Description
During unconstrained generation, a language model can produce any token from its vocabulary at each step. Constrained generation modifies the logits (pre-softmax scores) before sampling to mask out invalid tokens, forcing the output to match specific patterns:
- Token-level filtering: At each decoding step, a filter computes which tokens are valid continuations. Invalid tokens receive negative-infinity logits, making them impossible to sample. The filter maintains internal state that tracks the generation progress.
- Prefix-constrained generation: The model is forced to generate text that matches one of several predefined prefix strings. At each step, only tokens that are consistent with at least one allowed prefix are permitted. This is useful for classification tasks or structured extraction.
- Selection filtering: The model must choose exactly one option from a predefined set of strings. The filter constrains generation to tokens consistent with the remaining candidates, eliminating options character-by-character until a single choice is determined.
- Grammar-based filtering: External tools like LMFE (Language Model Format Enforcer) provide token masks based on formal grammars (JSON schema, regex, etc.). These integrate with ExLlamaV2's filter interface to enforce structured output like valid JSON.
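The token-level and selection mechanisms above can be sketched with a minimal stateful filter. This is an illustrative sketch only; the class and method names here are invented for the example and are not the ExLlamaV2 filter API:

```python
# Minimal sketch of a stateful selection filter (illustrative names,
# not the actual ExLlamaV2 interface).
class SelectFilter:
    """Constrains generation to exactly one of a fixed set of strings."""

    def __init__(self, options):
        self.options = list(options)   # remaining candidate strings
        self.emitted = ""              # text generated so far

    def allowed_tokens(self, vocab):
        """Return token ids that keep at least one candidate viable."""
        allowed = set()
        for token_id, piece in vocab.items():
            candidate = self.emitted + piece
            if any(opt.startswith(candidate) for opt in self.options):
                allowed.add(token_id)
        return allowed

    def feed(self, vocab, token_id):
        """Advance the filter state after a token is sampled."""
        self.emitted += vocab[token_id]
        # Prune candidates that no longer match the emitted prefix
        self.options = [o for o in self.options
                        if o.startswith(self.emitted)]


# Toy vocabulary mapping token ids to string pieces
vocab = {0: "pos", 1: "neg", 2: "itive", 3: "ative", 4: "neutral"}
f = SelectFilter(["positive", "negative", "neutral"])
print(sorted(f.allowed_tokens(vocab)))  # [0, 1, 4]
f.feed(vocab, 1)                        # commit "neg"
print(sorted(f.allowed_tokens(vocab)))  # [3] -- only "ative" continues
```

Any token id outside the allowed set would then be assigned a negative-infinity logit before sampling, which is the masking step described under Theoretical Basis below.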
Usage
Constrained generation is typically used for:
- JSON/structured output: Ensuring model output is valid JSON conforming to a schema
- Multiple-choice questions: Forcing the model to select from predefined options
- Classification: Constraining output to a set of class labels
- Template filling: Generating text that matches a specific template or format
- API responses: Ensuring generated content conforms to an API contract
Theoretical Basis
Logit Masking
# At each decoding step t:
# 1. Model produces logits z_i for all tokens i in vocabulary V
# 2. Filter computes allowed set A_t ⊆ V based on generation history
# 3. For each token i:
# if i ∈ A_t: z'_i = z_i (unchanged)
# if i ∉ A_t: z'_i = -inf (masked out)
# 4. Sample from masked distribution: softmax(z')
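The masking step above can be illustrated numerically in plain Python, with no framework dependencies (a sketch; real implementations apply the mask to a logits tensor in place):

```python
import math

def masked_softmax(logits, allowed):
    """Mask disallowed positions to -inf, then softmax the remainder."""
    masked = [z if i in allowed else float("-inf")
              for i, z in enumerate(logits)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(z - m) if z != float("-inf") else 0.0
            for z in masked]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 3.0]
probs = masked_softmax(logits, allowed={0, 2})
print(probs)  # tokens 1 and 3 get exactly zero probability
```

Because masked tokens carry zero probability mass, they can never be sampled regardless of the sampling temperature or strategy.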
Prefix Tree Matching
# Given allowed strings S = {s_1, s_2, ..., s_n}
# Build a trie (prefix tree) from all s_i
# At position t in generated output:
# A_t = {tokens whose encoding is consistent with at least one
# path in the trie from the current node}
# As tokens are generated, prune branches that no longer match
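A character-level version of this trie construction and pruning can be sketched as follows (helper names are illustrative, not from any library):

```python
# Character-level trie over the allowed strings. "Pruning" happens
# implicitly: descending the trie discards every branch that does
# not match the committed text. Names are illustrative.
def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-string marker
    return root

def advance(node, text):
    """Descend the trie along `text`; return None if no path matches."""
    for ch in text:
        if ch not in node:
            return None
        node = node[ch]
    return node

trie = build_trie(["cat", "car", "cow"])
node = advance(trie, "ca")  # "cat" and "car" remain viable
print(sorted(k for k in node if k != "$"))  # ['r', 't']
print(advance(trie, "cx"))  # None: no allowed string matches
```

A real tokenizer-aware filter performs the same descent, except each step consumes a multi-character token piece rather than a single character, so a token is allowed only if its entire piece traces a valid path from the current node.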
Related Pages
Implemented By
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2Filter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2PrefixFilter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2SelectFilter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2TokenEnforcerFilter
Related Principles