Principle:Turboderp org Exllamav2 Constrained Generation
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, Constrained_Decoding, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Constrained generation restricts the set of tokens a language model can produce at each decoding step, ensuring output conforms to a desired format, grammar, or set of allowed completions.
Description
During unconstrained generation, a language model can produce any token from its vocabulary at each step. Constrained generation modifies the logits (pre-softmax scores) before sampling to mask out invalid tokens, forcing the output to match specific patterns:
- Token-level filtering: At each decoding step, a filter computes which tokens are valid continuations. Invalid tokens receive negative-infinity logits, making them impossible to sample. The filter maintains internal state that tracks the generation progress.
- Prefix-constrained generation: The model is forced to generate text that matches one of several predefined prefix strings. At each step, only tokens that are consistent with at least one allowed prefix are permitted. This is useful for classification tasks or structured extraction.
- Selection filtering: The model must choose exactly one option from a predefined set of strings. The filter constrains generation to tokens consistent with the remaining candidates, eliminating options character-by-character until a single choice is determined.
- Grammar-based filtering: External tools like LMFE (Language Model Format Enforcer) provide token masks based on formal grammars (JSON schema, regex, etc.). These integrate with ExLlamaV2's filter interface to enforce structured output like valid JSON.
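The token-level and selection mechanisms above can be sketched with a minimal stateful filter. This is an illustrative sketch only; the class and method names here are invented for the example and are not the ExLlamaV2 filter API:

```python
# Minimal sketch of a stateful selection filter (illustrative names,
# not the actual ExLlamaV2 interface).
class SelectFilter:
    """Constrains generation to exactly one of a fixed set of strings."""

    def __init__(self, options):
        self.options = list(options)   # remaining candidate strings
        self.emitted = ""              # text generated so far

    def allowed_tokens(self, vocab):
        """Return token ids that keep at least one candidate viable."""
        allowed = set()
        for token_id, piece in vocab.items():
            candidate = self.emitted + piece
            if any(opt.startswith(candidate) for opt in self.options):
                allowed.add(token_id)
        return allowed

    def feed(self, vocab, token_id):
        """Advance the filter state after a token is sampled."""
        self.emitted += vocab[token_id]
        # Prune candidates that no longer match the emitted prefix
        self.options = [o for o in self.options
                        if o.startswith(self.emitted)]


# Toy vocabulary mapping token ids to string pieces
vocab = {0: "pos", 1: "neg", 2: "itive", 3: "ative", 4: "neutral"}
f = SelectFilter(["positive", "negative", "neutral"])
print(sorted(f.allowed_tokens(vocab)))  # [0, 1, 4]
f.feed(vocab, 1)                        # commit "neg"
print(sorted(f.allowed_tokens(vocab)))  # [3] -- only "ative" continues
```

Any token id outside the allowed set would then be assigned a negative-infinity logit before sampling, which is the masking step described under Theoretical Basis below.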
Usage
Constrained generation is typically used for:
- JSON/structured output: Ensuring model output is valid JSON conforming to a schema
- Multiple-choice questions: Forcing the model to select from predefined options
- Classification: Constraining output to a set of class labels
- Template filling: Generating text that matches a specific template or format
- API responses: Ensuring generated content conforms to an API contract
Theoretical Basis
Logit Masking
# At each decoding step t:
# 1. Model produces logits z_i for all tokens i in vocabulary V
# 2. Filter computes allowed set A_t ⊆ V based on generation history
# 3. For each token i:
# if i ∈ A_t: z'_i = z_i (unchanged)
# if i ∉ A_t: z'_i = -inf (masked out)
# 4. Sample from masked distribution: softmax(z')
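The masking step above can be illustrated numerically in plain Python, with no framework dependencies (a sketch; real implementations apply the mask to a logits tensor in place):

```python
import math

def masked_softmax(logits, allowed):
    """Mask disallowed positions to -inf, then softmax the remainder."""
    masked = [z if i in allowed else float("-inf")
              for i, z in enumerate(logits)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(z - m) if z != float("-inf") else 0.0
            for z in masked]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 3.0]
probs = masked_softmax(logits, allowed={0, 2})
print(probs)  # tokens 1 and 3 get exactly zero probability
```

Because masked tokens carry zero probability mass, they can never be sampled regardless of the sampling temperature or strategy.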
Prefix Tree Matching
# Given allowed strings S = {s_1, s_2, ..., s_n}
# Build a trie (prefix tree) from all s_i
# At position t in generated output:
# A_t = {tokens whose encoding is consistent with at least one
# path in the trie from the current node}
# As tokens are generated, prune branches that no longer match
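A character-level version of this trie construction and pruning can be sketched as follows (helper names are illustrative, not from any library):

```python
# Character-level trie over the allowed strings. "Pruning" happens
# implicitly: descending the trie discards every branch that does
# not match the committed text. Names are illustrative.
def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-string marker
    return root

def advance(node, text):
    """Descend the trie along `text`; return None if no path matches."""
    for ch in text:
        if ch not in node:
            return None
        node = node[ch]
    return node

trie = build_trie(["cat", "car", "cow"])
node = advance(trie, "ca")  # "cat" and "car" remain viable
print(sorted(k for k in node if k != "$"))  # ['r', 't']
print(advance(trie, "cx"))  # None: no allowed string matches
```

A real tokenizer-aware filter performs the same descent, except each step consumes a multi-character token piece rather than a single character, so a token is allowed only if its entire piece traces a valid path from the current node.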
Related Pages
Implemented By
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2Filter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2PrefixFilter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2SelectFilter
- Implementation:Turboderp_org_Exllamav2_ExLlamaV2TokenEnforcerFilter
Related Principles