Implementation:Mlc_ai_Web_llm_Grammar_Matcher_Decoding
Overview
Grammar Matcher Decoding is the concrete implementation of grammar-constrained token masking during LLM decoding in @mlc-ai/web-llm. It is implemented within the LLMChatPipeline class in src/llm_chat.ts, integrating the @mlc-ai/web-xgrammar library's GrammarMatcher, GrammarCompiler, and TokenizerInfo into the autoregressive decode loop. The implementation handles grammar compilation during prefill, bitmask computation and GPU application during each decode step, and parse state updates after each sampled token.
Description
The implementation spans three key locations in src/llm_chat.ts:
1. Grammar Initialization in prefillStep() (Lines 616-681)
During the prefill phase, the engine initializes the grammar matcher concurrently with prompt processing:
- If `response_format.type` is `"json_object"`, `"grammar"`, or `"structural_tag"`, a grammar initialization promise is created.
- Cache check: if the response format key matches the cached key and a `GrammarMatcher` already exists, the matcher is simply reset via `grammarMatcher.reset()` -- no recompilation is needed.
- Fresh initialization: otherwise, the existing matcher is disposed and a new one is created:
  - A `TokenizerInfo` is created from the tokenizer's raw token table (only once, then cached).
  - A `GrammarCompiler` is created from the `TokenizerInfo` (only once, then cached).
  - The schema/grammar is compiled into a `CompiledGrammar` using the appropriate compiler method.
  - A `GrammarMatcher` is created from the `CompiledGrammar`.
  - The `CompiledGrammar` is disposed after the matcher is created.
- Compilation time is recorded in `curRoundGrammarInitTotalTime`.
2. Bitmask Application in sampleTokenFromLogits() (Lines 1207-1254)
During each decode step, the grammar bitmask is applied to logits on GPU:
- A boolean `grammarConstrained` is computed from `response_format.type`.
- `grammarMatcher.getNextTokenBitmask()` returns an `Int32Array` representing the valid-token bitmask.
- The bitmask is copied to the GPU as an `int32` tensor.
- `fapplyBitmask(logitsOnGPU, seqIdsArray, bitMaskOnGPU)` applies the mask in place, setting the logits of invalid tokens to `-Infinity`.
- Per-token grammar time is accumulated in `curRoundGrammarPerTokenTotalTime`.
3. Token Acceptance (Lines 1472-1484)
After sampling, the grammar matcher state is updated:
- `grammarMatcher.acceptToken(sampledToken)` advances the parse state.
- If the token is rejected (`acceptToken` returns `false`), an error is thrown -- this should never happen, because the bitmask already excluded invalid tokens.
- Accept time is accumulated in `curRoundGrammarPerTokenTotalTime`.
Code Reference
Grammar Initialization (prefillStep)
Source: src/llm_chat.ts, lines 616-681
// -1. Instantiate grammar matcher according to generation config.
// This step is overlapped with prefilling the prompt to hide overhead.
let grammarMatcherInitPromise: Promise<void> | undefined = undefined;
const responseFormat = genConfig?.response_format;
if (
  responseFormat?.type === "json_object" ||
  responseFormat?.type === "grammar" ||
  responseFormat?.type === "structural_tag"
) {
  const curResponseFormatKey = this.getResponseFormatKey(responseFormat);
  if (
    curResponseFormatKey === this.responseFormatCacheKey &&
    this.grammarMatcher
  ) {
    // Reuse cached grammar matcher -- just reset its state
    const tGrammarInitStart = performance.now();
    log.info("Reuse grammar matcher.");
    this.grammarMatcher.reset();
    this.curRoundGrammarInitTotalTime =
      (performance.now() - tGrammarInitStart) / 1e3;
  } else {
    // Dispose current, reinitialize, and update cache key
    grammarMatcherInitPromise = new Promise(async (resolve) => {
      const tGrammarInitStart = performance.now();
      log.info("Initialize new grammar matcher.");
      if (this.grammarMatcher) {
        this.grammarMatcher.dispose();
      }
      if (this.xgTokenizerInfo === undefined) {
        log.info("Initialize token table.");
        const rawTokenTable = getTokenTableFromTokenizer(this.tokenizer);
        this.xgTokenizerInfo = await xgr.TokenizerInfo.createTokenizerInfo(
          rawTokenTable,
          this.token_postproc_method,
          this.prepend_space_in_encode,
          this.fullVocabSize,
          this.stopTokens,
        );
        this.grammarCompiler =
          await xgr.GrammarCompiler.createGrammarCompiler(
            this.xgTokenizerInfo,
          );
      }
      // "json_object" without a schema falls back to the builtin JSON grammar
      const grammar: xgr.CompiledGrammar =
        responseFormat.type === "json_object"
          ? responseFormat.schema === undefined
            ? await this.grammarCompiler!.compileBuiltinJSONGrammar()
            : await this.grammarCompiler!.compileJSONSchema(
                responseFormat.schema,
              )
          : responseFormat.type === "grammar"
            ? await this.grammarCompiler!.compileGrammar(
                responseFormat.grammar!,
              )
            : await this.grammarCompiler!.compileStructuralTag(
                responseFormat.structural_tag!,
              );
      this.grammarMatcher =
        await xgr.GrammarMatcher.createGrammarMatcher(grammar);
      grammar.dispose();
      this.responseFormatCacheKey = curResponseFormatKey;
      this.curRoundGrammarInitTotalTime =
        (performance.now() - tGrammarInitStart) / 1e3;
      resolve();
    });
  }
}
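The cache check above hinges on `getResponseFormatKey()` producing a stable key for a given response format, so that identical requests reuse the compiled matcher. Its exact implementation is not reproduced here; a hypothetical sketch of the idea, assuming a JSON-serialization-based key (the function and interface names below are illustrative, not web-llm's):

```typescript
// Hypothetical sketch: derive a stable cache key from a response_format
// object so that identical formats reuse the compiled GrammarMatcher.
// The real getResponseFormatKey in src/llm_chat.ts may differ.
interface ResponseFormatLike {
  type?: string;
  schema?: string;
  grammar?: string;
  structural_tag?: unknown;
}

function responseFormatKey(rf: ResponseFormatLike): string {
  // JSON.stringify over a fixed field order keeps the key deterministic;
  // undefined fields are dropped by JSON.stringify, which is fine here.
  return JSON.stringify({
    type: rf.type,
    schema: rf.schema,
    grammar: rf.grammar,
    structural_tag: rf.structural_tag,
  });
}

const a = responseFormatKey({ type: "json_object", schema: '{"type":"object"}' });
const b = responseFormatKey({ type: "json_object", schema: '{"type":"object"}' });
// Identical formats yield identical keys, so the matcher would be reused.
console.log(a === b); // true
```

Any keying scheme works as long as two formats that compile to the same grammar map to the same key; serializing the whole object is the simplest conservative choice.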
Bitmask Application (sampleTokenFromLogits)
Source: src/llm_chat.ts, lines 1207-1254
const grammarConstrained =
  response_format?.type === "json_object" ||
  response_format?.type === "grammar" ||
  response_format?.type === "structural_tag";

// 0. Update logitsOnGPU with on-GPU grammar bitmasking
if (grammarConstrained) {
  const grammarBitmaskBegin = performance.now();
  this.tvm.beginScope();
  if (this.grammarMatcher === undefined) {
    throw Error("Expect grammar matcher to be initialized.");
  }
  const tBitmaskStart = performance.now();
  const bitMaskOnCPU: Int32Array =
    await this.grammarMatcher.getNextTokenBitmask();
  this.curRoundGrammarPerTokenTotalTime +=
    (performance.now() - tBitmaskStart) / 1e3;
  if (bitMaskOnCPU.length !== this.bitmaskSize) {
    throw new Error(
      `InternalError: Expect grammar bitmask to be ` +
        `size ${this.bitmaskSize}, but got ${bitMaskOnCPU.length}.`,
    );
  }
  const bitMaskOnGPU = this.tvm
    .empty([1, this.bitmaskSize], "int32", this.device)
    .copyFrom(bitMaskOnCPU);
  const seqIdsArray = this.tvm
    .empty([1], "int32", this.device)
    .copyFrom([0]);
  this.fapplyBitmask(
    logitsOnGPU.view([1, this.fullVocabSize]),
    seqIdsArray,
    bitMaskOnGPU,
  );
  this.tvm.endScope();
}
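For intuition about what `fapplyBitmask` computes on the GPU, here is a CPU-equivalent sketch, assuming xgrammar's packing of 32 token flags per `int32` word (so `bitmaskSize` is roughly `fullVocabSize / 32`, rounded up). This is an illustration, not the TVM kernel itself:

```typescript
// CPU-equivalent sketch of the fapplyBitmask semantics.
// Assumption: bit t of the mask (word index t >> 5, bit position t & 31)
// is 1 iff token t is currently allowed by the grammar.
function applyBitmaskCPU(logits: Float32Array, bitmask: Int32Array): void {
  for (let t = 0; t < logits.length; t++) {
    const allowed = (bitmask[t >> 5] >> (t & 31)) & 1;
    if (!allowed) logits[t] = -Infinity; // masked tokens can never be sampled
  }
}

// Toy vocabulary of 8 tokens; only tokens 1 and 3 are allowed
// (mask word 0b1010 sets bits 1 and 3).
const logits = new Float32Array([0.5, 1.2, -0.3, 2.0, 0.1, 0.9, 0.0, 0.4]);
applyBitmaskCPU(logits, new Int32Array([0b1010]));
console.log(logits[0]); // -Infinity (token 0 was masked out)
```

Because masked logits become `-Infinity`, they receive zero probability after softmax, so any downstream sampling strategy (greedy, top-p, temperature) can only pick grammar-valid tokens.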
Token Acceptance
Source: src/llm_chat.ts, lines 1472-1484
// 6. Update grammar matcher with new token
if (grammarConstrained) {
  if (this.grammarMatcher === undefined) {
    throw Error("Expect grammar matcher to be initialized.");
  }
  const tAcceptStart = performance.now();
  const accepted = this.grammarMatcher.acceptToken(sampledToken);
  this.curRoundGrammarPerTokenTotalTime +=
    (performance.now() - tAcceptStart) / 1e3;
  if (!accepted) {
    throw Error("Grammar matcher rejected the newly sampled token.");
  }
}
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | `ChatCompletionRequest` with `response_format` | A request containing a schema, grammar, or structural tag definition |
| Output | `ChatCompletion` | A completion response whose `choices[].message.content` is guaranteed to conform to the specified grammar |
Internal Data Flow
| Step | Operation | Data Type | Location |
|---|---|---|---|
| 1 | Schema compilation | `ResponseFormat` -> `CompiledGrammar` -> `GrammarMatcher` | `prefillStep()` |
| 2 | Bitmask computation | `GrammarMatcher` -> `Int32Array` | `sampleTokenFromLogits()` |
| 3 | GPU bitmask application | `Int32Array` -> tvmjs tensor -> applied to logits | `sampleTokenFromLogits()` |
| 4 | Token sampling | Masked logits -> `number` (token ID) | `sampleTokenFromLogits()` |
| 5 | State update | `number` -> `grammarMatcher.acceptToken()` | `sampleTokenFromLogits()` |
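The five steps above can be simulated end to end with a toy matcher. The following is a self-contained sketch, not web-llm code: the "grammar" only accepts digit tokens, sampling is greedy argmax, and `acceptToken` stands in for the real parse-state update:

```typescript
// Self-contained simulation of the data-flow table above.
// Toy vocab: token IDs 0-9 are digits "0"-"9", token 10 is "a".
const VOCAB = 11;

// Step 1 stand-in: a "compiled grammar" that only allows digit tokens.
const digitMatcher = {
  accepted: [] as number[],
  getNextTokenBitmask(): Int32Array {
    let word = 0;
    for (let t = 0; t < 10; t++) word |= 1 << t; // digits allowed, "a" masked
    return new Int32Array([word]);
  },
  acceptToken(t: number): boolean {
    if (t >= 10) return false; // unreachable after masking, mirroring web-llm
    this.accepted.push(t);
    return true;
  },
};

// Steps 2-5: compute mask, apply it to logits, greedy-sample, update state.
function decodeStep(logits: Float32Array): number {
  const mask = digitMatcher.getNextTokenBitmask();
  for (let t = 0; t < VOCAB; t++) {
    if (!((mask[t >> 5] >> (t & 31)) & 1)) logits[t] = -Infinity;
  }
  let best = 0;
  for (let t = 1; t < VOCAB; t++) if (logits[t] > logits[best]) best = t;
  if (!digitMatcher.acceptToken(best)) throw new Error("matcher rejected token");
  return best;
}

// "a" (token 10) has the highest raw logit, but the mask forces a digit.
const logits = new Float32Array(VOCAB).fill(0);
logits[10] = 5.0; // would win unmasked
logits[7] = 3.0; // best allowed token
console.log(decodeStep(logits)); // 7
```

The real pipeline differs only in scale and placement: the mask is applied on the GPU over the full vocabulary, and sampling honors temperature/top-p, but the control flow is the same.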
External Dependencies
| Package | Version | Components Used |
|---|---|---|
| `@mlc-ai/web-xgrammar` | 0.1.27 | `GrammarMatcher`, `GrammarCompiler`, `TokenizerInfo`, `CompiledGrammar` |
| `@mlc-ai/web-runtime` | (bundled) | TVM runtime for GPU tensor operations (`tvm.empty()`, `fapplyBitmask()`) |
Performance Metrics
The implementation records two performance metrics available in usage.extra of the response:
- `grammar_init_s` -- Total seconds spent initializing the grammar matcher (compilation plus matcher creation). If `n > 1` (multiple choices), this is the sum across all choices.
- `grammar_per_token_s` -- Average seconds per token spent on bitmask generation and token acceptance. If `n > 1`, this is the average across all choices.
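These two figures can be combined to estimate the total grammar overhead of a round. A sketch, assuming a usage object shaped like the one returned by `engine.chat.completions.create` (the helper and interface names are illustrative):

```typescript
// Estimate total seconds spent on grammar work for one completion.
// Assumption: `extra` carries grammar_init_s and grammar_per_token_s as
// described above, and completion_tokens counts decoded tokens.
interface UsageLike {
  completion_tokens: number;
  extra?: { grammar_init_s?: number; grammar_per_token_s?: number };
}

function grammarOverheadSeconds(usage: UsageLike): number {
  const init = usage.extra?.grammar_init_s ?? 0;
  const perToken = usage.extra?.grammar_per_token_s ?? 0;
  // The per-token figure is an average, so multiply by the decoded count.
  return init + perToken * usage.completion_tokens;
}

const usage: UsageLike = {
  completion_tokens: 100,
  extra: { grammar_init_s: 0.25, grammar_per_token_s: 0.002 },
};
console.log(grammarOverheadSeconds(usage)); // 0.45
```

Note that `grammar_init_s` is a one-time cost per round (and often near zero when the cached matcher is reused), while the per-token cost scales with output length.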
Usage Examples
Observing Grammar Metrics in a Non-Streaming Request
import * as webllm from "@mlc-ai/web-llm";

const engine = await webllm.CreateMLCEngine(
  "Llama-3.2-3B-Instruct-q4f16_1-MLC",
  { logLevel: "INFO" },
);

const schema = JSON.stringify({
  type: "object",
  properties: {
    name: { type: "string" },
    house: {
      type: "string",
      enum: ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"],
    },
    alive: { type: "boolean" },
  },
  required: ["name", "house", "alive"],
});

const reply = await engine.chat.completions.create({
  stream: false,
  messages: [
    {
      role: "user",
      content: "Fill in info about Harry Potter in JSON format.",
    },
  ],
  max_tokens: 128,
  response_format: {
    type: "json_object",
    schema: schema,
  } as webllm.ResponseFormat,
});

// The output is guaranteed to be valid JSON conforming to the schema
console.log("Content:", reply.choices[0].message.content);

// Grammar-specific performance metrics
const extra = reply.usage?.extra;
if (extra) {
  console.log("Grammar init time:", extra.grammar_init_s, "seconds");
  console.log("Grammar per-token time:", extra.grammar_per_token_s, "seconds");
}
Streaming with Grammar Constraints
Grammar-constrained decoding also works with streaming responses. The grammar bitmask is applied at each decode step regardless of whether output is streamed:
import * as webllm from "@mlc-ai/web-llm";

const engine = await webllm.CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");

const schema = JSON.stringify({
  type: "object",
  properties: {
    city: { type: "string" },
    temperature: { type: "number" },
    conditions: { type: "string" },
  },
  required: ["city", "temperature", "conditions"],
});

const chunks = await engine.chat.completions.create({
  stream: true,
  messages: [
    { role: "user", content: "Give me weather info for Tokyo in JSON." },
  ],
  max_tokens: 128,
  response_format: {
    type: "json_object",
    schema: schema,
  } as webllm.ResponseFormat,
});

let fullContent = "";
for await (const chunk of chunks) {
  const delta = chunk.choices[0]?.delta?.content ?? "";
  fullContent += delta;
  process.stdout.write(delta); // Node-only; update the DOM instead in a browser
}

// fullContent is guaranteed to be valid JSON matching the schema
const parsed = JSON.parse(fullContent);
console.log("\nParsed:", parsed);
Related Pages
- Principle: Grammar-Constrained Decoding -- Principle:Mlc_ai_Web_llm_Grammar_Constrained_Decoding
- Implementation: Response Format -- The interface that specifies which grammar to compile
- Implementation: JSON Parse Output -- Pattern for parsing the output produced by this decoding process
- Environment:Mlc_ai_Web_llm_WebGPU_Browser_Runtime
- Heuristic:Mlc_ai_Web_llm_Grammar_Matcher_Reuse
- Heuristic:Mlc_ai_Web_llm_Tokenizer_JSON_Preference