Implementation:Mlc ai Web llm Grammar Matcher Decoding

From Leeroopedia


Overview

Grammar Matcher Decoding is the concrete implementation of grammar-constrained token masking during LLM decoding in @mlc-ai/web-llm. It is implemented within the LLMChatPipeline class in src/llm_chat.ts, integrating the @mlc-ai/web-xgrammar library's GrammarMatcher, GrammarCompiler, and TokenizerInfo into the autoregressive decode loop. The implementation handles grammar compilation during prefill, bitmask computation and GPU application during each decode step, and parse state updates after each sampled token.

Description

The implementation spans three key locations in src/llm_chat.ts:

1. Grammar Initialization in prefillStep() (Lines 616-681)

During the prefill phase, the engine initializes the grammar matcher concurrently with prompt processing:

  • If response_format.type is "json_object", "grammar", or "structural_tag", a grammar initialization promise is created.
  • Cache check: If the response format key matches the cached key and a GrammarMatcher already exists, the matcher is simply reset via grammarMatcher.reset() -- no recompilation needed.
  • Fresh initialization: Otherwise, the existing matcher is disposed and a new one is created:
    • TokenizerInfo is created from the tokenizer's raw token table (only once, then cached).
    • GrammarCompiler is created from the TokenizerInfo (only once, then cached).
    • The schema/grammar is compiled into a CompiledGrammar using the appropriate compiler method.
    • A GrammarMatcher is created from the CompiledGrammar.
    • The CompiledGrammar is disposed after the matcher is created.
  • Compilation time is recorded in curRoundGrammarInitTotalTime.
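The reuse path above hinges on a deterministic key derived from the response format. A minimal sketch of how such a key could be computed (the actual getResponseFormatKey() in src/llm_chat.ts may differ; ResponseFormatLike and responseFormatKey are illustrative names):

```typescript
// Hypothetical sketch: derive a stable cache key from a response_format
// object so that an unchanged format reuses the compiled grammar matcher.
// The real getResponseFormatKey() in src/llm_chat.ts may compute this
// differently.
interface ResponseFormatLike {
  type?: string;
  schema?: string;
  grammar?: string;
}

function responseFormatKey(rf: ResponseFormatLike): string {
  // Serializing the fields in a fixed order keeps the key deterministic,
  // so structurally identical formats always map to the same string.
  return JSON.stringify([rf.type ?? "", rf.schema ?? "", rf.grammar ?? ""]);
}

// Two identical formats share a key, so the engine can skip
// recompilation and just reset the existing matcher.
const a = responseFormatKey({ type: "json_object", schema: "{}" });
const b = responseFormatKey({ type: "json_object", schema: "{}" });
// a === b
```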

2. Bitmask Application in sampleTokenFromLogits() (Lines 1207-1254)

During each decode step, the grammar bitmask is applied to logits on GPU:

  • A boolean grammarConstrained is computed from response_format.type.
  • grammarMatcher.getNextTokenBitmask() returns an Int32Array representing the valid token bitmask.
  • The bitmask is copied to GPU as an int32 tensor.
  • fapplyBitmask(logitsOnGPU, seqIdsArray, bitMaskOnGPU) applies the mask in place, setting invalid token logits to -infinity.
  • Per-token grammar time is accumulated in curRoundGrammarPerTokenTotalTime.
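Each int32 element of the bitmask packs 32 vocabulary entries: assuming XGrammar's packing convention, bit j of word i guards token id 32*i + j. A CPU-side reference of the masking step helps make this concrete (applyBitmaskCPU is an illustrative stand-in; the real fapplyBitmask kernel runs on GPU via TVM):

```typescript
// Illustrative CPU reference of the GPU bitmask kernel. Each int32
// packs 32 tokens (bit j of word i guards token 32 * i + j). Tokens
// whose bit is 0 get their logit forced to -Infinity so they can
// never be sampled. The real fapplyBitmask runs on GPU.
function applyBitmaskCPU(logits: Float32Array, bitmask: Int32Array): void {
  for (let tokenId = 0; tokenId < logits.length; tokenId++) {
    const word = bitmask[tokenId >> 5];       // tokenId / 32
    const bit = (word >> (tokenId & 31)) & 1; // tokenId % 32
    if (bit === 0) logits[tokenId] = -Infinity;
  }
}

// Toy vocabulary of 4 tokens; only tokens 0 and 2 are grammar-legal.
const toyLogits = new Float32Array([1.0, 2.0, 3.0, 4.0]);
const toyBitmask = new Int32Array([0b0101]); // bits 0 and 2 set
applyBitmaskCPU(toyLogits, toyBitmask);
// toyLogits is now [1, -Infinity, 3, -Infinity]
```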

3. Token Acceptance (Lines 1472-1484)

After sampling, the grammar matcher state is updated:

  • grammarMatcher.acceptToken(sampledToken) advances the parse state.
  • If the token is rejected (returns false), an error is thrown -- this should never happen because the bitmask already excluded invalid tokens.
  • Accept time is accumulated in curRoundGrammarPerTokenTotalTime.

Code Reference

Grammar Initialization (prefillStep)

Source: src/llm_chat.ts, lines 616-681

// -1. Instantiate grammar matcher according to generation config.
// This step is overlapped with prefilling the prompt to hide overhead.
let grammarMatcherInitPromise: Promise<void> | undefined = undefined;
const responseFormat = genConfig?.response_format;
if (
  responseFormat?.type === "json_object" ||
  responseFormat?.type === "grammar" ||
  responseFormat?.type === "structural_tag"
) {
  const curResponseFormatKey = this.getResponseFormatKey(responseFormat);
  if (
    curResponseFormatKey === this.responseFormatCacheKey &&
    this.grammarMatcher
  ) {
    // Reuse cached grammar matcher -- just reset its state
    const tGrammarInitStart = performance.now();
    log.info("Reuse grammar matcher.");
    this.grammarMatcher.reset();
    this.curRoundGrammarInitTotalTime =
      (performance.now() - tGrammarInitStart) / 1e3;
  } else {
    // Dispose current, reinitialize, and update cache key
    grammarMatcherInitPromise = new Promise(async (resolve) => {
      const tGrammarInitStart = performance.now();
      log.info("Initialize new grammar matcher.");
      if (this.grammarMatcher) {
        this.grammarMatcher.dispose();
      }
      if (this.xgTokenizerInfo === undefined) {
        log.info("Initialize token table.");
        const rawTokenTable = getTokenTableFromTokenizer(this.tokenizer);
        this.xgTokenizerInfo = await xgr.TokenizerInfo.createTokenizerInfo(
          rawTokenTable,
          this.token_postproc_method,
          this.prepend_space_in_encode,
          this.fullVocabSize,
          this.stopTokens,
        );
        this.grammarCompiler =
          await xgr.GrammarCompiler.createGrammarCompiler(
            this.xgTokenizerInfo,
          );
      }
      const grammar: xgr.CompiledGrammar =
        responseFormat.type === undefined
          ? await this.grammarCompiler!.compileBuiltinJSONGrammar()
          : responseFormat.type === "json_object"
            ? await this.grammarCompiler!.compileJSONSchema(
                responseFormat.schema!,
              )
            : responseFormat.type === "grammar"
              ? await this.grammarCompiler!.compileGrammar(
                  responseFormat.grammar!,
                )
              : await this.grammarCompiler!.compileStructuralTag(
                  responseFormat.structural_tag!,
                );
      this.grammarMatcher =
        await xgr.GrammarMatcher.createGrammarMatcher(grammar);
      grammar.dispose();
      this.responseFormatCacheKey = curResponseFormatKey;
      this.curRoundGrammarInitTotalTime =
        (performance.now() - tGrammarInitStart) / 1e3;
      resolve();
    });
  }
}

Bitmask Application (sampleTokenFromLogits)

Source: src/llm_chat.ts, lines 1207-1254

const grammarConstrained =
  response_format?.type === "json_object" ||
  response_format?.type === "grammar" ||
  response_format?.type === "structural_tag";

// 0. Update logitsOnGPU with on-GPU grammar bitmasking
if (grammarConstrained) {
  const grammarBitmaskBegin = performance.now();

  this.tvm.beginScope();
  if (this.grammarMatcher === undefined) {
    throw Error("Expect grammar matcher to be initialized.");
  }

  const tBitmaskStart = performance.now();
  const bitMaskOnCPU: Int32Array =
    await this.grammarMatcher.getNextTokenBitmask();
  this.curRoundGrammarPerTokenTotalTime +=
    (performance.now() - tBitmaskStart) / 1e3;

  if (bitMaskOnCPU.length !== this.bitmaskSize) {
    throw new Error(
      `InternalError: Expect grammar bitmask to be ` +
        `size ${this.bitmaskSize}, but got ${bitMaskOnCPU.length}.`,
    );
  }
  const bitMaskOnGPU = this.tvm
    .empty([1, this.bitmaskSize], "int32", this.device)
    .copyFrom(bitMaskOnCPU);
  const seqIdsArray = this.tvm
    .empty([1], "int32", this.device)
    .copyFrom([0]);
  this.fapplyBitmask(
    logitsOnGPU.view([1, this.fullVocabSize]),
    seqIdsArray,
    bitMaskOnGPU,
  );
  this.tvm.endScope();
}

Token Acceptance

Source: src/llm_chat.ts, lines 1472-1484

// 6. Update grammar matcher with new token
if (grammarConstrained) {
  if (this.grammarMatcher === undefined) {
    throw Error("Expect grammar matcher to be initialized.");
  }
  const tAcceptStart = performance.now();
  const accepted = this.grammarMatcher.acceptToken(sampledToken);
  this.curRoundGrammarPerTokenTotalTime +=
    (performance.now() - tAcceptStart) / 1e3;
  if (!accepted) {
    throw Error("Grammar matcher rejected the newly sampled token.");
  }
}

I/O Contract

  • Input: ChatCompletionRequest with response_format -- a request containing a schema, grammar, or structural tag definition.
  • Output: ChatCompletion -- a completion response whose choices[].message.content is guaranteed to conform to the specified grammar.

Internal Data Flow

  1. Schema compilation: ResponseFormat -> CompiledGrammar -> GrammarMatcher, in prefillStep().
  2. Bitmask computation: GrammarMatcher -> Int32Array, in sampleTokenFromLogits().
  3. GPU bitmask application: Int32Array -> tvmjs.Tensor, applied to logits in sampleTokenFromLogits().
  4. Token sampling: masked logits -> number (token ID), in sampleTokenFromLogits().
  5. State update: number -> grammarMatcher.acceptToken(), in sampleTokenFromLogits().
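The per-token portion of this flow (steps 2-5) can be sketched as a single loop. All names below (MatcherLike, maskAndSample, isStop) are illustrative; the real control flow spans sampleTokenFromLogits() and the surrounding decode loop in src/llm_chat.ts:

```typescript
// Hedged sketch of the per-token control flow in steps 2-5 above.
interface MatcherLike {
  getNextTokenBitmask(): Promise<Int32Array>;
  acceptToken(token: number): boolean;
}

async function decodeWithGrammar(
  matcher: MatcherLike,
  maskAndSample: (bitmask: Int32Array) => number, // steps 3-4, on GPU in reality
  isStop: (token: number) => boolean,
): Promise<number[]> {
  const tokens: number[] = [];
  for (;;) {
    const bitmask = await matcher.getNextTokenBitmask(); // step 2
    const token = maskAndSample(bitmask);                // steps 3-4
    if (!matcher.acceptToken(token)) {                   // step 5
      // Should be unreachable: the bitmask already excluded this token.
      throw new Error("Grammar matcher rejected the newly sampled token.");
    }
    tokens.push(token);
    if (isStop(token)) return tokens;
  }
}
```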

External Dependencies

  • @mlc-ai/web-xgrammar (0.1.27): GrammarMatcher, GrammarCompiler, TokenizerInfo, CompiledGrammar.
  • @mlc-ai/web-runtime (bundled): TVM runtime for GPU tensor operations (tvm.empty(), fapplyBitmask()).

Performance Metrics

The implementation records two performance metrics available in usage.extra of the response:

  • grammar_init_s -- Total seconds spent initializing the grammar matcher (compilation + matcher creation). If n > 1 (multiple choices), this is the sum across all choices.
  • grammar_per_token_s -- Average seconds per token spent on bitmask generation and token acceptance. If n > 1, this is the average across all choices.
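The sum-versus-average distinction can be sketched with a small helper (aggregateGrammarMetrics is an illustrative name; the engine computes these fields internally):

```typescript
// Illustrative aggregation of per-choice grammar timings into the
// usage.extra fields described above: init time sums across choices,
// per-token time averages across choices.
function aggregateGrammarMetrics(
  perChoiceInitSeconds: number[],
  perChoicePerTokenSeconds: number[],
): { grammar_init_s: number; grammar_per_token_s: number } {
  const sum = (xs: number[]) => xs.reduce((acc, x) => acc + x, 0);
  return {
    grammar_init_s: sum(perChoiceInitSeconds),
    grammar_per_token_s:
      sum(perChoicePerTokenSeconds) / perChoicePerTokenSeconds.length,
  };
}

// n = 2 choices: init times are summed, per-token times are averaged.
const metrics = aggregateGrammarMetrics([0.1, 0.3], [0.002, 0.004]);
// metrics.grammar_init_s ≈ 0.4, metrics.grammar_per_token_s ≈ 0.003
```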

Usage Examples

Observing Grammar Metrics in a Non-Streaming Request

import * as webllm from "@mlc-ai/web-llm";

const engine = await webllm.CreateMLCEngine(
  "Llama-3.2-3B-Instruct-q4f16_1-MLC",
  { logLevel: "INFO" },
);

const schema = JSON.stringify({
  type: "object",
  properties: {
    name: { type: "string" },
    house: {
      type: "string",
      enum: ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"],
    },
    alive: { type: "boolean" },
  },
  required: ["name", "house", "alive"],
});

const reply = await engine.chat.completions.create({
  stream: false,
  messages: [
    {
      role: "user",
      content: "Fill in info about Harry Potter in JSON format.",
    },
  ],
  max_tokens: 128,
  response_format: {
    type: "json_object",
    schema: schema,
  } as webllm.ResponseFormat,
});

// The output is guaranteed to be valid JSON conforming to the schema
console.log("Content:", reply.choices[0].message.content);

// Grammar-specific performance metrics
const extra = reply.usage?.extra;
if (extra) {
  console.log("Grammar init time:", extra.grammar_init_s, "seconds");
  console.log("Grammar per-token time:", extra.grammar_per_token_s, "seconds");
}

Streaming with Grammar Constraints

Grammar-constrained decoding also works with streaming responses. The grammar bitmask is applied at each decode step regardless of whether output is streamed:

import * as webllm from "@mlc-ai/web-llm";

const engine = await webllm.CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");

const schema = JSON.stringify({
  type: "object",
  properties: {
    city: { type: "string" },
    temperature: { type: "number" },
    conditions: { type: "string" },
  },
  required: ["city", "temperature", "conditions"],
});

const chunks = await engine.chat.completions.create({
  stream: true,
  messages: [
    { role: "user", content: "Give me weather info for Tokyo in JSON." },
  ],
  max_tokens: 128,
  response_format: {
    type: "json_object",
    schema: schema,
  } as webllm.ResponseFormat,
});

let fullContent = "";
for await (const chunk of chunks) {
  const delta = chunk.choices[0]?.delta?.content ?? "";
  fullContent += delta;
  process.stdout.write(delta);
}

// fullContent is guaranteed to be valid JSON matching the schema
const parsed = JSON.parse(fullContent);
console.log("\nParsed:", parsed);
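Besides JSON schemas, the "grammar" response format type accepts an EBNF grammar string, compiled via compileGrammar() during prefill. A sketch of such a request body, assuming XGrammar's EBNF dialect (the toy grammar and prompt below are illustrative):

```typescript
// Illustrative request options for the "grammar" response format type.
// The toy EBNF grammar below only admits the strings "yes" or "no";
// it would be compiled by compileGrammar() during prefill.
const yesNoGrammar = `root ::= "yes" | "no"`;

const requestOptions = {
  stream: false,
  messages: [
    { role: "user" as const, content: "Is the sky blue? Answer yes or no." },
  ],
  max_tokens: 8,
  response_format: {
    type: "grammar" as const,
    grammar: yesNoGrammar,
  },
};

// Passed to engine.chat.completions.create(requestOptions), the reply
// content would be constrained to exactly "yes" or "no".
```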
