
Heuristic: mlc-ai/web-llm Grammar Matcher Reuse

From Leeroopedia
Knowledge Sources
Domains Optimization, Structured_Output, LLMs
Last Updated 2026-02-14 22:00 GMT

Overview

A performance optimization in WebLLM that reuses the grammar matcher across multiple structured-output requests sharing the same schema, avoiding expensive reinitialization.

Description

When using JSON mode or structured output (via `response_format`), WebLLM creates a GrammarMatcher from the xgrammar library to constrain token generation. Grammar initialization involves processing the full token table and compiling the grammar, which is expensive (100-500ms). WebLLM automatically caches the grammar matcher by schema key: if the next request uses the same schema, it calls `grammarMatcher.reset()` instead of disposing and reinitializing, reducing grammar setup time to ~1-10ms.
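The decision flow described above can be modeled with a small self-contained sketch. The class and counter names here are hypothetical (the real logic lives in WebLLM's `src/llm_chat.ts`, excerpted below under Reasoning); the point is the branch: same key means a cheap reset, a new key means dispose plus full reinitialization.

```typescript
// Illustrative model of WebLLM's reuse decision; GrammarCacheModel and
// its counters are hypothetical names, not the library's API.
type ResponseFormat = { type: string; schema?: string };

class GrammarCacheModel {
  private cacheKey: string | null = null;
  initCount = 0;  // full GrammarMatcher initializations (~100-500ms each)
  resetCount = 0; // cheap matcher resets (~1-10ms each)

  request(responseFormat: ResponseFormat): void {
    // The key is derived from the full response format specification.
    const key = JSON.stringify(responseFormat);
    if (key === this.cacheKey) {
      this.resetCount++; // reuse path: grammarMatcher.reset()
    } else {
      this.initCount++; // new grammar: dispose + full reinit
      this.cacheKey = key;
    }
  }
}

const cache = new GrammarCacheModel();
const sameSchema = { type: "json_object", schema: '{"type":"object"}' };
cache.request(sameSchema); // first request: full init
cache.request(sameSchema); // same schema: cheap reset
cache.request({ type: "json_object", schema: '{"type":"array"}' }); // new schema: full init
```

After the three requests, the model records two full initializations and one reset, mirroring the reuse behavior described above.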

Usage

Use this heuristic when building applications that make repeated structured output requests with the same JSON schema. Keep the `response_format` object consistent across calls to benefit from the cache. Changing the schema forces a full reinitialization.

The Insight (Rule of Thumb)

  • Action: Keep the same `response_format` schema across multiple chat completion requests to trigger grammar matcher reuse.
  • Value: Reuse path: ~1-10ms grammar init. New grammar path: ~100-500ms grammar init.
  • Trade-off: None when schemas are consistent. If schemas change frequently, each change incurs the full initialization cost.
  • Compatibility: Works with `json_schema`, `json_object`, and `structural_tag` response format types.
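A back-of-envelope calculation shows how quickly the reuse path amortizes. The costs below are assumed midpoints of the ranges quoted above (300 ms full init, 5 ms reset), purely for illustration:

```typescript
// Assumed midpoint costs, taken from the ranges in the bullets above.
const FULL_INIT_MS = 300;
const RESET_MS = 5;
const requests = 20;

// Same schema throughout: one full init, then (N - 1) cheap resets.
const withReuse = FULL_INIT_MS + (requests - 1) * RESET_MS;
// Schema changes on every request: each call pays the full init cost.
const withoutReuse = requests * FULL_INIT_MS;

console.log(withReuse, withoutReuse); // 395 6000
```

For 20 same-schema requests, total grammar-init time drops from 6 seconds to under 0.4 seconds, a roughly 15x saving under these assumed costs.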

Reasoning

Grammar compilation involves converting a JSON schema (or grammar specification) into a compiled grammar, then creating a GrammarMatcher that uses a pre-built token table. The token table processing (`TokenizerInfo.createTokenizerInfo`) and grammar compilation (`GrammarCompiler.createGrammarCompiler`) are one-time costs amortized across reuses. The cache key is derived from the full response format specification, so identical schemas always hit the cache.
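A consequence of a content-derived cache key is that the caller does not need to pass the same object instance: structurally identical response formats produce the same key. The sketch below assumes plain JSON serialization as the key function (`keyOf` is a hypothetical stand-in for WebLLM's internal `getResponseFormatKey`):

```typescript
// Hypothetical key derivation: assume the cache key is the JSON
// serialization of the full response format specification.
type ResponseFormat = { type: string; schema?: string };

const keyOf = (rf: ResponseFormat): string => JSON.stringify(rf);

// Two structurally identical objects (distinct instances) yield the
// same key, so rebuilding the literal on each call still hits the cache.
const k1 = keyOf({ type: "json_object", schema: '{"type":"object"}' });
const k2 = keyOf({ type: "json_object", schema: '{"type":"object"}' });
// A different schema yields a different key and forces a full reinit.
const k3 = keyOf({ type: "json_object", schema: '{"type":"array"}' });
```

Under this assumption, `k1 === k2` while `k3` differs, which is why the Usage section recommends keeping the `response_format` content consistent rather than worrying about object identity.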

Grammar matcher reuse logic from `src/llm_chat.ts:622-657`:

const curResponseFormatKey = this.getResponseFormatKey(responseFormat);
if (
  curResponseFormatKey === this.responseFormatCacheKey &&
  this.grammarMatcher
) {
  // If we did not change the schema and have instantiated a GrammarMatcher, we reuse it.
  const tGrammarInitStart = performance.now();
  log.info("Reuse grammar matcher.");
  this.grammarMatcher.reset();
  this.curRoundGrammarInitTotalTime =
    (performance.now() - tGrammarInitStart) / 1e3;
} else {
  // Else dispose current grammarMatcher, reinitialize, and update this.schema.
  log.info("Initialize new grammar matcher.");
  if (this.grammarMatcher) {
    this.grammarMatcher.dispose();
  }
  // ... full initialization with TokenizerInfo and GrammarCompiler ...
}
