Heuristic: MLC-AI WebLLM Grammar Matcher Reuse
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Structured_Output, LLMs |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
A performance optimization that reuses the grammar matcher across multiple structured output requests with the same schema, avoiding expensive reinitialization.
Description
When using JSON mode or structured output (via `response_format`), WebLLM creates a GrammarMatcher from the xgrammar library to constrain token generation. Grammar initialization involves processing the full token table and compiling the grammar, which is expensive (100-500ms). WebLLM automatically caches the grammar matcher by schema key: if the next request uses the same schema, it calls `grammarMatcher.reset()` instead of disposing and reinitializing, reducing grammar setup time to ~1-10ms.
Usage
Use this heuristic when building applications that make repeated structured output requests with the same JSON schema. Keep the `response_format` object consistent across calls to benefit from the cache. Changing the schema forces a full reinitialization.
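A minimal sketch of why consistency matters, assuming (hypothetically) that the cache key behaves like a plain serialization of the `response_format` object. Under that assumption, two schemas that are semantically equivalent but serialize differently, e.g. with properties in a different order, would still miss the cache. The `cacheKey` helper and `ResponseFormat` type here are illustrative stand-ins, not WebLLM's actual implementation.

```typescript
// Hypothetical stand-in for the response format shape and key derivation.
type ResponseFormat = { type: string; schema?: string };

function cacheKey(rf: ResponseFormat): string {
  // Assumption: the key is serialization-based, so key order matters.
  return JSON.stringify(rf);
}

// Reusing one shared response_format object guarantees identical keys.
const sharedFormat: ResponseFormat = {
  type: "json_object",
  schema: JSON.stringify({
    type: "object",
    properties: { name: { type: "string" } },
  }),
};

// A re-built schema with keys in a different order serializes differently
// and would miss a serialization-based cache, forcing full reinitialization.
const reorderedFormat: ResponseFormat = {
  type: "json_object",
  schema: JSON.stringify({
    properties: { name: { type: "string" } },
    type: "object",
  }),
};

const hit = cacheKey(sharedFormat) === cacheKey(sharedFormat); // reuse path
const miss = cacheKey(sharedFormat) === cacheKey(reorderedFormat); // reinit
```

The safest pattern is to build the `response_format` object once and pass the same reference to every request, rather than reconstructing it per call.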
The Insight (Rule of Thumb)
- Action: Keep the same `response_format` schema across multiple chat completion requests to trigger grammar matcher reuse.
- Value: Reuse path: ~1-10ms grammar init. New grammar path: ~100-500ms grammar init.
- Trade-off: None when schemas are consistent. If schemas change frequently, each change incurs the full initialization cost.
- Compatibility: Works with `json_schema`, `json_object`, and `structural_tag` response format types.
Reasoning
Grammar compilation involves converting a JSON schema (or grammar specification) into a compiled grammar, then creating a GrammarMatcher that uses a pre-built token table. The token table processing (`TokenizerInfo.createTokenizerInfo`) and grammar compilation (`GrammarCompiler.createGrammarCompiler`) are one-time costs amortized across reuses. The cache key is derived from the full response format specification, so identical schemas always hit the cache.
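The amortization can be made concrete with back-of-envelope numbers, taking the midpoints of the ranges quoted above (~300 ms for a full initialization, ~5 ms for a reset; both purely illustrative):

```typescript
// Average per-request grammar init cost when only the first request pays
// the full initialization and the remaining (requests - 1) hit the cache.
function avgGrammarInitMs(
  requests: number,
  fullMs = 300,
  reuseMs = 5,
): number {
  if (requests < 1) throw new Error("need at least one request");
  return (fullMs + (requests - 1) * reuseMs) / requests;
}

avgGrammarInitMs(1); // 300 ms: a single request pays the full cost
avgGrammarInitMs(10); // 34.5 ms: cost amortized across nine cache hits
```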
Grammar matcher reuse logic from `src/llm_chat.ts:622-657`:
```typescript
const curResponseFormatKey = this.getResponseFormatKey(responseFormat);
if (
  curResponseFormatKey === this.responseFormatCacheKey &&
  this.grammarMatcher
) {
  // If we did not change the schema and have instantiated a GrammarMatcher, we reuse it.
  const tGrammarInitStart = performance.now();
  log.info("Reuse grammar matcher.");
  this.grammarMatcher.reset();
  this.curRoundGrammarInitTotalTime =
    (performance.now() - tGrammarInitStart) / 1e3;
} else {
  // Else dispose current grammarMatcher, reinitialize, and update this.schema.
  log.info("Initialize new grammar matcher.");
  if (this.grammarMatcher) {
    this.grammarMatcher.dispose();
  }
  // ... full initialization with TokenizerInfo and GrammarCompiler ...
}
```
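The branch above can be sketched as a standalone cache, with `MockMatcher` and `GrammarCache` as hypothetical stand-ins for xgrammar's `GrammarMatcher` and WebLLM's initialization path:

```typescript
// Stand-in for xgrammar's GrammarMatcher: reset is cheap, creation is not.
class MockMatcher {
  reset(): void {}
  dispose(): void {}
}

class GrammarCache {
  private cacheKey: string | undefined;
  private matcher: MockMatcher | undefined;
  inits = 0; // expensive path taken (~100-500 ms in WebLLM)
  resets = 0; // cheap path taken (~1-10 ms in WebLLM)

  acquire(responseFormatKey: string): MockMatcher {
    if (responseFormatKey === this.cacheKey && this.matcher) {
      // Same schema and a live matcher: reset in place instead of rebuilding.
      this.matcher.reset();
      this.resets++;
      return this.matcher;
    }
    // Schema changed (or first call): dispose and pay full initialization.
    this.matcher?.dispose();
    this.matcher = new MockMatcher();
    this.cacheKey = responseFormatKey;
    this.inits++;
    return this.matcher;
  }
}

const cache = new GrammarCache();
cache.acquire("schemaA"); // full init
cache.acquire("schemaA"); // cache hit: reset only
cache.acquire("schemaB"); // schema changed: full init again
// cache.inits === 2, cache.resets === 1
```

This mirrors the control flow only; the real path also rebuilds the `TokenizerInfo` and `GrammarCompiler` state elided in the listing above.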