Principle:Mlc ai Web llm Structured Output Parsing
Overview
Structured Output Parsing is the pattern for reliably extracting typed application objects from grammar-constrained LLM output. When grammar-constrained decoding guarantees that the model output conforms to a JSON Schema, and generation finishes normally (finish_reason is "stop"), the output string can be parsed directly with JSON.parse() without defensive handling for malformed JSON. This removes the traditional LLM output parsing problem, where free-form text may contain invalid JSON, mismatched brackets, trailing commas, or other syntax errors.
Description
In conventional LLM application development, parsing structured output from a language model is fragile. The model may produce text that looks like JSON but contains syntax errors, extra commentary, or unexpected structure. Developers typically resort to:
- Try/catch blocks around JSON.parse() with retry logic
- Regex-based extraction of JSON from surrounding text
- Custom parsers that attempt to fix common LLM JSON errors
- Prompting the model to "only output valid JSON" (unreliable)
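For contrast, the defensive style described above might look like the following sketch. The helper name parseLooseJson is hypothetical and not part of web-llm; it combines regex extraction, a naive trailing-comma repair, and a try/catch, and it is shown only to illustrate what unconstrained output forces on application code.

```typescript
// Hypothetical helper (not a web-llm API): defensive parsing of free-form model output.
function parseLooseJson(text: string): unknown {
  // Regex-based extraction: take the span from the first "{" to the last "}".
  const match = text.match(/\{[\s\S]*\}/);
  if (match === null) {
    return undefined;
  }
  // Naive repair: drop trailing commas before "}" or "]".
  // Note: this can also corrupt string values that happen to contain ",}".
  const repaired = match[0].replace(/,\s*([}\]])/g, "$1");
  try {
    return JSON.parse(repaired);
  } catch {
    // A real application would typically retry the request here.
    return undefined;
  }
}

const recovered = parseLooseJson('Sure! Here it is: {"a": 1,} Hope that helps.');
console.log(recovered);
```

Each step is a heuristic that can fail or silently alter the data, which is why the pattern is considered fragile.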
Structured output parsing in web-llm eliminates all of these workarounds. Because grammar-constrained decoding (via GrammarMatcher) enforces syntactic validity at the token level, the resulting string is guaranteed to:
- Be valid JSON (parseable by JSON.parse() without errors)
- Conform to the specified JSON Schema structure (correct property names, types, required fields)
- Contain no extraneous text before or after the JSON (the grammar constrains the entire output)
This guarantee transforms output parsing from a defensive programming exercise into a straightforward type cast.
The Parsing Pattern
The structured output parsing pattern has three steps:
- Send request with response_format specifying json_object + schema
- Receive response as a standard ChatCompletion object
- Parse directly: JSON.parse(response.choices[0].message.content)
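When finish_reason is "stop", no try/catch for malformed JSON is needed, and neither is regex extraction or syntactic repair. The grammar has already ensured that the output is well-formed and conforms to the schema.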
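Steps 2 and 3 of the pattern can be sketched as a small generic helper. The interfaces below are minimal local stand-ins that mirror only the fields used here, not the library's full types, and parseStructured is a hypothetical name chosen for illustration.

```typescript
// Minimal local shapes (assumptions): only the fields this sketch reads.
interface ChoiceLike {
  finish_reason: string | null;
  message: { content: string | null };
}

interface CompletionLike {
  choices: ChoiceLike[];
}

// Hypothetical helper: take the first choice and parse its content as T.
function parseStructured<T>(reply: CompletionLike): T {
  const content = reply.choices[0]?.message.content ?? null;
  if (content === null) {
    throw new Error("Response has no message content");
  }
  // The cast is only justified when the request was grammar-constrained
  // with a schema that matches T.
  return JSON.parse(content) as T;
}

// Usage with a hand-written response object standing in for a real reply.
const sampleReply: CompletionLike = {
  choices: [
    {
      finish_reason: "stop",
      message: { content: '{"name":"Alice","age":30,"is_student":false}' },
    },
  ],
};
const samplePerson = parseStructured<{ name: string; age: number; is_student: boolean }>(sampleReply);
console.log(samplePerson.name, samplePerson.age, samplePerson.is_student);
```

The type parameter is not checked at runtime; it is the schema sent with the request that makes the cast reasonable.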
Structural Tag Parsing
For structural_tag mode, the pattern is slightly different. The output contains free-form text with grammar-constrained regions delimited by tags (e.g. <tool_call>...</tool_call>). The parsing pattern is:
- Extract tag-delimited regions using regex or string matching
- Parse the JSON content within each tag region with JSON.parse()
The JSON within each tag region is guaranteed to be valid and schema-conforming, but the surrounding free-form text requires standard string processing to locate the tag boundaries.
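As an alternative to regex, the string-matching variant of this pattern can be sketched as below. The function name splitTaggedOutput and the Segment type are hypothetical, and the tag names are parameters that default to the `<tool_call>` example used above. The sketch keeps the free-form text segments alongside the parsed JSON, which can be useful when both need to be shown to the user.

```typescript
// Hypothetical result type: free-form text and parsed JSON regions, in order.
type Segment =
  | { kind: "text"; text: string }
  | { kind: "json"; value: unknown };

function splitTaggedOutput(
  content: string,
  open: string = "<tool_call>",
  close: string = "</tool_call>",
): Segment[] {
  const segments: Segment[] = [];
  let cursor = 0;
  while (cursor < content.length) {
    const start = content.indexOf(open, cursor);
    if (start === -1) break;
    const end = content.indexOf(close, start + open.length);
    // Unterminated tag (for example, truncated output): treat the rest as text.
    if (end === -1) break;
    if (start > cursor) {
      segments.push({ kind: "text", text: content.slice(cursor, start) });
    }
    // JSON.parse tolerates leading and trailing whitespace inside the tag.
    segments.push({ kind: "json", value: JSON.parse(content.slice(start + open.length, end)) });
    cursor = end + close.length;
  }
  if (cursor < content.length) {
    segments.push({ kind: "text", text: content.slice(cursor) });
  }
  return segments;
}

const demoSegments = splitTaggedOutput(
  'Let me check. <tool_call>{"name":"get_weather","arguments":{"city":"Paris"}}</tool_call> Done.',
);
console.log(demoSegments.map((s) => s.kind));
```

One limitation of plain indexOf matching: a JSON string value that itself contains the closing tag would end the region early, so this sketch assumes tag text never appears inside the payload.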
Usage
Use structured output parsing when:
- You have obtained a response from a grammar-constrained inference request (any response_format with type: "json_object" and a schema).
- You need to convert the string output into a typed JavaScript/TypeScript object for application logic.
- You want to eliminate defensive JSON repair and retry logic from your application code, keeping only a finish_reason check.
Do not use this pattern when:
- The request did not include response_format with grammar constraints -- free-form text output may not be valid JSON.
- finish_reason is "length" -- the output may be truncated partway through the JSON structure, resulting in incomplete JSON. Always check finish_reason.
Theoretical Basis
The reliability of this parsing pattern rests on a formal guarantee from the grammar-constrained decoding algorithm:
Theorem: If grammar-constrained decoding completes with finish_reason: "stop", the generated token sequence, when decoded to a string, is a member of the language defined by the grammar.
This follows from the decoding algorithm's invariants:
- Initialization: The grammar matcher starts in the grammar's initial state.
- Per-token invariant: At each step, only tokens that lead to valid partial parses are available for sampling. The bitmask enforces this.
- Termination: The grammar matcher signals completion only when the current parse state is an accepting state of the grammar.
- Concatenation: The decoded string is the concatenation of all accepted tokens, which by the invariant forms a valid parse.
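The four invariants can be illustrated with a toy decoder. This is a deliberately simplified sketch and not the GrammarMatcher implementation: the "grammar" is the two-string language {"true", "false"}, the vocabulary is a handful of made-up tokens, and the score function stands in for model logits. All names are hypothetical.

```typescript
// Toy language and vocabulary (assumptions for illustration only).
const LANGUAGE: string[] = ["true", "false"];
const VOCAB: string[] = ["t", "tr", "ue", "f", "al", "se", "x", "{", "true!"];

type FinishReason = "stop" | "length";

// A prefix is valid if some string in the language starts with it.
const isValidPrefix = (s: string): boolean => LANGUAGE.some((w) => w.startsWith(s));
// The accepting states are exactly the complete strings of the language.
const isAccepting = (s: string): boolean => LANGUAGE.includes(s);

function constrainedDecode(
  score: (token: string) => number,
  maxTokens: number,
): { text: string; finishReason: FinishReason } {
  let out = ""; // Initialization: the empty string is the initial state.
  for (let step = 0; step < maxTokens; step++) {
    // Termination: stop only in an accepting state.
    if (isAccepting(out)) return { text: out, finishReason: "stop" };
    // Per-token invariant: mask out tokens that would leave the language.
    const allowed = VOCAB.filter((tok) => isValidPrefix(out + tok));
    let bestTok: string | null = null;
    let bestScore = -Infinity;
    for (const tok of allowed) {
      const s = score(tok);
      if (s > bestScore) {
        bestScore = s;
        bestTok = tok;
      }
    }
    if (bestTok === null) throw new Error("no valid continuation");
    out += bestTok; // Concatenation: the output stays a valid prefix.
  }
  return { text: out, finishReason: isAccepting(out) ? "stop" : "length" };
}

// This "model" prefers invalid tokens, but the mask never lets them through.
const prefs: Record<string, number> = { x: 10, "true!": 9, f: 5, al: 4, se: 3 };
const toyScore = (tok: string): number => prefs[tok] ?? 0;

console.log(constrainedDecode(toyScore, 8)); // { text: "false", finishReason: "stop" }
console.log(constrainedDecode(toyScore, 2)); // { text: "fal", finishReason: "length" }
```

With a budget of two tokens the output is "fal": a valid prefix of the language but not a member of it, which is exactly the situation the finish_reason check guards against. In this toy language an accepting state has no further continuations; a real grammar may allow both stopping and continuing from the same state.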
For JSON Schema grammars specifically, this means:
- Every property name and string value is properly quoted
- Every numeric value has valid syntax
- Boolean values are exactly true or false
- Arrays and objects have matching brackets/braces
- Required properties are present
- Enum values match one of the specified options
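TypeScript does not check a type annotation on JSON.parse() output at runtime, so a test suite may still want to assert the properties listed above explicitly. The type guard below is an optional sketch for the PersonRecord shape used in the usage examples; the name isPersonRecord is hypothetical, and the check is meant for tests or development builds rather than as a required step of the pattern.

```typescript
interface PersonRecord {
  name: string;
  age: number;
  is_student: boolean;
}

// Hypothetical runtime check mirroring the schema: required fields and their types.
function isPersonRecord(value: unknown): value is PersonRecord {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const v = value as Record<string, unknown>;
  const age = v["age"];
  return (
    typeof v["name"] === "string" &&
    typeof age === "number" &&
    Number.isInteger(age) && // the schema declares age as "integer"
    typeof v["is_student"] === "boolean"
  );
}

const candidate: unknown = JSON.parse('{"name":"Alice","age":30,"is_student":false}');
if (isPersonRecord(candidate)) {
  console.log(`${candidate.name} is ${candidate.age}`);
}
```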
Caveat: finish_reason
The guarantee holds only when finish_reason is "stop" (natural grammar completion). If finish_reason is "length" (hit max_tokens or context window limit), the output may be a valid prefix of the grammar but not a complete parse. In this case, JSON.parse() may fail.
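One way to encode this caveat in application code is to return a tagged result instead of parsing unconditionally. The sketch below uses a minimal local ChoiceLike shape and a hypothetical helper named parseIfComplete; neither is part of the web-llm API.

```typescript
// Minimal local shape (assumption): only the fields this sketch reads.
interface ChoiceLike {
  finish_reason: string | null;
  message: { content: string | null };
}

type ParseOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; reason: string };

// Parse only when the grammar reached an accepting state.
function parseIfComplete<T>(choice: ChoiceLike): ParseOutcome<T> {
  if (choice.finish_reason !== "stop") {
    return { ok: false, reason: `finish_reason was ${String(choice.finish_reason)}` };
  }
  if (choice.message.content === null) {
    return { ok: false, reason: "message content is empty" };
  }
  return { ok: true, value: JSON.parse(choice.message.content) as T };
}

const truncated = parseIfComplete<{ a: number }>({
  finish_reason: "length",
  message: { content: '{"a": ' },
});
console.log(truncated.ok ? truncated.value : truncated.reason);
```

The caller can then decide whether to retry with a larger max_tokens or surface the failure, without wrapping JSON.parse() in a try/catch.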
Performance Profiling
The usage statistics in the response include grammar-specific metrics:
- usage.extra.grammar_init_s -- Time spent compiling the grammar (seconds). Cached across requests with the same schema.
- usage.extra.grammar_per_token_s -- Average per-token time for bitmask computation and token acceptance (seconds).
These metrics allow developers to profile the overhead of grammar-constrained decoding and make informed decisions about schema complexity.
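A rough overhead estimate can be derived from these two metrics. The sketch below assumes that multiplying the per-token average by the number of completion tokens approximates the total per-token grammar cost; that is an assumption of this example, not a documented formula. GrammarUsageExtra is a local stand-in for the relevant fields, and estimateGrammarOverhead is a hypothetical helper.

```typescript
// Local stand-in (assumption) for the fields of usage.extra used here.
interface GrammarUsageExtra {
  e2e_latency_s: number;
  grammar_init_s?: number;
  grammar_per_token_s?: number;
}

// Estimate total grammar time and its share of end-to-end latency.
function estimateGrammarOverhead(
  completionTokens: number,
  extra: GrammarUsageExtra,
): { totalS: number; fraction: number } {
  const initS = extra.grammar_init_s ?? 0;
  const perTokenS = extra.grammar_per_token_s ?? 0;
  const totalS = initS + perTokenS * completionTokens;
  const fraction = extra.e2e_latency_s > 0 ? totalS / extra.e2e_latency_s : 0;
  return { totalS, fraction };
}

// Made-up numbers for illustration, not measurements.
const overhead = estimateGrammarOverhead(100, {
  e2e_latency_s: 2.5,
  grammar_init_s: 0.05,
  grammar_per_token_s: 0.002,
});
console.log(`grammar time ~${overhead.totalS.toFixed(3)} s (${(overhead.fraction * 100).toFixed(1)}% of latency)`);
```

Because grammar compilation is cached per schema, the init term should matter mainly on the first request with a given schema.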
Usage Examples
Direct JSON Parsing After Constrained Inference
import * as webllm from "@mlc-ai/web-llm";

// Define the expected TypeScript type
interface PersonRecord {
  name: string;
  age: number;
  is_student: boolean;
}

const engine = await webllm.CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");

// The schema is passed to the engine as a JSON string
const schema = JSON.stringify({
  type: "object",
  properties: {
    name: { type: "string" },
    age: { type: "integer" },
    is_student: { type: "boolean" },
  },
  required: ["name", "age", "is_student"],
});

const reply = await engine.chat.completions.create({
  stream: false,
  messages: [
    {
      role: "user",
      content:
        "Generate a JSON object for a person named Alice who is 30 and not a student.",
    },
  ],
  max_tokens: 128,
  response_format: {
    type: "json_object",
    schema: schema,
  } as webllm.ResponseFormat,
});

// Grammar guarantee: JSON.parse() succeeds when finish_reason is "stop"
const content = reply.choices[0].message.content!;
const person: PersonRecord = JSON.parse(content);

// Use typed object directly in application logic.
// The schema fixes the field types; the values depend on the model's answer.
console.log(`Name: ${person.name}`); // e.g. "Alice"
console.log(`Age: ${person.age}`); // e.g. 30
console.log(`Student: ${person.is_student}`); // e.g. false
Checking finish_reason Before Parsing
const reply = await engine.chat.completions.create({
  stream: false,
  messages: [
    { role: "user", content: "Generate a complex nested JSON object." },
  ],
  max_tokens: 64, // may be too short for the full JSON
  response_format: {
    type: "json_object",
    schema: myComplexSchema,
  } as webllm.ResponseFormat,
});

const choice = reply.choices[0];
if (choice.finish_reason === "stop") {
  // Grammar completed successfully -- safe to parse
  const result = JSON.parse(choice.message.content!);
  processResult(result);
} else if (choice.finish_reason === "length") {
  // Output was truncated -- JSON may be incomplete
  console.warn("Output truncated. Increase max_tokens or simplify schema.");
}
Parsing Structural Tag Output
import * as webllm from "@mlc-ai/web-llm";

type ToolInvocation = {
  name: string;
  arguments: Record<string, unknown>;
};

function parseToolCallBlocks(content: string): ToolInvocation[] {
  const regex = /<tool_call>\s*({[\s\S]*?})\s*<\/tool_call>/g;
  const calls: ToolInvocation[] = [];
  let match: RegExpExecArray | null;
  while ((match = regex.exec(content)) !== null) {
    // JSON within <tool_call> tags is guaranteed valid by grammar
    const payload = JSON.parse(match[1]);
    calls.push({ name: payload.name, arguments: payload.arguments });
  }
  return calls;
}

// After obtaining a structural_tag response:
const reply = await engine.chat.completions.create({
  stream: false,
  messages: [...],
  max_tokens: 1024,
  response_format: {
    type: "structural_tag",
    structural_tag: mcpStructuralTag,
  },
});

const content = reply.choices[0].message.content!;
const toolCalls = parseToolCallBlocks(content);
for (const call of toolCalls) {
  console.log(`Tool: ${call.name}, Args:`, call.arguments);
}
Accessing Performance Metrics
const reply = await engine.chat.completions.create({
  stream: false,
  messages: [{ role: "user", content: "Generate person info in JSON." }],
  max_tokens: 128,
  response_format: {
    type: "json_object",
    schema: personSchema,
  } as webllm.ResponseFormat,
});

// Performance metrics for grammar-constrained decoding
const usage = reply.usage;
console.log("Prompt tokens:", usage?.prompt_tokens);
console.log("Completion tokens:", usage?.completion_tokens);

const extra = usage?.extra;
if (extra) {
  console.log("Grammar init (s):", extra.grammar_init_s);
  console.log("Grammar per-token (s):", extra.grammar_per_token_s);
  console.log("End-to-end latency (s):", extra.e2e_latency_s);
  console.log("Prefill tokens/s:", extra.prefill_tokens_per_s);
  console.log("Decode tokens/s:", extra.decode_tokens_per_s);
}
Related Pages
- Implementation: JSON Parse Output -- Implementation:Mlc_ai_Web_llm_JSON_Parse_Output
- Principle: Schema Definition -- Defines the schemas that make guaranteed parsing possible
- Principle: Grammar-Constrained Decoding -- The algorithm that provides the syntactic validity guarantee
- Implementation: Response Format -- The interface through which schemas are specified