Implementation:Mlc_ai_Web_llm_JSON_Parse_Output
Overview
JSON Parse Output is a user-code pattern (not a library API) for parsing guaranteed-valid JSON output from grammar-constrained LLM inference in @mlc-ai/web-llm. After grammar-constrained inference with response_format: { type: "json_object", schema: "..." }, the response.choices[0].message.content string is guaranteed to be valid JSON conforming to the specified schema. Users call JSON.parse() directly and cast to their TypeScript types without error handling for malformed JSON.
Description
This is a pattern document describing how application code consumes the output of grammar-constrained decoding. Unlike the other implementation pages in this wiki, this does not describe a library API -- it describes the canonical user-code pattern that leverages the guarantees provided by the ResponseFormat interface and GrammarMatcher decoding.
The Core Pattern
The pattern has three phases:
Phase 1: Define schema and send request
- Define a JSON Schema string describing the desired output structure.
- Optionally define a TypeScript interface matching the schema for type safety.
- Send a ChatCompletionRequest with response_format: { type: "json_object", schema }.
Phase 2: Receive and validate response metadata
- Await the ChatCompletion response.
- Check choices[0].finish_reason -- the grammar guarantee only holds when finish_reason === "stop".
- If finish_reason === "length", the output may be truncated and parsing may fail.
Phase 3: Parse and use
- Call JSON.parse(choices[0].message.content!) directly.
- Cast to the TypeScript type: const result = JSON.parse(content) as MyType.
- Use the typed object in application logic.
Why No Error Handling Is Needed
Traditional LLM output parsing requires defensive coding:
// Traditional pattern -- NOT needed with grammar-constrained decoding
let result;
try {
result = JSON.parse(response.choices[0].message.content!);
} catch (e) {
// Retry, attempt to fix, or give up
console.error("LLM produced invalid JSON:", e);
}
With grammar-constrained decoding, the GrammarMatcher has already verified every token against the grammar. The output string is guaranteed valid JSON matching the schema. The JSON.parse() call cannot fail (assuming finish_reason === "stop").
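The resulting pattern can be captured in a small helper. This is a sketch, not a web-llm API: StructuredChoice and parseStructured are illustrative names, and the only guard it keeps is the truncation check on finish_reason.

```typescript
// Minimal shape of the response choice we care about (illustrative; mirrors
// the OpenAI-compatible structure used by web-llm).
interface StructuredChoice {
  message: { content: string | null };
  finish_reason: string | null;
}

// Parse grammar-constrained output without try/catch around JSON.parse.
// The only failure mode worth guarding against is truncation.
function parseStructured<T>(choice: StructuredChoice): T {
  if (choice.finish_reason !== "stop") {
    throw new Error(`Output may be truncated: finish_reason=${choice.finish_reason}`);
  }
  // Guaranteed-valid JSON when grammar-constrained decoding completed normally.
  return JSON.parse(choice.message.content!) as T;
}
```

Application code then reduces to a single typed call, e.g. const person = parseStructured<Person>(reply.choices[0]).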
Structural Tag Variant
For structural_tag mode, the output is not pure JSON. It contains free-form text with grammar-constrained regions delimited by tags. The parsing pattern uses regex or string matching to extract the tag-delimited regions, then applies JSON.parse() to each region's content:
// Structural tag parsing pattern
const regex = /<tool_call>\s*({[\s\S]*?})\s*<\/tool_call>/g;
let match;
while ((match = regex.exec(content)) !== null) {
const payload = JSON.parse(match[1]); // guaranteed valid by grammar
// process payload...
}
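To make the mixed output concrete, here is the same regex applied to a hand-written sample string; the surrounding prose and the tool name are invented for illustration, only the JSON inside the tags stands in for the grammar-constrained region:

```typescript
// A sample of what structural_tag output looks like: free-form text with
// tag-delimited, grammar-constrained JSON regions.
const sampleContent =
  "Let me look that up.\n" +
  '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Paris"}}\n</tool_call>\n' +
  "Done.";

const tagRegex = /<tool_call>\s*({[\s\S]*?})\s*<\/tool_call>/g;
const payloads: Array<{ name: string; arguments: unknown }> = [];
let m: RegExpExecArray | null;
while ((m = tagRegex.exec(sampleContent)) !== null) {
  // The JSON between the tags is what the grammar constrained.
  payloads.push(JSON.parse(m[1]));
}
console.log(payloads.length); // 1
```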
Code Reference
Response Content Location
The parsed content comes from the standard OpenAI-compatible response structure:
Source: src/openai_api_protocols/chat_completion.ts
// ChatCompletion response structure (simplified)
interface ChatCompletion {
choices: Array<{
message: {
content: string | null; // <-- the grammar-constrained output
role: "assistant";
};
finish_reason: "stop" | "length" | "tool_calls" | "abort" | null;
index: number;
}>;
usage?: {
prompt_tokens: number;
completion_tokens: number;
total_tokens: number;
extra?: {
// Standard performance metrics
e2e_latency_s: number;
prefill_tokens_per_s: number;
decode_tokens_per_s: number;
time_to_first_token_s: number;
time_per_output_token_s: number;
// Grammar-specific metrics (present when response_format is set)
grammar_init_s?: number;
grammar_per_token_s?: number;
};
};
}
Performance Metrics
Source: src/openai_api_protocols/chat_completion.ts, lines 1006-1015
// Available in usage.extra when grammar-constrained decoding is used:
/**
* Seconds spent on initializing grammar matcher for structured output.
* If n > 1, it is the sum over all choices.
*/
grammar_init_s?: number;
/**
* Seconds per-token that grammar matcher spent on creating bitmask
* and accepting token for structured output.
* If n > 1, it is the average over all choices.
*/
grammar_per_token_s?: number;
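A small helper can surface these metrics after a request. This is an illustrative sketch, not a library API; only the field names come from the protocol above.

```typescript
// Illustrative shape of the grammar fields in usage.extra.
interface GrammarUsageExtra {
  grammar_init_s?: number;
  grammar_per_token_s?: number;
}

// Summarize grammar overhead, tolerating responses where
// grammar-constrained decoding was not used.
function summarizeGrammarOverhead(extra?: GrammarUsageExtra): string {
  if (!extra || extra.grammar_init_s === undefined) {
    // response_format was likely not set on the request
    return "no grammar metrics";
  }
  const perTokenMs = (extra.grammar_per_token_s ?? 0) * 1000;
  return `grammar init ${extra.grammar_init_s.toFixed(3)}s, ${perTokenMs.toFixed(3)}ms/token`;
}
```

Typical use after a completion: console.log(summarizeGrammarOverhead(reply.usage?.extra)).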
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | ChatCompletion response | The response object from engine.chat.completions.create() or engine.chatCompletion() |
| Output | Parsed JavaScript object | The result of JSON.parse(response.choices[0].message.content!), cast to the application's TypeScript type |
Preconditions
| Condition | Rationale |
|---|---|
| Request included response_format with type: "json_object" and schema | Grammar-constrained decoding must have been active for the parsing guarantee to hold |
| finish_reason === "stop" | If "length", the output may be truncated and JSON.parse() may fail |
| choices[0].message.content is not null | Content should always be present for non-tool-call completions |
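For defensive codebases, the preconditions above can be checked with a diagnostic guard before parsing. This is an illustrative sketch, not a library API; an empty result means JSON.parse is safe to call under the grammar guarantee.

```typescript
// Minimal response-choice shape for precondition checking (illustrative).
interface ChoiceLike {
  message: { content: string | null };
  finish_reason: string | null;
}

// Return the list of violated preconditions from the table above.
function checkParsePreconditions(choice: ChoiceLike, grammarWasActive: boolean): string[] {
  const problems: string[] = [];
  if (!grammarWasActive) {
    problems.push('request did not use response_format { type: "json_object", schema }');
  }
  if (choice.finish_reason !== "stop") {
    problems.push(`finish_reason is ${JSON.stringify(choice.finish_reason)}; output may be truncated`);
  }
  if (choice.message.content === null) {
    problems.push("message.content is null");
  }
  return problems;
}
```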
Usage Examples
Example 1: Complete End-to-End Flow with Type Safety
import * as webllm from "@mlc-ai/web-llm";
import { Type, Static } from "@sinclair/typebox";
// Step 1: Define schema using TypeBox for type safety
const PersonSchema = Type.Object({
name: Type.String(),
house: Type.Enum({
Gryffindor: "Gryffindor",
Hufflepuff: "Hufflepuff",
Ravenclaw: "Ravenclaw",
Slytherin: "Slytherin",
}),
blood_status: Type.Enum({
"Pure-blood": "Pure-blood",
"Half-blood": "Half-blood",
"Muggle-born": "Muggle-born",
}),
wand: Type.Object({
wood: Type.String(),
core: Type.String(),
length: Type.Number(),
}),
alive: Type.Boolean(),
patronus: Type.String(),
});
type Person = Static<typeof PersonSchema>;
const schemaString = JSON.stringify(PersonSchema);
// Step 2: Create engine and send request
const engine = await webllm.CreateMLCEngine("Llama-3.2-3B-Instruct-q4f16_1-MLC");
const reply = await engine.chat.completions.create({
stream: false,
messages: [
{
role: "user",
content:
"Hermione Granger is a character in Harry Potter. " +
"Fill in the following information about this character in JSON format. " +
"Name, house, blood status, occupation, wand details, alive status, patronus.",
},
],
max_tokens: 256,
response_format: {
type: "json_object",
schema: schemaString,
} as webllm.ResponseFormat,
});
// Step 3: Parse with full type safety -- no try/catch needed
const choice = reply.choices[0];
if (choice.finish_reason === "stop") {
const person: Person = JSON.parse(choice.message.content!);
// Use the fully typed object
console.log(`${person.name} is from ${person.house}`);
console.log(`Blood status: ${person.blood_status}`);
console.log(`Wand: ${person.wand.wood} wood, ${person.wand.core} core, ${person.wand.length} inches`);
console.log(`Alive: ${person.alive}`);
console.log(`Patronus: ${person.patronus}`);
}
// Step 4: Inspect performance metrics
const extra = reply.usage?.extra;
console.log("Grammar init:", extra?.grammar_init_s, "s");
console.log("Grammar per-token:", extra?.grammar_per_token_s, "s");
Example 2: Function Calling with Schema-Constrained Arguments
import * as webllm from "@mlc-ai/web-llm";
import { Type, Static } from "@sinclair/typebox";
// Define a schema for tool call output
const ToolCallSchema = Type.Object({
tool_calls: Type.Array(
Type.Object({
arguments: Type.Any(),
name: Type.String(),
}),
),
});
type ToolCallResponse = Static<typeof ToolCallSchema>;
const schema = JSON.stringify(ToolCallSchema);
const engine = await webllm.CreateMLCEngine("Hermes-2-Pro-Llama-3-8B-q4f16_1-MLC");
const reply = await engine.chat.completions.create({
stream: false,
messages: [
{
role: "system",
content: `You are a function calling AI. Return a JSON object with tool_calls array. Schema: ${schema}`,
},
{
role: "user",
content: "What is the weather in Pittsburgh and Tokyo?",
},
],
max_tokens: 256,
response_format: {
type: "json_object",
schema: schema,
} as webllm.ResponseFormat,
});
// Parse the guaranteed-valid JSON
const result: ToolCallResponse = JSON.parse(reply.choices[0].message.content!);
// Iterate over tool calls with full type safety
for (const call of result.tool_calls) {
console.log(`Call function: ${call.name}`);
console.log(`Arguments:`, call.arguments);
}
Example 3: Structural Tag Parsing for MCP Tool Calls
import * as webllm from "@mlc-ai/web-llm";
type ToolInvocation = {
name: string;
arguments: Record<string, unknown>;
};
// Parse tool call blocks from structural tag output
function parseToolCallBlocks(content: string): ToolInvocation[] {
const regex = /<tool_call>\s*({[\s\S]*?})\s*<\/tool_call>/g;
const calls: ToolInvocation[] = [];
let match: RegExpExecArray | null;
while ((match = regex.exec(content)) !== null) {
// JSON within tags is grammar-guaranteed valid
const payload = JSON.parse(match[1]);
if (typeof payload.name === "string" && payload.arguments !== undefined) {
calls.push({ name: payload.name, arguments: payload.arguments });
}
}
return calls;
}
const engine = await webllm.CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");
const tools = [
{
name: "get_weather",
schema: {
type: "object",
properties: {
location: { type: "string" },
unit: { type: "string", enum: ["celsius", "fahrenheit"] },
},
required: ["location"],
},
},
];
const responseFormat: webllm.ResponseFormat = {
type: "structural_tag",
structural_tag: {
type: "structural_tag",
format: {
type: "triggered_tags",
triggers: ["<tool_call>"],
tags: tools.map((tool) => ({
begin: `<tool_call>\n{"name": "${tool.name}", "arguments": `,
content: { type: "json_schema", json_schema: tool.schema },
end: "}\n</tool_call>",
})),
at_least_one: true,
stop_after_first: false,
},
},
};
const reply = await engine.chat.completions.create({
stream: false,
messages: [
{ role: "system", content: "Use tools via <tool_call> blocks." },
{ role: "user", content: "What is the weather in Paris?" },
],
max_tokens: 512,
response_format: responseFormat,
});
const content = reply.choices[0].message.content!;
const toolCalls = parseToolCallBlocks(content);
for (const call of toolCalls) {
console.log(`Tool: ${call.name}`);
console.log(`Arguments:`, call.arguments);
// Execute the tool with the guaranteed-valid arguments...
}
Example 4: Batch Processing with Consistent Schema
import * as webllm from "@mlc-ai/web-llm";
interface SentimentResult {
sentiment: "positive" | "negative" | "neutral";
confidence: number;
keywords: string[];
}
const schema = JSON.stringify({
type: "object",
properties: {
sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
confidence: { type: "number" },
keywords: { type: "array", items: { type: "string" } },
},
required: ["sentiment", "confidence", "keywords"],
});
const engine = await webllm.CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");
const texts = [
"The product exceeded my expectations! Excellent quality.",
"Terrible service, would not recommend to anyone.",
"The item arrived on time and works as described.",
];
const results: SentimentResult[] = [];
for (const text of texts) {
const reply = await engine.chat.completions.create({
stream: false,
messages: [
{
role: "user",
content: `Analyze the sentiment of the following text and return the result as JSON:\n\n"${text}"`,
},
],
max_tokens: 128,
response_format: {
type: "json_object",
schema: schema,
} as webllm.ResponseFormat,
});
// Grammar matcher is cached after the first request (same schema),
// so subsequent requests only pay reset cost, not compilation cost
const result: SentimentResult = JSON.parse(reply.choices[0].message.content!);
results.push(result);
// First request shows full grammar_init_s; subsequent show near-zero
console.log("Grammar init:", reply.usage?.extra?.grammar_init_s, "s");
}
console.log("Results:", results);
Related Pages
- Principle: Structured Output Parsing -- Principle:Mlc_ai_Web_llm_Structured_Output_Parsing
- Implementation: Response Format -- The interface used to specify schemas in requests
- Implementation: Grammar Matcher Decoding -- The decoding mechanism that provides the parsing guarantee
- Principle: Schema Definition -- The upstream principle for defining output structure constraints
- Heuristic: Grammar Matcher Reuse -- Heuristic:Mlc_ai_Web_llm_Grammar_Matcher_Reuse