Principle:Mlc ai Web llm Structured Output Parsing

Overview

Structured Output Parsing is the pattern for reliably extracting typed application objects from grammar-constrained LLM output. When grammar-constrained decoding guarantees that the model output conforms to a JSON Schema, the output string can be parsed directly with JSON.parse(), with no handling for malformed JSON, provided generation finished normally (finish_reason: "stop"). This removes the traditional LLM output parsing problem, where free-form text may contain invalid JSON, mismatched brackets, trailing commas, or other syntax errors.

Description

In conventional LLM application development, parsing structured output from a language model is fragile. The model may produce text that looks like JSON but contains syntax errors, extra commentary, or unexpected structure. Developers typically resort to:

Try/catch blocks around JSON.parse() with retry logic
Regex-based extraction of JSON from surrounding text
Custom parsers that attempt to fix common LLM JSON errors
Prompting the model to "only output valid JSON" (unreliable)

Structured output parsing in web-llm eliminates all of these workarounds. Because grammar-constrained decoding (via GrammarMatcher) enforces syntactic validity at the token level, the resulting string is guaranteed to:

Be valid JSON (parseable by JSON.parse() without errors)
Conform to the specified JSON Schema structure (correct property names, types, required fields)
Contain no extraneous text before or after the JSON (the grammar constrains the entire output)

This guarantee transforms output parsing from a defensive programming exercise into a straightforward type cast.
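
To make the contrast concrete, here is a minimal sketch of the kind of defensive helper that unconstrained output tends to require, next to the direct parse that grammar-constrained output allows. The function names parseUnconstrained and parseConstrained and the sample strings are hypothetical and are not part of web-llm.

```typescript
// Hypothetical helpers for illustration; neither is part of web-llm.

// Defensive path: the model may wrap JSON in commentary or emit broken syntax.
function parseUnconstrained(text: string): unknown {
  // Take the outermost {...} span, ignoring any surrounding prose.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start === -1 || end <= start) return undefined;
  try {
    return JSON.parse(text.slice(start, end + 1));
  } catch {
    return undefined; // the caller has to retry or fall back
  }
}

// Constrained path: the whole string is the JSON document.
function parseConstrained<T>(text: string): T {
  return JSON.parse(text) as T;
}

const chatty = 'Sure! Here is the JSON: {"name": "Alice", "age": 30} Hope that helps.';
const broken = '{"name": "Alice", "age": 30,}'; // trailing comma is invalid JSON
const constrained = '{"name": "Alice", "age": 30}';

console.log(parseUnconstrained(chatty)); // recovered by span extraction
console.log(parseUnconstrained(broken)); // undefined: extraction cannot repair syntax
console.log(parseConstrained<{ name: string; age: number }>(constrained).age);
```

The defensive helper can recover JSON from surrounding prose but still has a failure branch for every call site; the constrained path has none as long as generation completed.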

The Parsing Pattern

The structured output parsing pattern has three steps:

Send request with response_format specifying json_object + schema
Receive response as a standard ChatCompletion object
Parse directly: JSON.parse(response.choices[0].message.content)

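Provided finish_reason is "stop" (see Caveat: finish_reason), no try/catch, regex extraction, or syntactic validation is needed: the grammar has already ensured that the string is well-formed JSON matching the schema. Note that JSON.parse() returns an untyped value, so the TypeScript type you assign is an unchecked cast that relies on the type mirroring the schema.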
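
Steps 2 and 3 can be wrapped in a small generic helper. The following is a sketch under stated assumptions: CompletionLike is a hand-written structural type covering only the fields read here (not the library's full ChatCompletion type), and parseStructured is a hypothetical name. It refuses to parse unless finish_reason is "stop", in line with the caveat described later on this page.

```typescript
// Minimal structural type with only the fields this sketch reads.
// An assumption for illustration, not the library's full type.
interface CompletionLike {
  choices: Array<{
    finish_reason: string | null;
    message: { content: string | null };
  }>;
}

// Hypothetical helper: parse the first choice into T.
function parseStructured<T>(completion: CompletionLike): T {
  const choice = completion.choices[0];
  if (!choice || choice.finish_reason !== "stop" || choice.message.content === null) {
    throw new Error(`Cannot parse: finish_reason=${choice?.finish_reason}`);
  }
  // JSON.parse returns `any`; this cast is not checked at runtime and relies
  // on the grammar having enforced the schema that T was written to match.
  return JSON.parse(choice.message.content) as T;
}

// Usage with a hand-written stand-in for a real response.
interface PersonRecord {
  name: string;
  age: number;
  is_student: boolean;
}

const fakeReply: CompletionLike = {
  choices: [
    {
      finish_reason: "stop",
      message: { content: '{"name":"Alice","age":30,"is_student":false}' },
    },
  ],
};

const person = parseStructured<PersonRecord>(fakeReply);
console.log(person.name, person.age, person.is_student);
```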

Structural Tag Parsing

For structural_tag mode, the pattern is slightly different. The output contains free-form text with grammar-constrained regions delimited by tags (e.g. <tool_call>...</tool_call>). The parsing pattern is:

Extract tag-delimited regions using regex or string matching
Parse the JSON content within each tag region with JSON.parse()

The JSON within each tag region is guaranteed to be valid and schema-conforming, but the surrounding free-form text requires standard string processing to locate the tag boundaries.

Usage

Use structured output parsing when:

You have obtained a response from a grammar-constrained inference request (any response_format with type: "json_object" and a schema).
You need to convert the string output into a typed JavaScript/TypeScript object for application logic.
You want to remove malformed-JSON handling (retries, regex extraction, repair parsers) from your application code, keeping only a finish_reason check.

Do not use this pattern when:

The request did not include response_format with grammar constraints -- free-form text output may not be valid JSON.
finish_reason is "length" -- generation stopped before the grammar reached an accepting state, so the output may be cut off partway through the JSON document. Always check finish_reason.

Theoretical Basis

The reliability of this parsing pattern rests on a formal guarantee from the grammar-constrained decoding algorithm:

Theorem: If grammar-constrained decoding completes with finish_reason: "stop", the generated token sequence, when decoded to a string, is a member of the language defined by the grammar.

This follows from the decoding algorithm's invariants:

Initialization: The grammar matcher starts in the grammar's initial state.
Per-token invariant: At each step, only tokens that lead to valid partial parses are available for sampling. The bitmask enforces this.
Termination: The grammar matcher signals completion only when the current parse state is an accepting state of the grammar.
Concatenation: The decoded string is the concatenation of all accepted tokens, which by the invariant forms a valid parse.
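
The four invariants can be seen in a toy decoder. This is an illustration only and not how GrammarMatcher is implemented: the "grammar" here is a finite set of accepted strings, the mask is a plain boolean array over a tiny made-up vocabulary, and the "model" is any function that picks one of the allowed tokens. All names are hypothetical.

```typescript
// Toy illustration: a finite language stands in for a compiled grammar.
const LANGUAGE = ['{"ok":true}', '{"ok":false}'];
const VOCAB = ["{", '{"ok":', '"ok"', ":", "true", "false", "tr", "ue", "fal", "se", "}", "hello", "["];

// Per-token invariant: a token is allowed only if prefix + token is still a
// prefix of some string in the language.
function tokenMask(prefix: string): boolean[] {
  return VOCAB.map((tok) => LANGUAGE.some((s) => s.startsWith(prefix + tok)));
}

// Termination: the prefix is itself a complete member of the language.
function isAccepting(prefix: string): boolean {
  return LANGUAGE.includes(prefix);
}

function constrainedDecode(pick: (allowed: string[]) => string): string {
  let out = ""; // Initialization: the empty prefix is the initial state.
  while (!isAccepting(out)) {
    const mask = tokenMask(out);
    const allowed = VOCAB.filter((_, i) => mask[i]);
    if (allowed.length === 0) {
      throw new Error("dead end: vocabulary cannot extend this prefix");
    }
    out += pick(allowed); // Concatenation of accepted tokens.
  }
  return out;
}

// Even a "model" that picks at random can only produce a member of the language.
const randomPick = (allowed: string[]): string =>
  allowed[Math.floor(Math.random() * allowed.length)];

for (let i = 0; i < 5; i++) {
  const text = constrainedDecode(randomPick);
  console.log(text, JSON.parse(text));
}
```

Tokens such as "hello" and "[" are never offered to the sampler, so no choice the model makes can leave the language; that is the property the theorem states for real grammars.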

For JSON Schema grammars specifically, this means:

Every property name and string value is properly quoted
Every numeric value has valid syntax
Boolean values are exactly true or false
Arrays and objects have matching brackets/braces
Required properties are present
Enum values match one of the specified options

Caveat: finish_reason

The guarantee holds only when finish_reason is "stop" (natural grammar completion). If finish_reason is "length" (hit max_tokens or context window limit), the output may be a valid prefix of the grammar but not a complete parse. In this case, JSON.parse() may fail.
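
A short demonstration of why the caveat matters: a prefix of a valid document is usually not a valid document. The strings below are made up, and tryParse is a hypothetical helper that reports failure as a value instead of throwing.

```typescript
// A complete output, and a prefix of it such as a max_tokens cutoff might leave.
const complete = '{"name":"Alice","age":30,"is_student":false}';
const truncated = complete.slice(0, 20); // '{"name":"Alice","age'

// Hypothetical helper: returns a tagged result instead of throwing.
type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string };

function tryParse<T>(text: string): ParseResult<T> {
  try {
    return { ok: true, value: JSON.parse(text) as T };
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e.message : String(e) };
  }
}

console.log(tryParse(complete).ok); // true
console.log(tryParse(truncated).ok); // false
```

A helper like this is only needed on the "length" path; on the "stop" path the direct JSON.parse() call is sufficient.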

Performance Profiling

The usage statistics in the response include grammar-specific metrics:

usage.extra.grammar_init_s -- Time spent compiling the grammar (seconds). Cached across requests with the same schema.
usage.extra.grammar_per_token_s -- Average per-token time for bitmask computation and token acceptance (seconds).

These metrics allow developers to profile the overhead of grammar-constrained decoding and make informed decisions about schema complexity.
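
One way to use these numbers is to estimate how much of the request time went to grammar work. The sketch below assumes that grammar_per_token_s is an average over all completion tokens and that e2e_latency_s includes the grammar time; both are assumptions about the metrics, and GrammarUsage and grammarOverhead are hypothetical names. The sample figures are invented to show the arithmetic.

```typescript
// Subset of the usage fields read on this page; an illustrative type, not the
// library's full definition.
interface GrammarUsage {
  completion_tokens: number;
  extra: {
    grammar_init_s: number;
    grammar_per_token_s: number;
    e2e_latency_s: number;
  };
}

// Hypothetical helper: total grammar time and its share of end-to-end latency.
function grammarOverhead(usage: GrammarUsage): { totalS: number; share: number } {
  const { grammar_init_s, grammar_per_token_s, e2e_latency_s } = usage.extra;
  const totalS = grammar_init_s + grammar_per_token_s * usage.completion_tokens;
  const share = e2e_latency_s > 0 ? totalS / e2e_latency_s : 0;
  return { totalS, share };
}

// Invented numbers, not measurements.
const sample: GrammarUsage = {
  completion_tokens: 40,
  extra: { grammar_init_s: 0.05, grammar_per_token_s: 0.001, e2e_latency_s: 2.0 },
};

const { totalS, share } = grammarOverhead(sample);
console.log(`grammar time: ${totalS.toFixed(3)} s, share: ${(share * 100).toFixed(1)}%`);
```

If the share is large on the first request and small afterwards, the cost is mostly grammar compilation, which the cache absorbs; a persistently high per-token share points at schema complexity instead.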

Usage Examples

Direct JSON Parsing After Constrained Inference

import * as webllm from "@mlc-ai/web-llm";

// Define the expected TypeScript type
interface PersonRecord {
  name: string;
  age: number;
  is_student: boolean;
}

const engine = await webllm.CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");

const schema = JSON.stringify({
  type: "object",
  properties: {
    name: { type: "string" },
    age: { type: "integer" },
    is_student: { type: "boolean" },
  },
  required: ["name", "age", "is_student"],
});

const reply = await engine.chat.completions.create({
  stream: false,
  messages: [
    {
      role: "user",
      content:
        "Generate a JSON object for a person named Alice who is 30 and not a student.",
    },
  ],
  max_tokens: 128,
  response_format: {
    type: "json_object",
    schema: schema,
  } as webllm.ResponseFormat,
});

// With finish_reason "stop", the grammar guarantees JSON.parse() will succeed
const content = reply.choices[0].message.content!;
const person: PersonRecord = JSON.parse(content);

// Use typed object directly in application logic.
// The grammar fixes the shape; the values depend on the model and the prompt.
console.log(`Name: ${person.name}`);        // e.g. "Alice"
console.log(`Age: ${person.age}`);           // e.g. 30
console.log(`Student: ${person.is_student}`); // e.g. false

Checking finish_reason Before Parsing

const reply = await engine.chat.completions.create({
  stream: false,
  messages: [
    { role: "user", content: "Generate a complex nested JSON object." },
  ],
  max_tokens: 64, // may be too short for the full JSON
  response_format: {
    type: "json_object",
    schema: myComplexSchema,
  } as webllm.ResponseFormat,
});

const choice = reply.choices[0];

if (choice.finish_reason === "stop") {
  // Grammar completed successfully -- safe to parse
  const result = JSON.parse(choice.message.content!);
  processResult(result);
} else if (choice.finish_reason === "length") {
  // Output was truncated -- JSON may be incomplete
  console.warn("Output truncated. Increase max_tokens or simplify schema.");
}

Parsing Structural Tag Output

import * as webllm from "@mlc-ai/web-llm";

type ToolInvocation = {
  name: string;
  arguments: Record<string, unknown>;
};

function parseToolCallBlocks(content: string): ToolInvocation[] {
  const regex = /<tool_call>\s*({[\s\S]*?})\s*<\/tool_call>/g;
  const calls: ToolInvocation[] = [];
  let match: RegExpExecArray | null;
  while ((match = regex.exec(content)) !== null) {
    // JSON within <tool_call> tags is guaranteed valid by grammar
    const payload = JSON.parse(match[1]);
    calls.push({ name: payload.name, arguments: payload.arguments });
  }
  return calls;
}

// After obtaining a structural_tag response:
const reply = await engine.chat.completions.create({
  stream: false,
  messages: [...],
  max_tokens: 1024,
  response_format: {
    type: "structural_tag",
    structural_tag: mcpStructuralTag,
  },
});

const content = reply.choices[0].message.content!;
const toolCalls = parseToolCallBlocks(content);
for (const call of toolCalls) {
  console.log(`Tool: ${call.name}, Args:`, call.arguments);
}

Accessing Performance Metrics

const reply = await engine.chat.completions.create({
  stream: false,
  messages: [{ role: "user", content: "Generate person info in JSON." }],
  max_tokens: 128,
  response_format: {
    type: "json_object",
    schema: personSchema,
  } as webllm.ResponseFormat,
});

// Performance metrics for grammar-constrained decoding
const usage = reply.usage;
console.log("Prompt tokens:", usage?.prompt_tokens);
console.log("Completion tokens:", usage?.completion_tokens);

const extra = usage?.extra;
if (extra) {
  console.log("Grammar init (s):", extra.grammar_init_s);
  console.log("Grammar per-token (s):", extra.grammar_per_token_s);
  console.log("End-to-end latency (s):", extra.e2e_latency_s);
  console.log("Prefill tokens/s:", extra.prefill_tokens_per_s);
  console.log("Decode tokens/s:", extra.decode_tokens_per_s);
}

Related Pages

Implementation: JSON Parse Output -- Implementation:Mlc_ai_Web_llm_JSON_Parse_Output
Principle: Schema Definition -- Defines the schemas that make guaranteed parsing possible
Principle: Grammar-Constrained Decoding -- The algorithm that provides the syntactic validity guarantee
Implementation: Response Format -- The interface through which schemas are specified
