Implementation:Mlc_ai_Web_llm_JSON_Parse_Output
Overview
JSON Parse Output is a user-code pattern (not a library API) for parsing guaranteed-valid JSON output from grammar-constrained LLM inference in @mlc-ai/web-llm. After grammar-constrained inference with response_format: { type: "json_object", schema: "..." }, the response.choices[0].message.content string is guaranteed to be valid JSON conforming to the specified schema. Users call JSON.parse() directly and cast to their TypeScript types without error handling for malformed JSON.
Description
This is a pattern document describing how application code consumes the output of grammar-constrained decoding. Unlike the other implementation pages in this wiki, this does not describe a library API -- it describes the canonical user-code pattern that leverages the guarantees provided by the ResponseFormat interface and GrammarMatcher decoding.
The Core Pattern
The pattern has three phases:
Phase 1: Define schema and send request
- Define a JSON Schema string describing the desired output structure.
- Optionally define a TypeScript interface matching the schema for type safety.
- Send a ChatCompletionRequest with response_format: { type: "json_object", schema }.
Phase 2: Receive and validate response metadata
- Await the ChatCompletion response.
- Check choices[0].finish_reason -- the grammar guarantee only holds when finish_reason === "stop".
- If finish_reason === "length", the output may be truncated and parsing may fail.
Phase 3: Parse and use
- Call JSON.parse(choices[0].message.content!) directly.
- Cast to the TypeScript type: const result = JSON.parse(content) as MyType.
- Use the typed object in application logic.
Why No Error Handling Is Needed
Traditional LLM output parsing requires defensive coding:
// Traditional pattern -- NOT needed with grammar-constrained decoding
let result;
try {
result = JSON.parse(response.choices[0].message.content!);
} catch (e) {
// Retry, attempt to fix, or give up
console.error("LLM produced invalid JSON:", e);
}
With grammar-constrained decoding, the GrammarMatcher has already verified every token against the grammar. The output string is guaranteed valid JSON matching the schema. The JSON.parse() call cannot fail (assuming finish_reason === "stop").
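The resulting pattern can be captured in a small helper. This is a sketch, not a web-llm API: StructuredChoice and parseStructured are illustrative names, and the only guard it keeps is the truncation check on finish_reason.

```typescript
// Minimal shape of the response choice we care about (illustrative; mirrors
// the OpenAI-compatible structure used by web-llm).
interface StructuredChoice {
  message: { content: string | null };
  finish_reason: string | null;
}

// Parse grammar-constrained output without try/catch around JSON.parse.
// The only failure mode worth guarding against is truncation.
function parseStructured<T>(choice: StructuredChoice): T {
  if (choice.finish_reason !== "stop") {
    throw new Error(`Output may be truncated: finish_reason=${choice.finish_reason}`);
  }
  // Guaranteed-valid JSON when grammar-constrained decoding completed normally.
  return JSON.parse(choice.message.content!) as T;
}
```

Application code then reduces to a single typed call, e.g. const person = parseStructured<Person>(reply.choices[0]).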
Structural Tag Variant
For structural_tag mode, the output is not pure JSON. It contains free-form text with grammar-constrained regions delimited by tags. The parsing pattern uses regex or string matching to extract the tag-delimited regions, then applies JSON.parse() to each region's content:
// Structural tag parsing pattern
const regex = /<tool_call>\s*({[\s\S]*?})\s*<\/tool_call>/g;
let match;
while ((match = regex.exec(content)) !== null) {
const payload = JSON.parse(match[1]); // guaranteed valid by grammar
// process payload...
}
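To make the mixed output concrete, here is the same regex applied to a hand-written sample string; the surrounding prose and the tool name are invented for illustration, only the JSON inside the tags stands in for the grammar-constrained region:

```typescript
// A sample of what structural_tag output looks like: free-form text with
// tag-delimited, grammar-constrained JSON regions.
const sampleContent =
  "Let me look that up.\n" +
  '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Paris"}}\n</tool_call>\n' +
  "Done.";

const tagRegex = /<tool_call>\s*({[\s\S]*?})\s*<\/tool_call>/g;
const payloads: Array<{ name: string; arguments: unknown }> = [];
let m: RegExpExecArray | null;
while ((m = tagRegex.exec(sampleContent)) !== null) {
  // The JSON between the tags is what the grammar constrained.
  payloads.push(JSON.parse(m[1]));
}
console.log(payloads.length); // 1
```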
Code Reference
Response Content Location
The parsed content comes from the standard OpenAI-compatible response structure:
Source: src/openai_api_protocols/chat_completion.ts
// ChatCompletion response structure (simplified)
interface ChatCompletion {
choices: Array<{
message: {
content: string | null; // <-- the grammar-constrained output
role: "assistant";
};
finish_reason: "stop" | "length" | "tool_calls" | "abort" | null;
index: number;
}>;
usage?: {
prompt_tokens: number;
completion_tokens: number;
total_tokens: number;
extra?: {
// Standard performance metrics
e2e_latency_s: number;
prefill_tokens_per_s: number;
decode_tokens_per_s: number;
time_to_first_token_s: number;
time_per_output_token_s: number;
// Grammar-specific metrics (present when response_format is set)
grammar_init_s?: number;
grammar_per_token_s?: number;
};
};
}
Performance Metrics
Source: src/openai_api_protocols/chat_completion.ts, lines 1006-1015
// Available in usage.extra when grammar-constrained decoding is used:
/**
* Seconds spent on initializing grammar matcher for structured output.
* If n > 1, it is the sum over all choices.
*/
grammar_init_s?: number;
/**
* Seconds per-token that grammar matcher spent on creating bitmask
* and accepting token for structured output.
* If n > 1, it is the average over all choices.
*/
grammar_per_token_s?: number;
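A small helper can surface these metrics after a request. This is an illustrative sketch, not a library API; only the field names come from the protocol above.

```typescript
// Illustrative shape of the grammar fields in usage.extra.
interface GrammarUsageExtra {
  grammar_init_s?: number;
  grammar_per_token_s?: number;
}

// Summarize grammar overhead, tolerating responses where
// grammar-constrained decoding was not used.
function summarizeGrammarOverhead(extra?: GrammarUsageExtra): string {
  if (!extra || extra.grammar_init_s === undefined) {
    // response_format was likely not set on the request
    return "no grammar metrics";
  }
  const perTokenMs = (extra.grammar_per_token_s ?? 0) * 1000;
  return `grammar init ${extra.grammar_init_s.toFixed(3)}s, ${perTokenMs.toFixed(3)}ms/token`;
}
```

Typical use after a completion: console.log(summarizeGrammarOverhead(reply.usage?.extra)).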
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | ChatCompletion response | The response object from engine.chat.completions.create() or engine.chatCompletion() |
| Output | Parsed JavaScript object | The result of JSON.parse(response.choices[0].message.content!), cast to the application's TypeScript type |
Preconditions
| Condition | Rationale |
|---|---|
| Request included response_format with type: "json_object" and schema | Grammar-constrained decoding must have been active for the parsing guarantee to hold |
| finish_reason === "stop" | If "length", the output may be truncated and JSON.parse() may fail |
| choices[0].message.content is not null | Content should always be present for non-tool-call completions |
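For defensive codebases, the preconditions above can be checked with a diagnostic guard before parsing. This is an illustrative sketch, not a library API; an empty result means JSON.parse is safe to call under the grammar guarantee.

```typescript
// Minimal response-choice shape for precondition checking (illustrative).
interface ChoiceLike {
  message: { content: string | null };
  finish_reason: string | null;
}

// Return the list of violated preconditions from the table above.
function checkParsePreconditions(choice: ChoiceLike, grammarWasActive: boolean): string[] {
  const problems: string[] = [];
  if (!grammarWasActive) {
    problems.push('request did not use response_format { type: "json_object", schema }');
  }
  if (choice.finish_reason !== "stop") {
    problems.push(`finish_reason is ${JSON.stringify(choice.finish_reason)}; output may be truncated`);
  }
  if (choice.message.content === null) {
    problems.push("message.content is null");
  }
  return problems;
}
```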
Usage Examples
Example 1: Complete End-to-End Flow with Type Safety
import * as webllm from "@mlc-ai/web-llm";
import { Type, Static } from "@sinclair/typebox";
// Step 1: Define schema using TypeBox for type safety
const PersonSchema = Type.Object({
name: Type.String(),
house: Type.Enum({
Gryffindor: "Gryffindor",
Hufflepuff: "Hufflepuff",
Ravenclaw: "Ravenclaw",
Slytherin: "Slytherin",
}),
blood_status: Type.Enum({
"Pure-blood": "Pure-blood",
"Half-blood": "Half-blood",
"Muggle-born": "Muggle-born",
}),
wand: Type.Object({
wood: Type.String(),
core: Type.String(),
length: Type.Number(),
}),
alive: Type.Boolean(),
patronus: Type.String(),
});
type Person = Static<typeof PersonSchema>;
const schemaString = JSON.stringify(PersonSchema);
// Step 2: Create engine and send request
const engine = await webllm.CreateMLCEngine("Llama-3.2-3B-Instruct-q4f16_1-MLC");
const reply = await engine.chat.completions.create({
stream: false,
messages: [
{
role: "user",
content:
"Hermione Granger is a character in Harry Potter. " +
"Fill in the following information about this character in JSON format. " +
"Name, house, blood status, occupation, wand details, alive status, patronus.",
},
],
max_tokens: 256,
response_format: {
type: "json_object",
schema: schemaString,
} as webllm.ResponseFormat,
});
// Step 3: Parse with full type safety -- no try/catch needed
const choice = reply.choices[0];
if (choice.finish_reason === "stop") {
const person: Person = JSON.parse(choice.message.content!);
// Use the fully typed object
console.log(`${person.name} is from ${person.house}`);
console.log(`Blood status: ${person.blood_status}`);
console.log(`Wand: ${person.wand.wood} wood, ${person.wand.core} core, ${person.wand.length} inches`);
console.log(`Alive: ${person.alive}`);
console.log(`Patronus: ${person.patronus}`);
}
// Step 4: Inspect performance metrics
const extra = reply.usage?.extra;
console.log("Grammar init:", extra?.grammar_init_s, "s");
console.log("Grammar per-token:", extra?.grammar_per_token_s, "s");
Example 2: Function Calling with Schema-Constrained Arguments
import * as webllm from "@mlc-ai/web-llm";
import { Type, Static } from "@sinclair/typebox";
// Define a schema for tool call output
const ToolCallSchema = Type.Object({
tool_calls: Type.Array(
Type.Object({
arguments: Type.Any(),
name: Type.String(),
}),
),
});
type ToolCallResponse = Static<typeof ToolCallSchema>;
const schema = JSON.stringify(ToolCallSchema);
const engine = await webllm.CreateMLCEngine("Hermes-2-Pro-Llama-3-8B-q4f16_1-MLC");
const reply = await engine.chat.completions.create({
stream: false,
messages: [
{
role: "system",
content: `You are a function calling AI. Return a JSON object with tool_calls array. Schema: ${schema}`,
},
{
role: "user",
content: "What is the weather in Pittsburgh and Tokyo?",
},
],
max_tokens: 256,
response_format: {
type: "json_object",
schema: schema,
} as webllm.ResponseFormat,
});
// Parse the guaranteed-valid JSON
const result: ToolCallResponse = JSON.parse(reply.choices[0].message.content!);
// Iterate over tool calls with full type safety
for (const call of result.tool_calls) {
console.log(`Call function: ${call.name}`);
console.log(`Arguments:`, call.arguments);
}
Example 3: Structural Tag Parsing for MCP Tool Calls
import * as webllm from "@mlc-ai/web-llm";
type ToolInvocation = {
name: string;
arguments: Record<string, unknown>;
};
// Parse tool call blocks from structural tag output
function parseToolCallBlocks(content: string): ToolInvocation[] {
const regex = /<tool_call>\s*({[\s\S]*?})\s*<\/tool_call>/g;
const calls: ToolInvocation[] = [];
let match: RegExpExecArray | null;
while ((match = regex.exec(content)) !== null) {
// JSON within tags is grammar-guaranteed valid
const payload = JSON.parse(match[1]);
if (typeof payload.name === "string" && payload.arguments !== undefined) {
calls.push({ name: payload.name, arguments: payload.arguments });
}
}
return calls;
}
const engine = await webllm.CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");
const tools = [
{
name: "get_weather",
schema: {
type: "object",
properties: {
location: { type: "string" },
unit: { type: "string", enum: ["celsius", "fahrenheit"] },
},
required: ["location"],
},
},
];
const responseFormat: webllm.ResponseFormat = {
type: "structural_tag",
structural_tag: {
type: "structural_tag",
format: {
type: "triggered_tags",
triggers: ["<tool_call>"],
tags: tools.map((tool) => ({
begin: `<tool_call>\n{"name": "${tool.name}", "arguments": `,
content: { type: "json_schema", json_schema: tool.schema },
end: "}\n</tool_call>",
})),
at_least_one: true,
stop_after_first: false,
},
},
};
const reply = await engine.chat.completions.create({
stream: false,
messages: [
{ role: "system", content: "Use tools via <tool_call> blocks." },
{ role: "user", content: "What is the weather in Paris?" },
],
max_tokens: 512,
response_format: responseFormat,
});
const content = reply.choices[0].message.content!;
const toolCalls = parseToolCallBlocks(content);
for (const call of toolCalls) {
console.log(`Tool: ${call.name}`);
console.log(`Arguments:`, call.arguments);
// Execute the tool with the guaranteed-valid arguments...
}
Example 4: Batch Processing with Consistent Schema
import * as webllm from "@mlc-ai/web-llm";
interface SentimentResult {
sentiment: "positive" | "negative" | "neutral";
confidence: number;
keywords: string[];
}
const schema = JSON.stringify({
type: "object",
properties: {
sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
confidence: { type: "number" },
keywords: { type: "array", items: { type: "string" } },
},
required: ["sentiment", "confidence", "keywords"],
});
const engine = await webllm.CreateMLCEngine("Phi-3.5-mini-instruct-q4f16_1-MLC");
const texts = [
"The product exceeded my expectations! Excellent quality.",
"Terrible service, would not recommend to anyone.",
"The item arrived on time and works as described.",
];
const results: SentimentResult[] = [];
for (const text of texts) {
const reply = await engine.chat.completions.create({
stream: false,
messages: [
{
role: "user",
content: `Analyze the sentiment of the following text and return the result as JSON:\n\n"${text}"`,
},
],
max_tokens: 128,
response_format: {
type: "json_object",
schema: schema,
} as webllm.ResponseFormat,
});
// Grammar matcher is cached after the first request (same schema),
// so subsequent requests only pay reset cost, not compilation cost
const result: SentimentResult = JSON.parse(reply.choices[0].message.content!);
results.push(result);
// First request shows full grammar_init_s; subsequent show near-zero
console.log("Grammar init:", reply.usage?.extra?.grammar_init_s, "s");
}
console.log("Results:", results);
Related Pages
- Principle: Structured Output Parsing -- Principle:Mlc_ai_Web_llm_Structured_Output_Parsing
- Implementation: Response Format -- The interface used to specify schemas in requests
- Implementation: Grammar Matcher Decoding -- The decoding mechanism that provides the parsing guarantee
- Principle: Schema Definition -- The upstream principle for defining output structure constraints
- Heuristic: Grammar Matcher Reuse -- Heuristic:Mlc_ai_Web_llm_Grammar_Matcher_Reuse