Workflow: mlc-ai/web-llm Structured Output Generation

From Leeroopedia
Knowledge Sources
Domains LLMs, WebGPU, Structured_Output, Grammar_Constrained_Decoding
Last Updated 2026-02-14 22:00 GMT

Overview

End-to-end process for generating structured, schema-conforming output from an LLM in the browser using JSON mode, JSON schema constraints, or custom EBNF grammars.

Description

This workflow covers grammar-constrained decoding in web-llm, which forces the model to produce output conforming to a specified format. Three constraint modes are supported: JSON mode (any valid JSON), JSON schema (JSON conforming to a specific schema), and EBNF grammar (arbitrary context-free grammar). The constraint is enforced at the token level during decoding via the xgrammar library, which computes valid token masks and applies them before sampling. This guarantees structurally valid output without post-processing or retry loops.

Usage

Execute this workflow when you need the LLM to produce output in a predictable, parseable format. Common use cases include: extracting structured data from text, generating API-compatible JSON responses, building form-filling assistants, or any scenario where downstream code needs to parse the model's output programmatically.

Execution Steps

Step 1: Define the Output Schema

Specify the structure the model output must conform to. Web-llm supports three approaches: a JSON schema string (following the standard JSON Schema specification), a JSON mode flag for any valid JSON, or an EBNF grammar string for arbitrary structured formats. Schemas can be defined manually as JSON strings or generated programmatically using libraries like TypeBox.

Key considerations:

  • JSON Schema supports objects, arrays, strings, numbers, integers, booleans, enums, and nested types
  • TypeBox or similar libraries can generate schemas from TypeScript type definitions
  • EBNF grammars provide maximum flexibility for non-JSON formats
  • The schema is compiled into a token-level constraint mask at runtime
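A minimal sketch of defining a schema for Step 1. The field names here are illustrative, not part of the web-llm API; web-llm accepts the schema as a plain JSON string, so an object literal is serialized with JSON.stringify:

```typescript
// Illustrative JSON Schema for a contact-extraction task (the field names
// are examples, not anything web-llm requires).
const contactSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    email: { type: "string" },
    age: { type: "integer" },
  },
  required: ["name", "email"],
};

// The request's response_format expects the schema as a string.
const schemaString: string = JSON.stringify(contactSchema);
```

Libraries like TypeBox can generate the same schema object from a TypeScript type definition, which keeps the schema and your parsing code in sync.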

Step 2: Create the Engine

Initialize the MLCEngine with a model that supports grammar-constrained decoding. Most models in the web-llm registry support grammar mode. The engine setup is identical to the basic chat completion workflow, using CreateMLCEngine or CreateWebWorkerMLCEngine.

Key considerations:

  • Most prebuilt models support grammar constraints out of the box
  • The xgrammar library (@mlc-ai/web-xgrammar) is loaded automatically
  • No special engine configuration is needed for grammar support

Step 3: Configure the Constrained Request

Build a ChatCompletionRequest with the response_format field set. For JSON schema mode, set type to "json_object" and provide the schema string. For EBNF grammar mode, set type to "grammar" and provide the grammar string. The prompt should instruct the model to produce output in the desired format.

Three format modes:

  • JSON mode: response_format: { type: "json_object" } — any valid JSON
  • JSON schema: response_format: { type: "json_object", schema: schemaString } — JSON conforming to schema
  • EBNF grammar: response_format: { type: "grammar", grammar: grammarString } — matches EBNF rules

Key considerations:

  • The user prompt should mention the desired output format for best results
  • Streaming works with all constraint modes
  • Logprobs and top_logprobs are compatible with constrained decoding
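The three modes above can be sketched as request bodies. In a real application these would be typed as ChatCompletionRequest from @mlc-ai/web-llm; plain object literals are used here so the shapes are visible, and the prompts are illustrative:

```typescript
// A small schema for the schema-constrained variant.
const schemaString = JSON.stringify({
  type: "object",
  properties: { answer: { type: "string" } },
  required: ["answer"],
});

// JSON mode: any syntactically valid JSON.
const jsonModeRequest = {
  messages: [{ role: "user", content: "Reply in JSON." }],
  response_format: { type: "json_object" },
};

// JSON schema mode: JSON conforming to schemaString.
const schemaRequest = {
  messages: [{ role: "user", content: "Answer as JSON with an 'answer' field." }],
  response_format: { type: "json_object", schema: schemaString },
};

// EBNF grammar mode: output must match the grammar rules.
const grammarRequest = {
  messages: [{ role: "user", content: "Answer yes or no." }],
  response_format: { type: "grammar", grammar: 'root ::= "yes" | "no"' },
};
```

Note that the prompt in each case restates the expected format; the constraint guarantees structure, but a format-aware prompt improves the quality of what goes inside that structure.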

Step 4: Execute Constrained Inference

Call engine.chat.completions.create() or engine.chatCompletion() with the constrained request. The engine compiles the grammar constraint and applies it as a token mask during each decode step. Only tokens that are valid continuations under the grammar are allowed, guaranteeing the output conforms to the specified format.

What happens internally:

  • The grammar/schema is compiled into an xgrammar GrammarMatcher
  • During each decode step, the matcher computes a bitmask of valid next tokens
  • The bitmask is applied to logits before sampling, zeroing out invalid tokens
  • The matcher state advances as tokens are generated
  • The output is guaranteed to be valid according to the constraint
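The masking step can be illustrated with a toy sketch. This is not the real xgrammar API (which operates on bitmasks over the full vocabulary); it only shows the core idea that invalid tokens are forced to negative infinity before sampling, so they can never be selected:

```typescript
// Toy illustration of grammar-constrained sampling: tokens outside the
// grammar's valid set are masked to -Infinity before the sampling step.
function applyTokenMask(logits: number[], validTokens: Set<number>): number[] {
  return logits.map((logit, tokenId) =>
    validTokens.has(tokenId) ? logit : -Infinity
  );
}

// Greedy sampling: pick the highest-logit token.
function greedySample(logits: number[]): number {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}

// Token 2 has the highest raw logit, but the grammar only allows
// tokens 0 and 3, so the masked greedy pick is token 3.
const masked = applyTokenMask([1.0, 0.5, 3.0, 2.0], new Set([0, 3]));
const picked = greedySample(masked); // → 3
```

In the real pipeline the GrammarMatcher recomputes the valid-token set after every generated token, so the mask tracks the grammar state as decoding proceeds.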

Step 5: Parse the Structured Output

Extract and parse the model's output. For JSON mode and JSON schema, the output can be directly parsed with JSON.parse(). For EBNF grammars, apply the appropriate parser for the defined grammar. The usage statistics include extra fields showing grammar compilation time.

Key considerations:

  • JSON outputs are guaranteed to be valid JSON.parse()-able strings
  • Schema-constrained outputs conform to the specified schema structure
  • usage.extra may contain grammar-related timing information
  • No retry logic is needed since the output is guaranteed to conform
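The parsing step for a schema-constrained reply can be sketched as follows. The content string below is a stand-in for reply.choices[0].message.content from an actual chatCompletion call, and the field names match the illustrative schema rather than anything fixed by web-llm:

```typescript
// Stand-in for the model's reply content from a real chatCompletion call.
const content = '{"name": "Ada Lovelace", "email": "ada@example.com"}';

// Safe to parse directly: constrained decoding guarantees valid JSON,
// so no try/catch-and-retry loop is needed around JSON.parse.
const parsed = JSON.parse(content) as { name: string; email: string };
```

The type assertion is safe precisely because the schema constraint guaranteed the structure at decode time; without constrained decoding, the same cast would need runtime validation.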

Execution Diagram

GitHub URL

Workflow Repository