Principle: mlc-ai/web-llm Chat Request Configuration
Overview
Chat Request Configuration is the technique of constructing structured request objects that specify conversation context and generation parameters for LLM inference. The request object follows the OpenAI Chat Completion API specification, enabling drop-in compatibility with existing OpenAI SDK code.
Description
Chat request configuration involves assembling an OpenAI-compatible request object that encodes:
- Conversation history -- An ordered array of messages with roles (system, user, assistant, tool) that represent the full dialog context
- Generation parameters -- Controls for the autoregressive sampling process, including temperature, top_p, max_tokens, and stop sequences
- Penalty parameters -- Frequency penalty, presence penalty, and repetition penalty to discourage repetitive outputs
- Output constraints -- Streaming mode toggle, response format specification (text, JSON, grammar, structural tags), and tool definitions for function calling
- Diagnostic options -- Logprob output, seed for deterministic generation, and latency breakdown reporting
The messages array is the primary mechanism for maintaining conversation state across turns. Each message has a role and content field:
- system -- Sets the model's behavior and personality; must appear first if present
- user -- The human's input; can be a string or an array of content parts (text + images for VLM models)
- assistant -- The model's previous responses; may include tool_calls for function calling
- tool -- Results from tool invocations, linked by tool_call_id
The last message in the array must always be from user or tool.
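The message rules above can be sketched with plain TypeScript objects. The ChatMessage type here is illustrative shorthand, not web-llm's actual exported type; web-llm accepts OpenAI-style message objects.

```typescript
// Illustrative message shape (hypothetical local type, not a web-llm export).
type ChatMessage = {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  tool_call_id?: string; // links a tool result back to the assistant's tool call
};

// System message first, alternating user/assistant turns,
// and the final message comes from the user.
const messages: ChatMessage[] = [
  { role: "system", content: "You are a concise technical assistant." },
  { role: "user", content: "What is nucleus sampling?" },
  {
    role: "assistant",
    content:
      "It samples from the smallest token set whose probability mass reaches p.",
  },
  { role: "user", content: "How does it differ from temperature?" },
];
```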
Usage
Use chat request configuration when preparing a request for chat completion inference. The request object is the primary interface between application code and the inference engine.
Common patterns:
- Single-turn chat -- One system message followed by one user message
- Multi-turn chat -- Full conversation history including alternating user and assistant messages; web-llm detects multi-round conversations and reuses the KV cache when possible
- Function calling -- Include a tools array with function definitions; the model may generate tool_calls in its response
- Structured output -- Set response_format to json_object with an optional JSON schema, or to grammar with an EBNF string, to constrain the output format
- Streaming -- Set stream: true to receive tokens incrementally as they are generated
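As a minimal sketch, a streaming single-turn request can be assembled as a plain object whose field names follow the OpenAI specification this section describes (the values here are illustrative):

```typescript
// Streaming single-turn request: one system message, one user message.
const request = {
  messages: [
    { role: "system", content: "Answer in one sentence." },
    { role: "user", content: "Define autoregressive sampling." },
  ],
  stream: true,    // deliver tokens incrementally as they are generated
  temperature: 0.7,
  max_tokens: 128, // hard cap on generated tokens
};
```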
Theoretical Basis
The request object follows the OpenAI Chat Completion API specification. Key parameters control the generation process:
Sampling Parameters
- temperature (0 to 2) -- Controls randomness of sampling. Higher values (0.8) produce more diverse outputs; lower values (0.2) produce more focused, deterministic outputs. At temperature 0, generation is approximately greedy.
- top_p (0 to 1) -- Nucleus sampling: considers only the tokens comprising the top p probability mass. For example, top_p: 0.1 means only the top 10% most likely tokens are considered.
- seed -- When set to an integer, enables deterministic generation. Repeated requests with the same seed and parameters return the same result. Seeding is per-request, not per-choice.
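The top_p cutoff can be sketched as: sort tokens by probability, then keep the smallest prefix whose cumulative mass reaches p. This is an illustrative re-implementation of the idea, not web-llm's internal sampler:

```typescript
// Sketch of nucleus (top-p) filtering: keep the smallest set of tokens
// whose cumulative probability mass reaches p.
function nucleus(probs: number[], p: number): number[] {
  const indexed = probs.map((prob, id) => ({ id, prob }));
  indexed.sort((a, b) => b.prob - a.prob); // descending by probability
  const kept: number[] = [];
  let mass = 0;
  for (const { id, prob } of indexed) {
    kept.push(id);
    mass += prob;
    if (mass >= p) break; // nucleus complete
  }
  return kept; // token IDs eligible for sampling
}

// With p = 0.5, only the two most likely tokens (IDs 1 and 2) survive.
console.log(nucleus([0.1, 0.4, 0.3, 0.2], 0.5)); // [1, 2]
```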
Penalty Parameters
- frequency_penalty (-2.0 to 2.0) -- Penalizes tokens based on their frequency in the generated text so far, reducing verbatim repetition
- presence_penalty (-2.0 to 2.0) -- Penalizes tokens that have appeared at all in the generated text, encouraging topic diversity
- repetition_penalty (> 0) -- A multiplicative penalty applied to tokens already seen; values > 1 discourage repetition, values < 1 encourage it
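How frequency and presence penalties combine can be sketched with the OpenAI-documented formula (logit minus count-scaled frequency penalty, minus a flat presence penalty if the token appeared at all); this is an illustrative sketch, not web-llm's internal code:

```typescript
// Sketch: apply frequency and presence penalties to raw logits, following
// the OpenAI-style formula:
//   logit' = logit - count * frequencyPenalty - (count > 0 ? presencePenalty : 0)
function applyPenalties(
  logits: number[],
  counts: number[], // how often each token has appeared in the output so far
  frequencyPenalty: number,
  presencePenalty: number,
): number[] {
  return logits.map((logit, id) => {
    const count = counts[id];
    return (
      logit -
      count * frequencyPenalty -          // grows with repetition count
      (count > 0 ? presencePenalty : 0)   // flat penalty for any appearance
    );
  });
}

// A token seen twice is penalized more than an unseen one.
console.log(applyPenalties([1, 1], [0, 2], 0.5, 0.5)); // [1, -0.5]
```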
Output Control
- max_tokens -- Hard limit on the number of generated tokens
- stop -- One or more sequences that, when generated, terminate the response
- n -- Number of independent completions to generate (only n=1 is supported in streaming mode)
- logit_bias -- A dictionary mapping token IDs to bias values (-100 to 100) that are added to the logits before sampling
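The logit_bias mechanism is simple enough to sketch directly: each bias is added to the corresponding token's logit before sampling, so -100 effectively bans a token and +100 effectively forces it. An illustrative sketch, not web-llm's internal sampler:

```typescript
// Sketch: logit_bias adds per-token offsets to logits before sampling.
function applyLogitBias(
  logits: number[],
  bias: Record<number, number>, // token ID -> bias value in [-100, 100]
): number[] {
  return logits.map((logit, id) => logit + (bias[id] ?? 0));
}

// Token 1 is effectively excluded from sampling.
console.log(applyLogitBias([0.2, 1.5, -0.3], { 1: -100 })); // [0.2, -98.5, -0.3]
```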
Response Format
The response_format field supports:
- text -- Default free-form text generation
- json_object -- Guarantees valid JSON output; optionally accepts a schema to further constrain structure
- grammar -- Constrains output to match an EBNF grammar string
- structural_tag -- Applies trigger-based constraints with tag-delimited blocks
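As a sketch, a request constraining output to schema-validated JSON might look like the fragment below; the exact schema value is an illustrative assumption (passed here as a JSON Schema string), following the section's description of json_object with an optional schema:

```typescript
// Request fragment constraining output to valid JSON matching a schema.
const jsonRequest = {
  messages: [{ role: "user", content: "List three colors as JSON." }],
  response_format: {
    type: "json_object",
    // Optional JSON Schema (illustrative) to further constrain structure.
    schema: JSON.stringify({
      type: "object",
      properties: { colors: { type: "array", items: { type: "string" } } },
      required: ["colors"],
    }),
  },
};
```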