Principle: mlc-ai/web-llm Chat Request Configuration
Overview
Chat Request Configuration is the technique of constructing structured request objects that specify conversation context and generation parameters for LLM inference. The request object follows the OpenAI Chat Completion API specification, enabling drop-in compatibility with existing OpenAI SDK code.
Description
Chat request configuration involves assembling an OpenAI-compatible request object that encodes:
- Conversation history -- An ordered array of messages with roles (system, user, assistant, tool) that represent the full dialog context
- Generation parameters -- Controls for the autoregressive sampling process, including temperature, top_p, max_tokens, and stop sequences
- Penalty parameters -- Frequency penalty, presence penalty, and repetition penalty to discourage repetitive outputs
- Output constraints -- Streaming mode toggle, response format specification (text, JSON, grammar, structural tags), and tool definitions for function calling
- Diagnostic options -- Logprob output, seed for deterministic generation, and latency breakdown reporting
The messages array is the primary mechanism for maintaining conversation state across turns. Each message has a role and content field:
- system -- Sets the model's behavior and personality; must appear first if present
- user -- The human's input; can be a string or an array of content parts (text + images for VLM models)
- assistant -- The model's previous responses; may include tool_calls for function calling
- tool -- Results from tool invocations, linked by tool_call_id
The last message in the array must always be from user or tool.
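The message rules above can be sketched with plain TypeScript objects. The ChatMessage type here is illustrative shorthand, not web-llm's actual exported type; web-llm accepts OpenAI-style message objects.

```typescript
// Illustrative message shape (hypothetical local type, not a web-llm export).
type ChatMessage = {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  tool_call_id?: string; // links a tool result back to the assistant's tool call
};

// System message first, alternating user/assistant turns,
// and the final message comes from the user.
const messages: ChatMessage[] = [
  { role: "system", content: "You are a concise technical assistant." },
  { role: "user", content: "What is nucleus sampling?" },
  {
    role: "assistant",
    content:
      "It samples from the smallest token set whose probability mass reaches p.",
  },
  { role: "user", content: "How does it differ from temperature?" },
];
```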
Usage
Use chat request configuration when preparing a request for chat completion inference. The request object is the primary interface between application code and the inference engine.
Common patterns:
- Single-turn chat -- One system message followed by one user message
- Multi-turn chat -- Full conversation history including alternating user and assistant messages; web-llm detects multi-round conversations and reuses the KV cache when possible
- Function calling -- Include a tools array with function definitions; the model may generate tool_calls in its response
- Structured output -- Set response_format to json_object with an optional JSON schema, or to grammar with an EBNF string, to constrain the output format
- Streaming -- Set stream: true to receive tokens incrementally as they are generated
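As a minimal sketch, a streaming single-turn request can be assembled as a plain object whose field names follow the OpenAI specification this section describes (the values here are illustrative):

```typescript
// Streaming single-turn request: one system message, one user message.
const request = {
  messages: [
    { role: "system", content: "Answer in one sentence." },
    { role: "user", content: "Define autoregressive sampling." },
  ],
  stream: true,    // deliver tokens incrementally as they are generated
  temperature: 0.7,
  max_tokens: 128, // hard cap on generated tokens
};
```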
Theoretical Basis
The request object follows the OpenAI Chat Completion API specification. Key parameters control the generation process:
Sampling Parameters
- temperature (0 to 2) -- Controls randomness of sampling. Higher values (0.8) produce more diverse outputs; lower values (0.2) produce more focused, deterministic outputs. At temperature 0, generation is approximately greedy.
- top_p (0 to 1) -- Nucleus sampling: considers only the tokens comprising the top p probability mass. For example, top_p: 0.1 means only the top 10% most likely tokens are considered.
- seed -- When set to an integer, enables deterministic generation. Repeated requests with the same seed and parameters return the same result. Seeding is per-request, not per-choice.
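The top_p cutoff can be sketched as: sort tokens by probability, then keep the smallest prefix whose cumulative mass reaches p. This is an illustrative re-implementation of the idea, not web-llm's internal sampler:

```typescript
// Sketch of nucleus (top-p) filtering: keep the smallest set of tokens
// whose cumulative probability mass reaches p.
function nucleus(probs: number[], p: number): number[] {
  const indexed = probs.map((prob, id) => ({ id, prob }));
  indexed.sort((a, b) => b.prob - a.prob); // descending by probability
  const kept: number[] = [];
  let mass = 0;
  for (const { id, prob } of indexed) {
    kept.push(id);
    mass += prob;
    if (mass >= p) break; // nucleus complete
  }
  return kept; // token IDs eligible for sampling
}

// With p = 0.5, only the two most likely tokens (IDs 1 and 2) survive.
console.log(nucleus([0.1, 0.4, 0.3, 0.2], 0.5)); // [1, 2]
```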
Penalty Parameters
- frequency_penalty (-2.0 to 2.0) -- Penalizes tokens based on their frequency in the generated text so far, reducing verbatim repetition
- presence_penalty (-2.0 to 2.0) -- Penalizes tokens that have appeared at all in the generated text, encouraging topic diversity
- repetition_penalty (> 0) -- A multiplicative penalty applied to tokens already seen; values > 1 discourage repetition, values < 1 encourage it
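How frequency and presence penalties combine can be sketched with the OpenAI-documented formula (logit minus count-scaled frequency penalty, minus a flat presence penalty if the token appeared at all); this is an illustrative sketch, not web-llm's internal code:

```typescript
// Sketch: apply frequency and presence penalties to raw logits, following
// the OpenAI-style formula:
//   logit' = logit - count * frequencyPenalty - (count > 0 ? presencePenalty : 0)
function applyPenalties(
  logits: number[],
  counts: number[], // how often each token has appeared in the output so far
  frequencyPenalty: number,
  presencePenalty: number,
): number[] {
  return logits.map((logit, id) => {
    const count = counts[id];
    return (
      logit -
      count * frequencyPenalty -          // grows with repetition count
      (count > 0 ? presencePenalty : 0)   // flat penalty for any appearance
    );
  });
}

// A token seen twice is penalized more than an unseen one.
console.log(applyPenalties([1, 1], [0, 2], 0.5, 0.5)); // [1, -0.5]
```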
Output Control
- max_tokens -- Hard limit on the number of generated tokens
- stop -- One or more sequences that, when generated, terminate the response
- n -- Number of independent completions to generate (only n=1 is supported in streaming mode)
- logit_bias -- A dictionary mapping token IDs to bias values (-100 to 100) that are added to the logits before sampling
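The logit_bias mechanism is simple enough to sketch directly: each bias is added to the corresponding token's logit before sampling, so -100 effectively bans a token and +100 effectively forces it. An illustrative sketch, not web-llm's internal sampler:

```typescript
// Sketch: logit_bias adds per-token offsets to logits before sampling.
function applyLogitBias(
  logits: number[],
  bias: Record<number, number>, // token ID -> bias value in [-100, 100]
): number[] {
  return logits.map((logit, id) => logit + (bias[id] ?? 0));
}

// Token 1 is effectively excluded from sampling.
console.log(applyLogitBias([0.2, 1.5, -0.3], { 1: -100 })); // [0.2, -98.5, -0.3]
```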
Response Format
The response_format field supports:
- text -- Default free-form text generation
- json_object -- Guarantees valid JSON output; optionally accepts a schema to further constrain structure
- grammar -- Constrains output to match an EBNF grammar string
- structural_tag -- Applies trigger-based constraints with tag-delimited blocks
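As a sketch, a request constraining output to schema-validated JSON might look like the fragment below; the exact schema value is an illustrative assumption (passed here as a JSON Schema string), following the section's description of json_object with an optional schema:

```typescript
// Request fragment constraining output to valid JSON matching a schema.
const jsonRequest = {
  messages: [{ role: "user", content: "List three colors as JSON." }],
  response_format: {
    type: "json_object",
    // Optional JSON Schema (illustrative) to further constrain structure.
    schema: JSON.stringify({
      type: "object",
      properties: { colors: { type: "array", items: { type: "string" } } },
      required: ["colors"],
    }),
  },
};
```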