Principle: MLC AI WebLLM Streaming Response Processing
Overview
Streaming Response Processing is the technique of consuming incremental token-by-token output from language model inference and assembling it into coherent responses. Rather than waiting for the full generation to complete, streaming delivers each generated token (or small group of tokens) as soon as it is available, enabling real-time UI updates and reduced perceived latency.
Description
Streaming response processing consumes the output of autoregressive decoding as individual token chunks rather than waiting for complete generation. The processing model works as follows:
Streaming Mode
When stream: true is set in the request, the inference engine returns a Promise<AsyncIterable<ChatCompletionChunk>>. Each chunk contains:
- choices[0].delta.content -- The incremental text generated since the last chunk
- choices[0].delta.role -- Set to "assistant" in the first chunk
- choices[0].finish_reason -- null for intermediate chunks; one of "stop", "length", "tool_calls", or "abort" for the final chunk
- choices[0].logprobs -- Log probability information if logprobs: true was set in the request
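The chunk fields above can be sketched as a simplified TypeScript interface. This is an illustrative stand-in, not the real type from @mlc-ai/web-llm; the field names follow the description in this section.

```typescript
// Hypothetical, simplified shape of a streaming chunk, mirroring the
// OpenAI-style fields described above.
interface ChatCompletionChunk {
  id: string;
  created: number;
  choices: {
    delta: { content?: string; role?: "assistant" };
    finish_reason: "stop" | "length" | "tool_calls" | "abort" | null;
    logprobs?: unknown;
  }[];
}

// Extract the incremental text from a chunk, defaulting to "" for
// chunks whose delta carries no content (e.g. the final chunk).
function deltaText(chunk: ChatCompletionChunk): string {
  return chunk.choices[0]?.delta.content ?? "";
}
```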
The stream produces the following sequence of chunks:
- Prefill chunk -- First chunk after the prefill phase completes, containing the first generated token
- Decode chunks -- One chunk per decode step, each containing the next generated token (skipping chunks when incomplete multi-byte characters like emojis are detected)
- Final chunk -- Contains an empty delta (or tool_calls if function calling is used) with a non-null finish_reason
- Usage chunk (optional) -- If stream_options: { include_usage: true } is set, a final chunk with empty choices and populated usage statistics
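The chunk sequence can be illustrated with a toy async generator that emits a prefill chunk, per-token decode chunks, a final chunk, and an optional usage chunk. This is purely a sketch of the described sequence; the real engine drives each step from actual model inference.

```typescript
// Toy chunk type; the real types live in @mlc-ai/web-llm.
interface ToyChunk {
  choices: { delta: { content?: string; role?: string }; finish_reason: string | null }[];
  usage?: { total_tokens: number };
}

// Emit the chunk sequence described above for a fixed token list.
async function* toyStream(tokens: string[], includeUsage = false): AsyncGenerator<ToyChunk> {
  // Prefill chunk: first generated token, role announced as "assistant".
  yield { choices: [{ delta: { role: "assistant", content: tokens[0] }, finish_reason: null }] };
  // Decode chunks: one per subsequent token.
  for (const t of tokens.slice(1)) {
    yield { choices: [{ delta: { content: t }, finish_reason: null }] };
  }
  // Final chunk: empty delta with a non-null finish_reason.
  yield { choices: [{ delta: {}, finish_reason: "stop" }] };
  // Optional usage chunk: empty choices, populated usage statistics.
  if (includeUsage) yield { choices: [], usage: { total_tokens: tokens.length } };
}
```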
Non-Streaming Mode
When stream is false or unset, the engine buffers all generated tokens internally and returns a single ChatCompletion object containing:
- choices[0].message.content -- The complete generated text
- choices[0].message.role -- Always "assistant"
- choices[0].finish_reason -- The termination reason
- usage -- Complete token statistics and performance metrics
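Non-streaming mode suits post-processing that needs the whole response at once, such as JSON parsing. A minimal sketch, using a simplified stand-in for the real ChatCompletion type:

```typescript
// Simplified stand-in for the non-streaming ChatCompletion shape
// described above; not the actual @mlc-ai/web-llm type.
interface ToyCompletion {
  choices: { message: { role: "assistant"; content: string }; finish_reason: string }[];
  usage: { prompt_tokens: number; completion_tokens: number; total_tokens: number };
}

// Parse a complete JSON reply. This is only safe when the full text is
// present, which is why non-streaming mode fits this use case.
function parseJsonReply(completion: ToyCompletion): unknown {
  return JSON.parse(completion.choices[0].message.content);
}
```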
Unicode Handling
The streaming implementation includes special handling for multi-byte Unicode characters (such as emojis). Since each emoji is composed of multiple tokens, partially decoded emojis appear as the Unicode replacement character (U+FFFD). The engine detects trailing replacement characters and skips yielding a chunk until the full character is decoded, preventing broken character display in the UI.
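The replacement-character guard can be sketched as follows: hold back a chunk while the decoded text ends in U+FFFD, and emit the accumulated delta once the full character decodes. The helper names are illustrative, not web-llm APIs.

```typescript
// A snapshot ending in U+FFFD indicates a partially decoded
// multi-byte character, so the chunk should be held back.
function shouldHoldChunk(decodedSoFar: string): boolean {
  return decodedSoFar.endsWith("\uFFFD");
}

// Given successive decoded snapshots of the output, return the deltas
// that would actually be yielded (skipping mid-character snapshots).
function emittedDeltas(snapshots: string[]): string[] {
  const out: string[] = [];
  let lastEmitted = "";
  for (const s of snapshots) {
    if (shouldHoldChunk(s)) continue; // wait for the full character
    out.push(s.slice(lastEmitted.length));
    lastEmitted = s;
  }
  return out;
}
```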
Usage
- Use streaming when building interactive UIs that display tokens as they are generated. This reduces perceived latency because the user sees the first token after only the prefill phase, rather than waiting for the entire generation.
- Use non-streaming when the complete response is needed before processing, such as JSON parsing, tool call extraction, or batch processing scenarios.
Streaming is consumed via the for await...of pattern on the returned async iterable:
const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});
for await (const chunk of stream) {
  // Process each chunk
}
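In practice the loop body concatenates each delta into a running string, for example to update a UI element. A minimal sketch written against any AsyncIterable of chunk-shaped objects rather than a live engine:

```typescript
// Minimal chunk shape needed for accumulation.
type DeltaChunk = { choices: { delta: { content?: string } }[] };

// Accumulate delta.content from every chunk into the full response text.
async function collectText(stream: AsyncIterable<DeltaChunk>): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta.content ?? ""; // append the increment
  }
  return text;
}
```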
Theoretical Basis
Streaming uses async iteration (for-await-of) over ChatCompletionChunk objects produced by an AsyncGenerator function. The generator:
- Runs the prefill step and yields the first chunk
- Enters the decode loop, yielding a chunk after each decode step
- After the loop terminates, yields a final chunk with finish_reason set
- Optionally yields a usage statistics chunk
Each chunk contains choices[].delta.content with the incremental text. The consumer concatenates these deltas to build the full response. The stream terminates when finish_reason is non-null, signaling one of:
- "stop" -- Natural stop token or stop sequence encountered
- "length" --
max_tokenslimit or context window exhausted - "tool_calls" -- Function calling output completed (rewrites finish reason from "stop" to "tool_calls" when tools are present)
- "abort" -- User interrupted generation via
engine.interruptGenerate()
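A consumer typically branches on the terminal finish_reason. A hypothetical handler mapping each reason from the list above to a human-readable outcome (the function name is illustrative, not part of web-llm):

```typescript
// The four terminal reasons described above.
type FinishReason = "stop" | "length" | "tool_calls" | "abort";

// Map each finish_reason to a description of how generation ended.
function describeFinish(reason: FinishReason): string {
  switch (reason) {
    case "stop":
      return "completed at a stop token or stop sequence";
    case "length":
      return "truncated by max_tokens or the context window";
    case "tool_calls":
      return "produced a function call";
    case "abort":
      return "interrupted by the user";
  }
}
```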
Usage statistics (total tokens, prefill/decode throughput, latency breakdowns) are available in the final usage chunk when stream_options: { include_usage: true } is set. In non-streaming mode, they appear directly in the ChatCompletion.usage field.
All chunks in a single request share the same id (a UUID) and created timestamp, enabling consumers to correlate chunks belonging to the same generation.
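Because all chunks of one generation share an id, a consumer handling interleaved chunks from several concurrent requests can bucket them by id. A toy grouping helper over a simplified chunk shape:

```typescript
// Simplified chunk shape carrying the shared per-request id.
type IdChunk = { id: string; choices: { delta: { content?: string } }[] };

// Group chunks by id, concatenating each generation's deltas in order.
function groupById(chunks: IdChunk[]): Map<string, string> {
  const texts = new Map<string, string>();
  for (const c of chunks) {
    const prev = texts.get(c.id) ?? "";
    texts.set(c.id, prev + (c.choices[0]?.delta.content ?? ""));
  }
  return texts;
}
```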