

Principle: mlc-ai/web-llm Streaming Response Processing

From Leeroopedia

Overview

Streaming Response Processing is the technique of consuming incremental token-by-token output from language model inference and assembling it into coherent responses. Rather than waiting for the full generation to complete, streaming delivers each generated token (or small group of tokens) as soon as it is available, enabling real-time UI updates and reduced perceived latency.

Description

Streaming response processing consumes the output of autoregressive decoding as individual token chunks rather than waiting for complete generation. The processing model works as follows:

Streaming Mode

When stream: true is set in the request, the inference engine returns a Promise<AsyncIterable<ChatCompletionChunk>>. Each chunk contains:

  • choices[0].delta.content -- The incremental text generated since the last chunk
  • choices[0].delta.role -- Set to "assistant" in the first chunk
  • choices[0].finish_reason -- null for intermediate chunks; one of "stop", "length", "tool_calls", or "abort" for the final chunk
  • choices[0].logprobs -- Log probability information if logprobs: true was set in the request

The stream produces the following sequence of chunks:

  1. Prefill chunk -- First chunk after the prefill phase completes, containing the first generated token
  2. Decode chunks -- One chunk per decode step, each containing the next generated token (skipping chunks when incomplete multi-byte characters like emojis are detected)
  3. Final chunk -- Contains an empty delta (or tool_calls if function calling) with a non-null finish_reason
  4. Usage chunk (optional) -- If stream_options: { include_usage: true } is set, a final chunk with empty choices and populated usage statistics
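The four-step sequence above can be simulated with a small async generator. The Chunk type below mirrors only the fields discussed here; it is an illustrative sketch, not the library's actual type definitions.

```typescript
type Delta = { role?: string; content?: string };
type Chunk = {
  choices: { delta: Delta; finish_reason: string | null }[];
  usage?: { completion_tokens: number };
};

// Simulated producer yielding the prefill, decode, final, and usage chunks.
async function* fakeStream(): AsyncGenerator<Chunk> {
  // 1. Prefill chunk: first generated token, role set in the first delta
  yield { choices: [{ delta: { role: "assistant", content: "Hel" }, finish_reason: null }] };
  // 2. Decode chunk: one token per decode step
  yield { choices: [{ delta: { content: "lo" }, finish_reason: null }] };
  // 3. Final chunk: empty delta with a non-null finish_reason
  yield { choices: [{ delta: {}, finish_reason: "stop" }] };
  // 4. Usage chunk: empty choices, populated usage statistics
  yield { choices: [], usage: { completion_tokens: 2 } };
}

// Consumer concatenates the deltas into the full response text.
async function collectText(): Promise<string> {
  let text = "";
  for await (const chunk of fakeStream()) {
    text += chunk.choices[0]?.delta?.content ?? "";
  }
  return text;
}

collectText().then((text) => console.log(text)); // prints "Hello"
```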

Non-Streaming Mode

When stream is false or unset, the engine buffers all generated tokens internally and returns a single ChatCompletion object containing:

  • choices[0].message.content -- The complete generated text
  • choices[0].message.role -- Always "assistant"
  • choices[0].finish_reason -- The termination reason
  • usage -- Complete token statistics and performance metrics
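For contrast, a non-streaming result has the shape sketched below (illustrative values; only the fields listed above are shown, not the library's full type):

```typescript
// Illustrative ChatCompletion-shaped object with made-up values.
const completion = {
  choices: [
    {
      message: { role: "assistant", content: "Hello there!" },
      finish_reason: "stop",
    },
  ],
  usage: { prompt_tokens: 5, completion_tokens: 4, total_tokens: 9 },
};

// The complete text is available in one place, ready for e.g. JSON parsing.
console.log(completion.choices[0].message.content); // prints "Hello there!"
```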

Unicode Handling

The streaming implementation includes special handling for multi-byte Unicode characters such as emojis. Because an emoji may be encoded as several tokens, a partially decoded emoji appears as the Unicode replacement character (U+FFFD). The engine detects trailing replacement characters and skips yielding a chunk until the full character is decoded, preventing broken character display in the UI.
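The detection step can be sketched as follows; the helper name is hypothetical, since the engine's internal check is analogous but not necessarily named this.

```typescript
// A partially decoded multi-byte character surfaces as U+FFFD at the end
// of the decoded text, so the chunk is held back until it completes.
function endsWithReplacementChar(text: string): boolean {
  return text.endsWith("\uFFFD");
}

console.log(endsWithReplacementChar("Hello \uFFFD")); // true  -> skip this chunk
console.log(endsWithReplacementChar("Hello 👋"));     // false -> safe to yield
```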

Usage

  • Use streaming when building interactive UIs that display tokens as they are generated. This reduces perceived latency because the user sees the first token after only the prefill phase, rather than waiting for the entire generation.
  • Use non-streaming when the complete response is needed before processing, such as JSON parsing, tool call extraction, or batch processing scenarios.

Streaming is consumed via the for await...of pattern on the returned async iterable:

const stream = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});
let reply = "";
for await (const chunk of stream) {
  reply += chunk.choices[0]?.delta?.content ?? ""; // append the incremental text
}

Theoretical Basis

Streaming uses async iteration (for-await-of) over ChatCompletionChunk objects produced by an AsyncGenerator function. The generator:

  1. Runs the prefill step and yields the first chunk
  2. Enters the decode loop, yielding a chunk after each decode step
  3. After the loop terminates, yields a final chunk with finish_reason set
  4. Optionally yields a usage statistics chunk

Each chunk contains choices[].delta.content with the incremental text. The consumer concatenates these deltas to build the full response. The stream terminates when finish_reason is non-null, signaling one of:

  • "stop" -- Natural stop token or stop sequence encountered
  • "length" -- max_tokens limit or context window exhausted
  • "tool_calls" -- Function calling output completed (the engine rewrites the finish reason from "stop" to "tool_calls" when tools are present)
  • "abort" -- User interrupted generation via engine.interruptGenerate()
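The four termination cases above can be dispatched with a simple switch. This handler is illustrative, not part of the library API:

```typescript
// Documented termination reasons for a streaming generation.
type FinishReason = "stop" | "length" | "tool_calls" | "abort";

// Map each finish_reason to a human-readable description of why the
// stream ended; TypeScript's exhaustiveness check covers all four cases.
function describeFinish(reason: FinishReason): string {
  switch (reason) {
    case "stop":
      return "natural stop token or stop sequence encountered";
    case "length":
      return "max_tokens limit or context window exhausted";
    case "tool_calls":
      return "function calling output completed";
    case "abort":
      return "generation interrupted via engine.interruptGenerate()";
  }
}

console.log(describeFinish("length")); // prints "max_tokens limit or context window exhausted"
```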

Usage statistics (total tokens, prefill/decode throughput, latency breakdowns) are available in the final usage chunk when stream_options: { include_usage: true } is set. In non-streaming mode, they appear directly in the ChatCompletion.usage field.
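A consumer that wants both text and statistics can guard for the final usage chunk while accumulating, as in this sketch. The chunk shapes follow the description above; the throughput field name (decode_tokens_per_s) is an illustrative placeholder, not necessarily the library's exact field.

```typescript
type StatChunk = {
  choices: { delta: { content?: string } }[];
  usage?: { total_tokens: number; decode_tokens_per_s?: number };
};

// Simulated received chunks: two deltas, then the final usage chunk
// with empty choices and populated statistics.
const received: StatChunk[] = [
  { choices: [{ delta: { content: "Hi" } }] },
  { choices: [{ delta: { content: "!" } }] },
  { choices: [], usage: { total_tokens: 7, decode_tokens_per_s: 42 } },
];

let text = "";
let usage: StatChunk["usage"];
for (const chunk of received) {
  text += chunk.choices[0]?.delta?.content ?? ""; // empty choices yield nothing
  if (chunk.usage) usage = chunk.usage;           // statistics arrive last
}
console.log(text, usage?.total_tokens); // prints "Hi! 7"
```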

All chunks in a single request share the same id (a UUID) and created timestamp, enabling consumers to correlate chunks belonging to the same generation.
