
Principle: Ollama Response Streaming

From Leeroopedia
Knowledge Sources
Domains Systems, Networking, API_Design
Last Updated 2026-02-14 00:00 GMT

Overview

A streaming response delivery mechanism that sends generated tokens to clients incrementally as they are produced, supporting both chunked JSON streaming and non-streaming aggregated responses.

Description

Response Streaming solves the latency problem in LLM inference: rather than waiting for the entire response to be generated (which can take seconds to minutes), tokens are sent to the client as soon as they are sampled. This provides immediate feedback and enables real-time UI rendering of partial responses.

The mechanism supports two modes:

  • Streaming: Each token (or small batch) is sent as an individual JSON object separated by newlines (NDJSON). The final object includes done: true and timing metrics.
  • Non-streaming: All tokens are accumulated internally and sent as a single JSON response when generation completes.

Both the /api/generate and /api/chat endpoints support streaming, with the ChatHandler also supporting tool call extraction, thinking content, and structured output validation.

Usage

Use this principle when designing LLM API endpoints where time-to-first-token latency matters. Streaming should be the default mode for interactive chat applications, while non-streaming is appropriate for batch processing or API clients that prefer complete responses.
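Clients select the mode per request via the stream flag in the JSON body, which defaults to streaming. A minimal sketch of building both request bodies (the model name and prompt are placeholders):

```python
import json

def build_generate_request(model: str, prompt: str, stream: bool = True) -> bytes:
    """Build a JSON body for a /api/generate-style endpoint.

    stream=True (the default) requests NDJSON chunks; stream=False
    requests a single aggregated JSON response.
    """
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return json.dumps(payload).encode()

# Interactive chat: stream tokens as they arrive.
interactive = build_generate_request("llama3", "Hello", stream=True)
# Batch processing: one complete response when generation finishes.
batch = build_generate_request("llama3", "Hello", stream=False)
```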

Theoretical Basis

Streaming response delivery follows the producer-consumer pattern:

  1. Token Production: The inference engine generates tokens one at a time via the sampling pipeline.
  2. Callback Invocation: Each token triggers a callback function that receives a partial response.
  3. JSON Serialization: The callback serializes the partial response to JSON.
  4. Chunked Transfer: The JSON is written to the HTTP response writer and flushed immediately.
  5. Completion Signal: The final callback includes done=true, total timing metrics, and token counts.

For non-streaming mode, the callback accumulates tokens into a buffer, and only the final response is written to the client.
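The five steps above, plus the non-streaming accumulation path, can be sketched as a single callback-driven loop. Here the token iterator stands in for the sampling pipeline and the write callable stands in for an HTTP response writer that flushes on each call; both are assumptions of this sketch, not Ollama's actual interfaces.

```python
import json
from typing import Callable, Iterator

def serve_generation(tokens: Iterator[str],
                     write: Callable[[bytes], None],
                     stream: bool) -> None:
    """Sketch of the producer-consumer delivery loop.

    Streaming mode serializes and writes each token as it is produced;
    non-streaming mode buffers tokens and writes one final response.
    """
    buffered = []
    count = 0
    for tok in tokens:  # 1. token production
        count += 1
        if stream:
            # 2-4. callback serializes the partial response and flushes it
            write(json.dumps({"response": tok, "done": False}).encode() + b"\n")
        else:
            buffered.append(tok)  # accumulate; nothing sent yet
    # 5. completion signal with done=true and token count
    final = {
        "response": "" if stream else "".join(buffered),
        "done": True,
        "eval_count": count,
    }
    write(json.dumps(final).encode() + b"\n")

# Streaming: one chunk per token, then the completion chunk.
chunks = []
serve_generation(iter(["Hi", " there"]), chunks.append, stream=True)
```

Note that both modes share the same loop: the mode flag only changes whether each token is flushed immediately or accumulated for the final write.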

Related Pages

Implemented By
