Principle: Ollama Response Streaming
| Knowledge Sources | |
|---|---|
| Domains | Systems, Networking, API_Design |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A streaming response delivery mechanism that sends generated tokens to clients incrementally as they are produced, supporting both chunked JSON streaming and non-streaming aggregated responses.
Description
Response Streaming solves the latency problem in LLM inference: rather than waiting for the entire response to be generated (which can take seconds to minutes), tokens are sent to the client as soon as they are sampled. This provides immediate feedback and enables real-time UI rendering of partial responses.
The mechanism supports two modes:
- Streaming: Each token (or small batch) is sent as an individual JSON object separated by newlines (NDJSON). The final object includes done: true and timing metrics.
- Non-streaming: All tokens are accumulated internally and sent as a single JSON response when generation completes.
Both the /api/generate and /api/chat endpoints support streaming, with the ChatHandler also supporting tool call extraction, thinking content, and structured output validation.
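The NDJSON wire format can be illustrated with a short sketch that consumes a simulated stream. The field names response and done follow the chunk shape described above; the sample text itself is invented:

```python
import json

# Simulated NDJSON stream, as /api/generate returns in streaming mode:
# one JSON object per line, with the final object carrying done: true.
stream = (
    '{"response": "Hello", "done": false}\n'
    '{"response": " world", "done": false}\n'
    '{"response": "", "done": true}\n'
)

full = []
for line in stream.splitlines():
    chunk = json.loads(line)
    full.append(chunk["response"])  # render partial text as it arrives
    if chunk["done"]:
        break  # final object also carries timing metrics in practice

print("".join(full))
```

A client rendering a chat UI would append each chunk's response to the display instead of buffering it, which is what makes partial output visible before generation finishes.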
Usage
Use this principle when designing LLM API endpoints where time-to-first-token latency matters. Streaming should be the default mode for interactive chat applications, while non-streaming is appropriate for batch processing or API clients that prefer complete responses.
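The mode is selected per request. A minimal sketch of the two request shapes, assuming the standard model/prompt/stream payload fields (the model name here is illustrative):

```python
import json

# Streaming is the default, so the field can be omitted for interactive use.
streaming_req = {"model": "llama3", "prompt": "Why is the sky blue?"}

# Batch clients opt out explicitly and receive one aggregated JSON response.
batch_req = {"model": "llama3", "prompt": "Why is the sky blue?", "stream": False}

print(json.dumps(batch_req))
```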
Theoretical Basis
Streaming response delivery follows the producer-consumer pattern:
- Token Production: The inference engine generates tokens one at a time via the sampling pipeline.
- Callback Invocation: Each token triggers a callback function that receives a partial response.
- JSON Serialization: The callback serializes the partial response to JSON.
- Chunked Transfer: The JSON is written to the HTTP response writer and flushed immediately.
- Completion Signal: The final callback includes done: true, total timing metrics, and token counts.
For non-streaming mode, the callback accumulates tokens into a buffer, and only the final response is written to the client.
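The two callback behaviors can be sketched as follows. Here generate, its token source, and the response fields are simplified stand-ins for the real inference engine, not Ollama's actual internals:

```python
import io
import json
import time

def generate(tokens, callback):
    """Producer: invokes the callback once per sampled token, then once
    more with done: true plus timing metrics and the token count."""
    start = time.time()
    for tok in tokens:
        callback({"response": tok, "done": False})
    callback({
        "response": "",
        "done": True,
        "total_duration": int((time.time() - start) * 1e9),  # nanoseconds
        "eval_count": len(tokens),
    })

out = io.StringIO()

# Streaming mode: serialize every partial response and write it immediately.
def stream_cb(resp):
    out.write(json.dumps(resp) + "\n")  # one NDJSON object per callback
    # A real HTTP handler would flush the response writer here.

generate(["Hi", " there"], stream_cb)

# Non-streaming mode: accumulate tokens; emit a single response at the end.
parts = []
def batch_cb(resp):
    parts.append(resp["response"])
    if resp["done"]:
        final = {"response": "".join(parts), "done": True,
                 "eval_count": resp["eval_count"]}
        out.write(json.dumps(final) + "\n")

generate(["Hi", " there"], batch_cb)
```

Note that both modes share the same producer; only the callback differs, which is what lets one inference path serve both chunked and aggregated clients.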