Principle:Sgl project Sglang Streaming Response Handling
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Streaming, API_Design |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A real-time token delivery mechanism that sends generated tokens incrementally to clients via Server-Sent Events (SSE) as they are produced.
Description
Streaming response handling allows LLM clients to receive generated tokens in real-time rather than waiting for the complete response. This dramatically reduces perceived latency for long generations. SGLang implements streaming via the HTTP SSE protocol: when stream=True is set, the server sends ChatCompletionChunk objects as data events, each containing the next generated token(s) in delta.content. The stream terminates with a [DONE] event.
Usage
Enable streaming for interactive applications (chatbots, coding assistants) where users benefit from seeing partial responses immediately. Streaming is essential for long-form generation where waiting for the complete response would be unacceptable.
Theoretical Basis
Server-Sent Events (SSE) is an HTTP-based unidirectional streaming protocol:
- Client sends a single HTTP POST request with stream: true
- Server responds with Content-Type: text/event-stream
- Each token is sent as a data: event with a JSON ChatCompletionChunk
- The stream ends with data: [DONE]
Advantages over WebSockets:
- Simpler protocol — standard HTTP
- Automatic reconnection
- Compatible with proxies and load balancers