Implementation:Sgl project Sglang Chat Completions Streaming
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Streaming, API_Design |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for handling streaming chat completion responses from the SGLang server via Server-Sent Events.
Description
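When stream=True is passed to the /v1/chat/completions endpoint, the server returns a Server-Sent Events (SSE) stream of ChatCompletionChunk objects. Each chunk carries the next fragment of generated text in choices[0].delta.content, which may be None or empty (for example, on the opening role chunk or the closing chunk that carries finish_reason). The OpenAI Python SDK handles SSE parsing and exposes the chunks as an iterator. SGLang also supports stream_options: with {"include_usage": True}, usage statistics arrive in a final chunk, which in the OpenAI-compatible format has an empty choices list.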
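To make the SSE framing concrete, here is a minimal sketch of what the SDK does on the client's behalf. It assumes the OpenAI-style wire format (each event is a `data: ` line carrying one JSON chunk, and the stream ends with `data: [DONE]`). The helper name `iter_delta_text` and the sample events are illustrative, not taken from the SGLang source or from captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_delta_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield delta.content fragments from raw SSE lines (hypothetical helper)."""
    for raw in lines:
        line = raw.strip()
        # Skip blank event separators and any non-data fields.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        # A usage-only chunk may carry an empty choices list.
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events for illustration only.
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " world"}}]}',
    "",
    'data: {"choices": [], "usage": {"prompt_tokens": 5, "completion_tokens": 2}}',
    "",
    "data: [DONE]",
]

print("".join(iter_delta_text(sample)))
```

In practice the OpenAI SDK iterator replaces this parsing entirely; the sketch only shows why empty-content and empty-choices chunks need to be tolerated.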
Usage
Set stream=True in chat completion requests when building interactive applications. Iterate over the response to process tokens as they arrive.
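Interactive applications often need the full reply as well as the incremental fragments. The following is a minimal sketch of an accumulator that works on any iterable of chunk-like objects. The helper names `collect_stream` and `_fake_chunk` are hypothetical, and the stand-in chunks built with SimpleNamespace only mimic the attributes used here (choices, delta.content, usage); they are an assumption about the shape, not real SDK objects.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional, Tuple


def collect_stream(chunks: Iterable[Any]) -> Tuple[str, Optional[Any]]:
    """Join all delta.content fragments; also return usage if any chunk has it."""
    parts = []
    usage = None
    for chunk in chunks:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        # Assumption: a usage-only chunk has an empty choices list.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts), usage


def _fake_chunk(text=None, usage=None):
    """Build a stand-in chunk for demonstration (not an SDK type)."""
    choices = []
    if text is not None:
        choices = [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


fake_stream = [
    _fake_chunk("Hel"),
    _fake_chunk("lo"),
    _fake_chunk(usage=SimpleNamespace(prompt_tokens=3, completion_tokens=2)),
]
text, usage = collect_stream(fake_stream)
print(text)
```

With a real client, the return value of client.chat.completions.create(..., stream=True) could be passed in place of fake_stream.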
Code Reference
Source Location
- Repository: sglang
- File: python/sglang/srt/entrypoints/http_server.py
- Lines: L1324-1331 (same endpoint as non-streaming, with stream=True)
- Streaming logic: python/sglang/srt/entrypoints/openai/serving_base.py
Signature
# Client-side (OpenAI SDK); schematic signature, not executable code
stream = client.chat.completions.create(
    model: str,
    messages: List[Dict],
    stream: bool = True,
    stream_options: Optional[Dict] = None,  # e.g., {"include_usage": True}
    **sampling_params,
) -> Stream[ChatCompletionChunk]

# Each chunk has:
#   chunk.choices[0].delta.content -> Optional[str]
Import
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | Model name |
| messages | List[Dict] | Yes | Conversation messages |
| stream | bool | Yes | Must be True for streaming |
| stream_options | Optional[Dict] | No | Options like {"include_usage": True} |
Outputs
| Name | Type | Description |
|---|---|---|
| stream | Stream[ChatCompletionChunk] | Iterator that yields chunks whose choices[0].delta.content holds generated text fragments |
Usage Examples
Basic Streaming
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a short story."}],
    stream=True,
    max_tokens=256,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # Newline at end
With Usage Statistics
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain AI."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    # The usage-only final chunk is expected to have an empty choices list,
    # so check that choices is non-empty before indexing into it.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\nTokens: {chunk.usage.prompt_tokens} + {chunk.usage.completion_tokens}")