Implementation:Sgl project Sglang Chat Completions Streaming

Knowledge Sources	SGLang
Domains	LLM_Serving, Streaming, API_Design
Last Updated	2026-02-10 00:00 GMT

Overview

Concrete tool for consuming streaming chat completion responses from the SGLang server, delivered as Server-Sent Events (SSE).

Description

When stream=True is passed to the /v1/chat/completions endpoint, the server returns an SSE stream of ChatCompletionChunk objects. In each chunk, choices[0].delta.content holds the next fragment of generated text, or None when the chunk carries no text (for example, a role-only chunk). The OpenAI Python SDK parses the SSE stream and exposes the chunks as an iterator. SGLang also supports stream_options: with {"include_usage": True}, usage statistics arrive in the final chunk, which under the OpenAI API convention has an empty choices list.
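
To make the wire format concrete, the sketch below parses a hand-written SSE payload roughly the way a client library would. The sample lines are illustrative and not captured server output, the helper names iter_sse_json and sse_text are hypothetical, and the "data: [DONE]" terminator is assumed from the OpenAI streaming convention.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each SSE `data:` line until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(payload)


def sse_text(lines: Iterable[str]) -> str:
    """Concatenate every delta.content fragment found in the stream."""
    parts = []
    for event in iter_sse_json(lines):
        for choice in event.get("choices", []):
            parts.append(choice["delta"].get("content") or "")
    return "".join(parts)


# Hand-written sample payload (illustrative only).
SAMPLE = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]

print(sse_text(SAMPLE))
```

In practice the OpenAI SDK performs this parsing internally, so application code rarely needs to touch raw SSE lines.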

Usage

Set stream=True in chat completion requests when building interactive applications. Iterate over the response to process tokens as they arrive.
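
One possible way to structure the iteration is a small helper that accumulates the text and captures usage if it appears. This is a minimal sketch, not part of SGLang or the OpenAI SDK: collect_stream and fake_chunk are hypothetical names, and the fake chunks stand in for real ChatCompletionChunk objects so the example runs without a server.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, Optional, Tuple


def collect_stream(
    stream: Iterable,
    on_text: Optional[Callable[[str], None]] = None,
) -> Tuple[str, Optional[object]]:
    """Consume a chunk iterator; return (full_text, usage_or_None)."""
    parts = []
    usage = None
    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage  # normally present only on the final chunk
        if not chunk.choices:
            continue  # usage-only chunk: no text to process
        text = chunk.choices[0].delta.content
        if text:  # skips None and empty strings
            parts.append(text)
            if on_text is not None:
                on_text(text)
    return "".join(parts), usage


def fake_chunk(content=None, usage=None, has_choice=True):
    """Build a stand-in shaped like a ChatCompletionChunk (demo data only)."""
    choices = [SimpleNamespace(delta=SimpleNamespace(content=content))] if has_choice else []
    return SimpleNamespace(choices=choices, usage=usage)


demo_stream = [
    fake_chunk(content=None),  # role-only chunk, no text
    fake_chunk(content="Hello, "),
    fake_chunk(content="world"),
    fake_chunk(usage=SimpleNamespace(prompt_tokens=5, completion_tokens=3), has_choice=False),
]
text, usage = collect_stream(demo_stream)
print(text)
```

With a real server, the same helper could be called as collect_stream(stream, on_text=lambda t: print(t, end="", flush=True)), where stream is the object returned by client.chat.completions.create(..., stream=True).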

Code Reference

Source Location

Repository: sglang
File: python/sglang/srt/entrypoints/http_server.py
Lines: L1324-1331 (same endpoint as non-streaming, with stream=True)
Streaming logic: python/sglang/srt/entrypoints/openai/serving_base.py

Signature

# Client-side (OpenAI SDK)
stream = client.chat.completions.create(
    model: str,
    messages: List[Dict],
    stream: bool = True,
    stream_options: Optional[Dict] = None,  # e.g., {"include_usage": True}
    **sampling_params,
) -> Stream[ChatCompletionChunk]

# Each chunk has:
# chunk.choices[0].delta.content -> Optional[str]

Import

import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")

I/O Contract

Inputs

Name	Type	Required	Description
model	str	Yes	Model name
messages	List[Dict]	Yes	Conversation messages
stream	bool	Yes	Must be True for streaming
stream_options	Optional[Dict]	No	Options like {"include_usage": True}
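
As a rough illustration of how these inputs map onto the wire, the sketch below assembles the JSON body that a client would POST to /v1/chat/completions. The helper name build_stream_request is hypothetical; the field names mirror the table above, and nothing is sent over the network.

```python
import json
from typing import Dict, List


def build_stream_request(
    model: str,
    messages: List[Dict[str, str]],
    include_usage: bool = False,
    **sampling_params,
) -> Dict:
    """Assemble a streaming chat completion request body (hypothetical helper)."""
    body: Dict = {"model": model, "messages": messages, "stream": True}
    if include_usage:
        # stream_options is optional; omit it entirely when not requested.
        body["stream_options"] = {"include_usage": True}
    body.update(sampling_params)  # e.g. max_tokens, temperature
    return body


body = build_stream_request(
    "meta-llama/Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "Explain AI."}],
    include_usage=True,
    max_tokens=64,
)
print(json.dumps(body, indent=2))
```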

Outputs

Name	Type	Description
stream	Stream[ChatCompletionChunk]	Iterator that yields chunks whose delta.content holds generated text fragments

Usage Examples

Basic Streaming

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a short story."}],
    stream=True,
    max_tokens=256,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()  # Newline at end

With Usage Statistics

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain AI."}],
    stream=True,
    stream_options={"include_usage": True},
)

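for chunk in stream:
    # The usage-only final chunk may have an empty choices list, so guard before indexing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print(f"\nTokens: {chunk.usage.prompt_tokens} + {chunk.usage.completion_tokens}")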

Related Pages

Implements Principle

Principle:Sgl_project_Sglang_Streaming_Response_Handling
