Implementation:Vllm project Vllm OpenAI Streaming Client
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, Streaming, Real-Time Systems |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Concrete tool, provided by the openai Python SDK, for consuming streaming chat completion responses from a vLLM server.
Description
When stream=True is passed to client.chat.completions.create(), the OpenAI SDK returns a Stream[ChatCompletionChunk] iterator instead of a single ChatCompletion object. Each chunk represents an incremental piece of the model's response, delivered in real-time via Server-Sent Events (SSE).
On the server side, vLLM's FastAPI endpoint sends each generated token (or batch of tokens, controlled by stream_interval) as an SSE event. The OpenAI SDK transparently handles the SSE parsing, deserializing each event into a typed ChatCompletionChunk object.
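To make the wire format concrete, the sketch below parses a hand-written SSE transcript the way a minimal client might. It is a simplification and not the SDK's actual parser: the function name parse_sse_lines and the sample payloads are illustrative assumptions, and it handles only single-line data: fields plus the data: [DONE] end-of-stream sentinel used by OpenAI-compatible streams.

```python
import json
from typing import Iterable, Iterator


def parse_sse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until the [DONE] sentinel.

    Hypothetical helper for illustration; the openai SDK does this internally.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker, not a JSON chunk
        yield json.loads(payload)


# Hand-written transcript standing in for a real HTTP response body.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    "data: [DONE]",
]
events = list(parse_sse_lines(sample))
print(len(events))
```

In practice the SDK performs this step and additionally validates each payload into a typed ChatCompletionChunk, so application code rarely needs to touch raw SSE lines.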
Each chunk's content is found at chunk.choices[0].delta.content. The delta object is sparse: it contains only the fields that changed since the last chunk. The first chunk typically carries the role, middle chunks carry content fragments, and the final chunk has an empty delta, with finish_reason set on the enclosing choice.
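The sparse-delta behavior can be illustrated by folding a sequence of chunks back into a single message. The sketch below uses plain dicts as stand-ins for ChatCompletionChunk objects; merge_deltas and the sample chunks are hypothetical names and data chosen for illustration.

```python
def merge_deltas(chunks):
    """Fold the sparse deltas of choice 0 into one message plus a finish reason."""
    message = {"role": None, "content": ""}
    finish_reason = None
    for chunk in chunks:
        if not chunk["choices"]:
            continue  # some chunks may carry no choices at all
        choice = chunk["choices"][0]
        delta = choice.get("delta", {})
        if delta.get("role") is not None:
            message["role"] = delta["role"]  # usually only in the first chunk
        if delta.get("content") is not None:
            message["content"] += delta["content"]  # append the text fragment
        if choice.get("finish_reason") is not None:
            finish_reason = choice["finish_reason"]  # set on the final chunk
    return message, finish_reason


# Stand-in chunks: role first, then content fragments, then an empty delta.
chunks = [
    {"choices": [{"delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "Hel"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "lo"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
message, finish_reason = merge_deltas(chunks)
print(message, finish_reason)
# {'role': 'assistant', 'content': 'Hello'} stop
```

Fields absent from a delta are simply left unchanged in the accumulated message, which is why each field is checked independently.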
This page documents usage of the external openai SDK rather than a vLLM-owned client. The vLLM project provides streaming examples in examples/online_serving/openai_chat_completion_client.py and examples/online_serving/openai_chat_completion_with_reasoning_streaming.py.
Usage
Use streaming when building interactive applications where time-to-first-token matters. The client iterates over chunks and processes each content delta as it arrives. Always check that delta.content is not None before using it, as some chunks carry only metadata (role assignment, finish reason).
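Since time-to-first-token is the main reason to stream, it can be useful to measure it while consuming chunks. The following is a minimal sketch under stated assumptions: collect_with_ttft and fake_chunk are hypothetical helpers, and the fake chunks are SimpleNamespace stand-ins that mimic only the attributes used here, so the snippet runs without a server. It skips chunks with an empty choices list and chunks whose content is None or an empty string, on the assumption that a server may send such metadata-only chunks.

```python
import time
from types import SimpleNamespace


def collect_with_ttft(chunks, clock=time.perf_counter):
    """Return (full_text, seconds_until_first_non_empty_content)."""
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # no choices on this chunk, nothing to read
        content = chunk.choices[0].delta.content
        if not content:
            continue  # None or "" means a metadata-only chunk
        if ttft is None:
            ttft = clock() - start  # first visible text arrived
        parts.append(content)
    return "".join(parts), ttft


def fake_chunk(content=None, role=None):
    """Build a stand-in object shaped like the parts of a chunk used above."""
    delta = SimpleNamespace(content=content, role=role)
    choice = SimpleNamespace(delta=delta, finish_reason=None)
    return SimpleNamespace(choices=[choice])


fake_stream = [
    fake_chunk(role="assistant"),   # role only, no text
    fake_chunk("Hello"),
    fake_chunk(", world"),
    SimpleNamespace(choices=[]),    # chunk without choices
]
text, ttft = collect_with_ttft(fake_stream)
print(text)
```

Passing the stream returned by client.chat.completions.create(..., stream=True) in place of fake_stream should work the same way, since only chunk.choices and delta.content are accessed.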
Code Reference
Source Location
- Repository: openai-python (client SDK)
- File: External SDK; vLLM streaming example at examples/online_serving/openai_chat_completion_with_reasoning_streaming.py
- Server-side SSE: vllm/entrypoints/openai/ (FastAPI SSE endpoints)
Signature
# Schematic signature (not runnable as written): enable streaming by passing stream=True
stream = client.chat.completions.create(
    model: str,
    messages: list[dict],
    stream: Literal[True],  # must be passed explicitly; otherwise a single ChatCompletion is returned
    temperature: float = 1.0,
    max_tokens: int | None = None,
    ...
) -> Stream[ChatCompletionChunk]

# Iterate over the stream
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
Import
from openai import OpenAI
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | Name of the model being served by the vLLM server. |
| messages | list[dict] | Yes | Conversation history as role/content message dicts. |
| stream | bool | Yes | Must be set to True to enable streaming. |
| temperature | float | No | Sampling temperature. Default: 1.0. |
| max_tokens | int \| None | No | Maximum tokens to generate. Default: model-specific. |
| top_p | float | No | Nucleus sampling threshold. Default: 1.0. |
| stop | list[str] \| None | No | Stop sequences. Default: None. |
Outputs
| Name | Type | Description |
|---|---|---|
| Stream[ChatCompletionChunk] | Iterator | An iterable of chunk objects, one per SSE event. |
| chunk.choices[0].delta.content | str \| None | The incremental text fragment for this chunk. None for metadata-only chunks. |
| chunk.choices[0].delta.role | str \| None | The role (typically "assistant"), present only in the first chunk. |
| chunk.choices[0].finish_reason | str \| None | Set to "stop" or "length" in the final chunk; None for intermediate chunks. |
| chunk.id | str | Unique identifier for the completion, consistent across all chunks. |
| chunk.model | str | The model that generated the response. |
Usage Examples
Basic Streaming Chat
from openai import OpenAI

# vLLM does not check the API key by default, so a placeholder is used
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id  # first model served by the server

stream = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    stream=True,
    max_tokens=512,
)

# Print each text fragment as soon as it arrives
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
print()  # Final newline
Collecting Full Response from Stream
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
    stream=True,
)

# Accumulate the full response
full_response = []
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        full_response.append(content)

print("".join(full_response))
Streaming with Reasoning Models
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    stream=True,
)

for chunk in stream:
    # Reasoning models may include a reasoning field on the delta
    reasoning = getattr(chunk.choices[0].delta, "reasoning", None)
    content = chunk.choices[0].delta.content
    if reasoning:
        print(f"[reasoning] {reasoning}", end="", flush=True)
    elif content:
        print(content, end="", flush=True)
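When the reasoning trace and the final answer are needed as separate strings rather than interleaved output, the two kinds of delta can be accumulated independently. The sketch below is one possible approach, not the vLLM example's code: split_reasoning and the fake chunks are hypothetical, and the attribute name "reasoning" follows the example above, so it is exposed as a parameter in case a given server version uses a different field name.

```python
from types import SimpleNamespace


def split_reasoning(chunks, reasoning_attr="reasoning"):
    """Accumulate reasoning text and answer text from a chunk stream separately."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # nothing to read on this chunk
        delta = chunk.choices[0].delta
        # The reasoning attribute may be absent entirely on non-reasoning models
        reasoning = getattr(delta, reasoning_attr, None)
        content = getattr(delta, "content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(**delta_fields):
    """Stand-in for a chunk, carrying only the delta fields given."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(reasoning="Compare 0.11 ", content=None),
    fake_chunk(reasoning="with 0.80.", content=None),
    fake_chunk(content="9.8 is "),   # no reasoning attribute at all
    fake_chunk(content="greater."),
]
thoughts, answer = split_reasoning(fake_stream)
print(thoughts)
print(answer)
```

Using getattr with a default keeps the loop safe for models that never emit a reasoning field.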