Implementation:Vllm project Vllm OpenAI Streaming Client

Knowledge Sources	vLLM, OpenAI Python SDK, vLLM Docs
Domains	LLM Serving, Streaming, Real-Time Systems
Last Updated	2026-02-08 13:00 GMT

Overview

A concrete tool, provided by the openai Python SDK, for consuming streaming chat completion responses from a vLLM server.

Description

When stream=True is passed to client.chat.completions.create(), the OpenAI SDK returns a Stream[ChatCompletionChunk] iterator instead of a single ChatCompletion object. Each chunk carries an incremental piece of the model's response, delivered in real time via Server-Sent Events (SSE).

On the server side, vLLM's FastAPI endpoint sends each generated token (or batch of tokens, controlled by stream_interval) as an SSE event. The OpenAI SDK transparently handles the SSE parsing, deserializing each event into a typed ChatCompletionChunk object.
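
To make the SSE handling less abstract, here is a minimal sketch of the parsing the SDK performs internally. It assumes the usual OpenAI-compatible wire format, in which each event is a `data:` line carrying a JSON object and the stream ends with `data: [DONE]`. The helper name `iter_sse_payloads` and the sample lines are hypothetical and hand-written for illustration; they are not captured vLLM output, and the SDK's real parser is more thorough.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line until `[DONE]`."""
    for line in lines:
        line = line.strip()
        # Blank lines separate events; non-data fields are ignored in this sketch.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # Sentinel that marks the end of the stream
        yield json.loads(payload)


# Hand-written sample in the shape of an OpenAI-compatible stream (illustrative only).
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]

events = list(iter_sse_payloads(sample_lines))
text = "".join(e["choices"][0]["delta"].get("content") or "" for e in events)
print(text)
```

In normal use none of this is needed: iterating over the object returned by the SDK yields already-parsed ChatCompletionChunk objects.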

Each chunk's content is found at chunk.choices[0].delta.content. The delta object is sparse: it contains only the fields that changed since the last chunk. The first chunk typically includes the role, middle chunks include content fragments, and the final chunk has an empty delta with a finish_reason.
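
Because each delta is sparse, a client that wants the complete assistant message has to merge the deltas itself. The following is a minimal sketch of that merge. `merge_deltas` and `fake_chunk` are hypothetical helpers, and the stand-in chunks are built with `SimpleNamespace` so the example runs without a server; they mimic only the attributes described above.

```python
from types import SimpleNamespace


def merge_deltas(chunks):
    """Fold sparse deltas into one {"role", "content"} message dict."""
    message = {"role": None, "content": ""}
    for chunk in chunks:
        delta = chunk.choices[0].delta
        role = getattr(delta, "role", None)
        content = getattr(delta, "content", None)
        if role is not None:
            message["role"] = role  # Usually present only on the first chunk
        if content is not None:
            message["content"] += content  # Append each text fragment in order
    return message


def fake_chunk(role=None, content=None, finish_reason=None):
    """Build a stand-in object with the same attribute layout as a chunk."""
    delta = SimpleNamespace(role=role, content=content)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


chunks = [
    fake_chunk(role="assistant", content=""),  # First chunk: role assignment
    fake_chunk(content="Hel"),                 # Middle chunks: content fragments
    fake_chunk(content="lo"),
    fake_chunk(finish_reason="stop"),          # Final chunk: empty delta
]
message = merge_deltas(chunks)
print(message)
```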

This is a wrapper around the external openai SDK. The vLLM project provides streaming examples in examples/online_serving/openai_chat_completion_client.py and examples/online_serving/openai_chat_completion_with_reasoning_streaming.py.

Usage

Use streaming when building interactive applications where time-to-first-token matters. The client iterates over chunks and processes each content delta as it arrives. Always check that delta.content is not None before using it, as some chunks carry only metadata (role assignment, finish reason).
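
One way to keep the None check out of application code is a small generator that yields only text fragments. This is a sketch, not part of the SDK: `iter_text` and `_chunk` are hypothetical names. It also skips chunks whose `choices` list is empty, on the assumption that a server may emit such a chunk (for example a usage-only chunk when usage reporting is requested).

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_text(stream: Iterable) -> Iterator[str]:
    """Yield non-empty content fragments, skipping metadata-only chunks."""
    for chunk in stream:
        if not chunk.choices:
            continue  # Assumed case: chunk with no choices, e.g. usage only
        content = chunk.choices[0].delta.content
        if content:  # Skips both None and the empty string
            yield content


def _chunk(content=None, with_choice=True):
    """Stand-in for a ChatCompletionChunk, for offline illustration only."""
    if not with_choice:
        return SimpleNamespace(choices=[])
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta, finish_reason=None)])


fake_stream = [
    _chunk(content=None),       # Role-only chunk
    _chunk(content="Hi"),
    _chunk(content=" there"),
    _chunk(with_choice=False),  # Chunk with an empty choices list
]
fragments = list(iter_text(fake_stream))
print("".join(fragments))
```

With a real client, the same generator can wrap the object returned by client.chat.completions.create(..., stream=True), and each fragment can be forwarded to the UI as it arrives.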

Code Reference

Source Location

Repository: openai-python (client SDK)
File: External SDK; vLLM streaming example at examples/online_serving/openai_chat_completion_with_reasoning_streaming.py
Server-side SSE: vllm/entrypoints/openai/ (FastAPI SSE endpoints)

Signature

# Schematic call signature (not runnable as written); pass stream=True to enable streaming
stream = client.chat.completions.create(
    model: str,
    messages: list[dict],
    stream: bool = True,
    temperature: float = 1.0,
    max_tokens: int | None = None,
    ...
) -> Stream[ChatCompletionChunk]

# Iterate over the stream
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)

Import

from openai import OpenAI

I/O Contract

Inputs

Name	Type	Required	Description
model	`str`	Yes	Name of the model being served by the vLLM server.
messages	`list[dict]`	Yes	Conversation history as role/content message dicts.
stream	`bool`	Yes	Must be set to `True` to enable streaming.
temperature	`float`	No	Sampling temperature. Default: 1.0.
max_tokens	`int | None`	No	Maximum tokens to generate. Default: model-specific.
top_p	`float`	No	Nucleus sampling threshold. Default: 1.0.
stop	`list[str] | None`	No	Stop sequences. Default: None.

Outputs

Name	Type	Description
Stream[ChatCompletionChunk]	`Iterator`	An iterable of chunk objects, one per SSE event.
chunk.choices[0].delta.content	`str | None`	The incremental text fragment for this chunk. None for metadata-only chunks.
chunk.choices[0].delta.role	`str | None`	The role (typically "assistant"), present only in the first chunk.
chunk.choices[0].finish_reason	`str | None`	Set to a value such as "stop" or "length" in the final chunk; None for intermediate chunks.
chunk.id	`str`	Unique identifier for the completion, consistent across all chunks.
chunk.model	`str`	The model that generated the response.
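
The output fields above can be gathered into a single summary once the stream ends, which makes it easy to tell a natural stop from a response cut off by max_tokens. The sketch below is illustrative: `StreamSummary`, `summarize`, and `fake` are hypothetical names, and the stand-in chunks and their id and model values are invented so the example runs offline.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Iterable, Optional


@dataclass
class StreamSummary:
    completion_id: Optional[str] = None
    model: Optional[str] = None
    finish_reason: Optional[str] = None
    text: str = ""

    @property
    def truncated(self) -> bool:
        # "length" means generation stopped at the token limit
        return self.finish_reason == "length"


def summarize(stream: Iterable) -> StreamSummary:
    """Consume a stream and record id, model, text, and finish_reason."""
    summary = StreamSummary()
    for chunk in stream:
        # id and model are the same on every chunk, so keep the first seen
        summary.completion_id = summary.completion_id or chunk.id
        summary.model = summary.model or chunk.model
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            summary.text += choice.delta.content
        if choice.finish_reason is not None:
            summary.finish_reason = choice.finish_reason
    return summary


def fake(content=None, finish_reason=None):
    """Stand-in chunk with invented id and model values."""
    delta = SimpleNamespace(content=content)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(id="chatcmpl-demo", model="demo-model", choices=[choice])


summary = summarize([fake("One two"), fake(" three"), fake(finish_reason="length")])
print(summary.truncated)
```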

Usage Examples

Basic Streaming Chat

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    stream=True,
    max_tokens=512,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
print()  # Final newline

Collecting Full Response from Stream

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
    stream=True,
)

# Accumulate the full response
full_response = []
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        full_response.append(content)

print("".join(full_response))

Streaming with Reasoning Models

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    stream=True,
)

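in_reasoning = False
for chunk in stream:
    delta = chunk.choices[0].delta
    # Reasoning models may include a reasoning field on the delta
    reasoning = getattr(delta, "reasoning", None)
    content = delta.content

    if reasoning:
        if not in_reasoning:  # Print the label once, not once per fragment
            print("[reasoning] ", end="", flush=True)
            in_reasoning = True
        print(reasoning, end="", flush=True)
    elif content:
        if in_reasoning:  # Start the answer on its own line
            print()
            in_reasoning = False
        print(content, end="", flush=True)
print()  # Final newline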

Related Pages

Implements Principle

Principle:Vllm_project_Vllm_Streaming_Response_Handling
