Implementation:Vllm project Vllm OpenAI Streaming Client

From Leeroopedia
Revision as of 17:06, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Vllm_project_Vllm_OpenAI_Streaming_Client.md)


Knowledge Sources
Domains LLM Serving, Streaming, Real-Time Systems
Last Updated 2026-02-08 13:00 GMT

Overview

Concrete tool, provided by the openai Python SDK, for consuming streaming chat completion responses from a vLLM server.

Description

When stream=True is passed to client.chat.completions.create(), the OpenAI SDK returns a Stream[ChatCompletionChunk] iterator instead of a single ChatCompletion object. Each chunk represents an incremental piece of the model's response, delivered in real-time via Server-Sent Events (SSE).

On the server side, vLLM's FastAPI endpoint sends each generated token (or batch of tokens, controlled by stream_interval) as an SSE event. The OpenAI SDK transparently handles the SSE parsing, deserializing each event into a typed ChatCompletionChunk object.
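To make concrete what the SDK does on the client's behalf, the following is a minimal sketch of SSE parsing. It assumes the OpenAI-compatible wire format, in which each event is a `data: <json>` line and the stream ends with a `data: [DONE]` sentinel. The helper name parse_sse_lines and the sample payloads are illustrative only and are not part of vLLM or the openai SDK.

```python
import json
from typing import Iterable, Iterator


def parse_sse_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON payload per SSE `data:` line.

    Hypothetical helper for illustration; the openai SDK does the
    equivalent work internally and returns typed chunk objects.
    """
    for raw in lines:
        line = raw.strip()
        # Blank lines separate events; lines starting with ":" are comments.
        if not line or line.startswith(":"):
            continue
        # Ignore other SSE fields such as event:, id:, and retry:.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # Assumed end-of-stream sentinel in the OpenAI-compatible format.
        if payload == "[DONE]":
            break
        yield json.loads(payload)


# Sample wire data, made up for this sketch.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    "data: [DONE]",
]
events = list(parse_sse_lines(sample))
```

In normal use none of this is needed, since iterating over the Stream object already yields parsed chunks.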

Each chunk's content is found at chunk.choices[0].delta.content. The delta object is sparse: it contains only the fields that changed since the last chunk. The first chunk typically carries the role, middle chunks carry content fragments, and the final chunk has an empty delta while finish_reason is set on the choice itself (chunk.choices[0].finish_reason).
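The sparse-delta behaviour can be illustrated with a small accumulator that merges deltas into a single message. This is a sketch only: accumulate and fake_chunk are hypothetical names, and the SimpleNamespace objects stand in for ChatCompletionChunk so the example runs without a server. The guard for an empty choices list is an assumption that some streams may include chunks without choices, for example usage-only chunks.

```python
from types import SimpleNamespace


def accumulate(chunks):
    """Merge sparse deltas into one message dict (illustrative helper)."""
    message = {"role": None, "content": "", "finish_reason": None}
    for chunk in chunks:
        if not chunk.choices:  # Assumed possible: a chunk with no choices.
            continue
        choice = chunk.choices[0]
        # Only fields that changed are present, so update selectively.
        if getattr(choice.delta, "role", None):
            message["role"] = choice.delta.role
        if getattr(choice.delta, "content", None):
            message["content"] += choice.delta.content
        if choice.finish_reason is not None:
            message["finish_reason"] = choice.finish_reason
    return message


def fake_chunk(role=None, content=None, finish_reason=None):
    """Build a stand-in object shaped like a ChatCompletionChunk."""
    delta = SimpleNamespace(role=role, content=content)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])


chunks = [
    fake_chunk(role="assistant"),      # First chunk: role only
    fake_chunk(content="Hel"),         # Middle chunks: content fragments
    fake_chunk(content="lo"),
    fake_chunk(finish_reason="stop"),  # Final chunk: empty delta
]
result = accumulate(chunks)
```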

This is a wrapper around the external openai SDK. The vLLM project provides streaming examples in examples/online_serving/openai_chat_completion_client.py and examples/online_serving/openai_chat_completion_with_reasoning_streaming.py.

Usage

Use streaming when building interactive applications where time-to-first-token matters. The client iterates over chunks and processes each content delta as it arrives. Always check that delta.content is not None before using it, as some chunks carry only metadata (role assignment, finish reason).
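Since time-to-first-token is the motivation for streaming, it can be useful to measure it on the client. The sketch below wraps any iterable of chunk-like objects and records the delay until the first non-None content delta. The function stream_with_ttft and the stand-in chunks are hypothetical; with a real server, the stream returned by client.chat.completions.create(..., stream=True) would be passed instead, and the timer would ideally start before the request is sent.

```python
import time
from types import SimpleNamespace


def stream_with_ttft(chunks):
    """Return (seconds until first content delta, full text).

    The first element is None if no content delta ever arrives.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        # Metadata-only chunks (role, finish reason) carry no content.
        if content is None:
            continue
        if ttft is None:
            ttft = time.perf_counter() - start
        parts.append(content)
    return ttft, "".join(parts)


def _chunk(content):
    """Stand-in for a ChatCompletionChunk with a single choice."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


ttft, text = stream_with_ttft([_chunk(None), _chunk("Hi"), _chunk(" there")])
empty_ttft, empty_text = stream_with_ttft([_chunk(None)])
```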

Code Reference

Source Location

  • Repository: openai-python (client SDK)
  • File: External SDK; vLLM streaming example at examples/online_serving/openai_chat_completion_with_reasoning_streaming.py
  • Server-side SSE: vllm/entrypoints/openai/ (FastAPI SSE endpoints)

Signature

# Enable streaming by passing stream=True
stream = client.chat.completions.create(
    model: str,
    messages: list[dict],
    stream: bool = True,
    temperature: float = 1.0,
    max_tokens: int | None = None,
    ...
) -> Stream[ChatCompletionChunk]

# Iterate over the stream
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)

Import

from openai import OpenAI

I/O Contract

Inputs

Name Type Required Description
model str Yes Name of the model being served by the vLLM server.
messages list[dict] Yes Conversation history as role/content message dicts.
stream bool Yes Must be set to True to enable streaming.
temperature float No Sampling temperature. Default: 1.0.
max_tokens int | None No Maximum tokens to generate. Default: model-specific.
top_p float No Nucleus sampling threshold. Default: 1.0.
stop list[str] | None No Stop sequences. Default: None.
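One way to keep these inputs consistent across call sites is to assemble them in a small helper that always sets stream=True and omits any optional parameter left unset, so that the server-side defaults apply. This is a sketch; build_stream_request is a hypothetical name, not an SDK function.

```python
def build_stream_request(model, messages, *, temperature=None,
                         max_tokens=None, top_p=None, stop=None):
    """Build keyword arguments for a streaming chat completion request."""
    request = {"model": model, "messages": messages, "stream": True}
    optional = {
        "temperature": temperature,
        "max_tokens": max_tokens,
        "top_p": top_p,
        "stop": stop,
    }
    # Drop unset options so defaults are decided by the server.
    request.update({k: v for k, v in optional.items() if v is not None})
    return request


request = build_stream_request(
    "my-model",  # Placeholder model name for this sketch
    [{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
```

The resulting dict can then be unpacked into the call, as in client.chat.completions.create(**request).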

Outputs

Name Type Description
Stream[ChatCompletionChunk] Iterator An iterable of chunk objects, one per SSE event.
chunk.choices[0].delta.content str | None The incremental text fragment for this chunk. None for metadata-only chunks.
chunk.choices[0].delta.role str | None The role (typically "assistant"), present only in the first chunk.
chunk.choices[0].finish_reason str | None Set in the final chunk, for example "stop" or "length"; None for intermediate chunks.
chunk.id str Unique identifier for the completion, consistent across all chunks.
chunk.model str The model that generated the response.
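The output fields above can be used to sanity-check a finished stream. The sketch below collects the distinct chunk.id values and reports whether generation was cut off by the token limit, taking finish_reason == "length" as the truncation signal. The name inspect_stream and the stand-in chunks are hypothetical.

```python
from types import SimpleNamespace


def inspect_stream(chunks):
    """Summarize chunk ids and the final finish_reason of a stream."""
    ids = set()
    finish_reason = None
    for chunk in chunks:
        ids.add(chunk.id)
        if chunk.choices and chunk.choices[0].finish_reason is not None:
            finish_reason = chunk.choices[0].finish_reason
    return {
        "ids": ids,
        "finish_reason": finish_reason,
        "truncated": finish_reason == "length",
    }


def _chunk(chunk_id, finish_reason=None):
    """Stand-in for a ChatCompletionChunk with a single choice."""
    choice = SimpleNamespace(finish_reason=finish_reason)
    return SimpleNamespace(id=chunk_id, choices=[choice])


summary = inspect_stream([
    _chunk("chatcmpl-demo"),
    _chunk("chatcmpl-demo"),
    _chunk("chatcmpl-demo", finish_reason="length"),
])
```

A truncated result is a hint that max_tokens may need to be raised for the request.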

Usage Examples

Basic Streaming Chat

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    stream=True,
    max_tokens=512,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
print()  # Final newline

Collecting Full Response from Stream

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Write a haiku about Python."}],
    stream=True,
)

# Accumulate the full response
full_response = []
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
        full_response.append(content)

print("".join(full_response))

Streaming with Reasoning Models

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
    stream=True,
)

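label_printed = False
for chunk in stream:
    # Reasoning models may include a reasoning field on the delta
    delta = chunk.choices[0].delta
    reasoning = getattr(delta, "reasoning", None)
    content = delta.content

    if reasoning:
        # Print the label once rather than before every fragment
        if not label_printed:
            print("[reasoning] ", end="", flush=True)
            label_printed = True
        print(reasoning, end="", flush=True)
    elif content:
        print(content, end="", flush=True)
print()  # Final newline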
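When the reasoning trace and the final answer need to be handled separately, the two can be accumulated into separate buffers instead of being printed inline. The sketch below assumes, as in the example above, that reasoning text arrives on a reasoning attribute of the delta when present. split_reasoning and the stand-in chunks are hypothetical and exist only so the example runs without a server.

```python
from types import SimpleNamespace


def split_reasoning(chunks):
    """Return (reasoning_text, answer_text) from a chunk iterable."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        # The reasoning attribute may be absent on non-reasoning models.
        reasoning = getattr(delta, "reasoning", None)
        content = getattr(delta, "content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


def _chunk(**fields):
    """Stand-in chunk whose delta carries only the given fields."""
    delta = SimpleNamespace(**fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


reasoning_text, answer_text = split_reasoning([
    _chunk(reasoning="Compare 9.11 ", content=None),
    _chunk(reasoning="with 9.80.", content=None),
    _chunk(content="9.8 is greater."),
])
```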

Related Pages

Implements Principle
