Implementation: vLLM OpenAI Chat Completions
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, API Integration, Client Libraries |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete tool, provided by the openai Python SDK, for sending chat completion requests to a vLLM server.
Description
The OpenAI Python SDK's client.chat.completions.create() method sends a chat completion request to any OpenAI-compatible API endpoint. When pointed at a vLLM server via the base_url parameter, it constructs an HTTP POST request to /v1/chat/completions, serializes the message history and parameters as JSON, and deserializes the response into typed Python objects.
The vLLM server processes the request through its FastAPI frontend, applies the model's chat template to convert messages into a prompt, runs inference through the engine, and returns a structured response matching the OpenAI specification.
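To make the serialization step concrete, the sketch below constructs (but deliberately does not send) the HTTP POST request that the SDK issues under the hood. The model name is a placeholder, and the payload shows only a subset of the fields the OpenAI spec defines.

```python
import json
import urllib.request

# What client.chat.completions.create() sends under the hood: an HTTP POST
# with a JSON body to the /v1/chat/completions route.
payload = {
    "model": "my-model",  # placeholder; must match what the server serves
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer EMPTY",  # mirrors api_key="EMPTY"
    },
    method="POST",
)
print(req.get_method(), req.full_url)
```

The response body that comes back is JSON in the same spec's shape, which the SDK deserializes into a typed ChatCompletion object.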
This is a wrapper around the external openai SDK rather than a vLLM-internal API. The vLLM project provides example clients in examples/online_serving/openai_chat_completion_client.py.
Usage
Use this API to interact with a running vLLM server from any Python application. The client can be configured once and reused across multiple requests. The model name must match what the vLLM server is serving (discoverable via client.models.list()).
Code Reference
Source Location
- Repository: openai-python (client SDK)
- File: external SDK; vLLM example at examples/online_serving/openai_chat_completion_client.py
- Server-side handler: vllm/entrypoints/openai/ (FastAPI route handlers)
Signature
# Client instantiation
client = openai.OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
)
# Chat completions call
response = client.chat.completions.create(
model: str,
messages: list[dict],
temperature: float = 1.0,
top_p: float = 1.0,
max_tokens: int | None = 16,
stream: bool = False,
n: int = 1,
stop: str | list[str] | None = None,
presence_penalty: float = 0.0,
frequency_penalty: float = 0.0,
logprobs: bool | None = None,
top_logprobs: int | None = None,
) -> ChatCompletion
Import
from openai import OpenAI
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| base_url | str | Yes | URL of the vLLM server's v1 API (e.g., "http://localhost:8000/v1"). |
| api_key | str | Yes | API key matching the server's --api-key flag, or any string if auth is disabled. |
| model | str | Yes | Name of the model being served. Use client.models.list() to discover. |
| messages | list[dict] | Yes | Conversation history as a list of {"role": str, "content": str} dicts. Roles: "system", "user", "assistant". |
| temperature | float | No | Sampling temperature (0.0 = greedy, higher = more random). Default: 1.0. |
| max_tokens | int \| None | No | Maximum number of tokens to generate. Default: 16. |
| top_p | float | No | Nucleus sampling threshold. Default: 1.0. |
| stream | bool | No | If True, return a streaming iterator of chunks. Default: False. |
| n | int | No | Number of completions to generate. Default: 1. |
| stop | str \| list[str] \| None | No | Stop sequences that halt generation. Default: None. |
Outputs
| Name | Type | Description |
|---|---|---|
| ChatCompletion | openai.types.chat.ChatCompletion | Structured response containing generated message(s), finish reason, and usage statistics. |
| choices[0].message.content | str | The generated text content from the assistant. |
| choices[0].finish_reason | str | Why generation stopped: "stop" (natural end or stop sequence) or "length" (max_tokens reached). |
| usage.prompt_tokens | int | Number of tokens in the input prompt. |
| usage.completion_tokens | int | Number of tokens in the generated completion. |
| usage.total_tokens | int | Total tokens consumed (prompt + completion). |
Usage Examples
Basic Chat Completion
from openai import OpenAI
# Point the client at a running vLLM server
client = OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1",
)
# Discover the served model
models = client.models.list()
model = models.data[0].id
# Send a chat completion request
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
],
temperature=0.7,
max_tokens=256,
)
print(response.choices[0].message.content)
Multi-Turn Conversation
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant",
"content": "The Los Angeles Dodgers won the World Series in 2020."},
{"role": "user", "content": "Where was it played?"},
]
response = client.chat.completions.create(
model=model,
messages=messages,
)
print(response.choices[0].message.content)
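Streaming Responses

With stream=True, create() returns an iterator of ChatCompletionChunk objects whose choices[0].delta.content carries incremental text. A minimal accumulation loop, sketched here with simple stand-in chunk objects rather than a live server:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the incremental delta.content pieces of a chat stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:  # the final chunk may carry no content
            parts.append(delta.content)
    return "".join(parts)

# Stand-in chunks mimicking the shape of ChatCompletionChunk. Against a
# live server you would instead pass the iterator returned by
# client.chat.completions.create(..., stream=True) directly.
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["The ", "Dodgers", ".", None]
]
print(collect_stream(fake_chunks))  # -> The Dodgers.
```

Streaming lets a UI render tokens as they arrive instead of waiting for the full completion.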