Implementation: vLLM OpenAI Chat Completions
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, API Integration, Client Libraries |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete tool, provided by the openai Python SDK, for sending chat completion requests to a vLLM server.
Description
The OpenAI Python SDK's client.chat.completions.create() method sends a chat completion request to any OpenAI-compatible API endpoint. When pointed at a vLLM server via the base_url parameter, it constructs an HTTP POST request to /v1/chat/completions, serializes the message history and parameters as JSON, and deserializes the response into typed Python objects.
The vLLM server processes the request through its FastAPI frontend, applies the model's chat template to convert messages into a prompt, runs inference through the engine, and returns a structured response matching the OpenAI specification.
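To make the serialization step concrete, the sketch below constructs (but deliberately does not send) the HTTP POST request that the SDK issues under the hood. The model name is a placeholder, and the payload shows only a subset of the fields the OpenAI spec defines.

```python
import json
import urllib.request

# What client.chat.completions.create() sends under the hood: an HTTP POST
# with a JSON body to the /v1/chat/completions route.
payload = {
    "model": "my-model",  # placeholder; must match what the server serves
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "max_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer EMPTY",  # mirrors api_key="EMPTY"
    },
    method="POST",
)
print(req.get_method(), req.full_url)
```

The response body that comes back is JSON in the same spec's shape, which the SDK deserializes into a typed ChatCompletion object.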
This is a wrapper around the external openai SDK rather than a vLLM-internal API. The vLLM project provides example clients in examples/online_serving/openai_chat_completion_client.py.
Usage
Use this API to interact with a running vLLM server from any Python application. The client can be configured once and reused across multiple requests. The model name must match what the vLLM server is serving (discoverable via client.models.list()).
Code Reference
Source Location
- Repository: openai-python (client SDK)
- File: external SDK; vLLM example at examples/online_serving/openai_chat_completion_client.py
- Server-side handler: vllm/entrypoints/openai/ (FastAPI route handlers)
Signature
# Client instantiation
client = openai.OpenAI(
base_url="http://localhost:8000/v1",
api_key="EMPTY",
)
# Chat completions call
response = client.chat.completions.create(
model: str,
messages: list[dict],
temperature: float = 1.0,
top_p: float = 1.0,
max_tokens: int | None = 16,
stream: bool = False,
n: int = 1,
stop: str | list[str] | None = None,
presence_penalty: float = 0.0,
frequency_penalty: float = 0.0,
logprobs: bool | None = None,
top_logprobs: int | None = None,
) -> ChatCompletion
Import
from openai import OpenAI
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| base_url | str | Yes | URL of the vLLM server's v1 API (e.g., "http://localhost:8000/v1"). |
| api_key | str | Yes | API key matching the server's --api-key flag, or any string if auth is disabled. |
| model | str | Yes | Name of the model being served. Use client.models.list() to discover. |
| messages | list[dict] | Yes | Conversation history as a list of {"role": str, "content": str} dicts. Roles: "system", "user", "assistant". |
| temperature | float | No | Sampling temperature (0.0 = greedy, higher = more random). Default: 1.0. |
| max_tokens | int \| None | No | Maximum number of tokens to generate. Default: 16. |
| top_p | float | No | Nucleus sampling threshold. Default: 1.0. |
| stream | bool | No | If True, return a streaming iterator of chunks. Default: False. |
| n | int | No | Number of completions to generate. Default: 1. |
| stop | str \| list[str] \| None | No | Stop sequences that halt generation. Default: None. |
Outputs
| Name | Type | Description |
|---|---|---|
| ChatCompletion | openai.types.chat.ChatCompletion | Structured response containing generated message(s), finish reason, and usage statistics. |
| choices[0].message.content | str | The generated text content from the assistant. |
| choices[0].finish_reason | str | Why generation stopped: "stop" (natural end or stop sequence) or "length" (max_tokens reached). |
| usage.prompt_tokens | int | Number of tokens in the input prompt. |
| usage.completion_tokens | int | Number of tokens in the generated completion. |
| usage.total_tokens | int | Total tokens consumed (prompt + completion). |
Usage Examples
Basic Chat Completion
from openai import OpenAI
# Point the client at a running vLLM server
client = OpenAI(
api_key="EMPTY",
base_url="http://localhost:8000/v1",
)
# Discover the served model
models = client.models.list()
model = models.data[0].id
# Send a chat completion request
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
],
temperature=0.7,
max_tokens=256,
)
print(response.choices[0].message.content)
Multi-Turn Conversation
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant",
"content": "The Los Angeles Dodgers won the World Series in 2020."},
{"role": "user", "content": "Where was it played?"},
]
response = client.chat.completions.create(
model=model,
messages=messages,
)
print(response.choices[0].message.content)
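Streaming Responses

With stream=True, create() returns an iterator of ChatCompletionChunk objects whose choices[0].delta.content carries incremental text. A minimal accumulation loop, sketched here with simple stand-in chunk objects rather than a live server:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the incremental delta.content pieces of a chat stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:  # the final chunk may carry no content
            parts.append(delta.content)
    return "".join(parts)

# Stand-in chunks mimicking the shape of ChatCompletionChunk. Against a
# live server you would instead pass the iterator returned by
# client.chat.completions.create(..., stream=True) directly.
fake_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ["The ", "Dodgers", ".", None]
]
print(collect_stream(fake_chunks))  # -> The Dodgers.
```

Streaming lets a UI render tokens as they arrive instead of waiting for the full completion.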