
Implementation:Vllm project Vllm OpenAI Chat Completions

From Leeroopedia


Knowledge Sources
Domains LLM Serving, API Integration, Client Libraries
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by the openai Python SDK, for sending chat completion requests to a running vLLM server.

Description

The OpenAI Python SDK's client.chat.completions.create() method sends a chat completion request to any OpenAI-compatible API endpoint. When pointed at a vLLM server via the base_url parameter, it constructs an HTTP POST request to /v1/chat/completions, serializes the message history and parameters as JSON, and deserializes the response into typed Python objects.

The vLLM server processes the request through its FastAPI frontend, applies the model's chat template to convert messages into a prompt, runs inference through the engine, and returns a structured response matching the OpenAI specification.
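Under the hood this is an ordinary HTTP POST carrying a JSON body; a minimal sketch of the same request without the SDK, using only the standard library, might look like the following (the localhost:8000 address is an assumption, matching the examples below, and the helper names are illustrative, not part of any API):

```python
import json
import urllib.request


def build_chat_payload(model, messages, **params):
    """Assemble the JSON body that the SDK serializes for the request."""
    payload = {"model": model, "messages": messages}
    payload.update(params)  # temperature, max_tokens, stream, ...
    return payload


def post_chat_completion(payload, base_url="http://localhost:8000/v1",
                         api_key="EMPTY"):
    """POST the payload to /v1/chat/completions (requires a running server)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The SDK adds typed response objects, retries, and streaming support on top of this, but the wire format is exactly this JSON body and endpoint path.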

This is a wrapper around the external openai SDK rather than a vLLM-internal API. The vLLM project provides example clients in examples/online_serving/openai_chat_completion_client.py.

Usage

Use this API to interact with a running vLLM server from any Python application. The client can be configured once and reused across multiple requests. The model name must match what the vLLM server is serving (discoverable via client.models.list()).

Code Reference

Source Location

  • Repository: openai-python (client SDK)
  • File: External SDK; vLLM example at examples/online_serving/openai_chat_completion_client.py
  • Server-side handler: vllm/entrypoints/openai/ (FastAPI route handlers)

Signature

# Client instantiation
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

# Chat completions call
response = client.chat.completions.create(
    model: str,
    messages: list[dict],
    temperature: float = 1.0,
    top_p: float = 1.0,
    max_tokens: int | None = 16,
    stream: bool = False,
    n: int = 1,
    stop: str | list[str] | None = None,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    logprobs: bool | None = None,
    top_logprobs: int | None = None,
) -> ChatCompletion

Import

from openai import OpenAI

I/O Contract

Inputs

Name Type Required Description
base_url str Yes URL of the vLLM server's v1 API (e.g., "http://localhost:8000/v1").
api_key str Yes API key matching the server's --api-key flag, or any string if auth is disabled.
model str Yes Name of the model being served. Use client.models.list() to discover.
messages list[dict] Yes Conversation history as a list of {"role": str, "content": str} dicts. Roles: "system", "user", "assistant".
temperature float No Sampling temperature (0.0 = greedy, higher = more random). Default: 1.0.
max_tokens int | None No Maximum number of tokens to generate. Default: 16.
top_p float No Nucleus sampling threshold. Default: 1.0.
stream bool No If True, return a streaming iterator of chunks. Default: False.
n int No Number of completions to generate. Default: 1.
stop str | list[str] | None No Stop sequence(s) that halt generation. Default: None.
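As an illustration of how the optional sampling inputs combine in a single call (the specific values here are arbitrary placeholders, not tuned recommendations):

```python
# Illustrative sampling settings for client.chat.completions.create();
# the values are placeholders, not recommendations.
sampling_params = {
    "temperature": 0.2,   # low temperature: close to greedy decoding
    "top_p": 0.95,        # nucleus sampling threshold
    "max_tokens": 128,    # cap on generated tokens per completion
    "n": 2,               # return two independent completions
    "stop": ["\n\n"],     # halt generation at the first blank line
}

# Passed through as keyword arguments alongside model and messages:
# response = client.chat.completions.create(
#     model=model, messages=messages, **sampling_params
# )
```

With n=2, the response's choices list contains two entries, each with its own message and finish_reason.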

Outputs

Name Type Description
ChatCompletion openai.types.chat.ChatCompletion Structured response containing generated message(s), finish reason, and usage statistics.
choices[0].message.content str The generated text content from the assistant.
choices[0].finish_reason str Why generation stopped: "stop" (natural end or stop sequence) or "length" (max_tokens reached).
usage.prompt_tokens int Number of tokens in the input prompt.
usage.completion_tokens int Number of tokens in the generated completion.
usage.total_tokens int Total tokens consumed (prompt + completion).
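A small helper can pull these output fields into one place; this is a sketch against the attribute layout described above (the function name is illustrative, and any ChatCompletion-shaped object will work):

```python
def summarize_response(response):
    """Collect the common output fields of a ChatCompletion into a dict."""
    choice = response.choices[0]
    return {
        "text": choice.message.content,
        # finish_reason == "length" means generation hit max_tokens
        "truncated": choice.finish_reason == "length",
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "total_tokens": response.usage.total_tokens,
    }
```

Checking the truncated flag is a cheap way to detect that max_tokens (default 16, per the table above) silently cut a completion short.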

Usage Examples

Basic Chat Completion

from openai import OpenAI

# Point the client at a running vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# Discover the served model
models = client.models.list()
model = models.data[0].id

# Send a chat completion request
response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)

Multi-Turn Conversation

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who won the world series in 2020?"},
    {"role": "assistant",
     "content": "The Los Angeles Dodgers won the World Series in 2020."},
    {"role": "user", "content": "Where was it played?"},
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
)

print(response.choices[0].message.content)
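The signature above also accepts stream=True, in which case create() returns an iterator of chunks rather than a single response. A minimal streaming helper (a sketch, not one of the vLLM example clients; it duck-types the client, so any OpenAI-compatible client object works) could look like:

```python
def stream_chat(client, model, messages):
    """Yield text deltas from a streaming chat completion."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        # Some chunks (e.g. the initial role-only chunk) carry no text.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

Typical usage prints tokens as they arrive: for piece in stream_chat(client, model, messages): print(piece, end="", flush=True).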

Related Pages

Implements Principle

Requires Environment
