Implementation:Lm_sys_FastChat_OpenAI_Chat_Completion_Client
| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Repository | lm-sys/FastChat |
| Domain | API Client Design, Chat Completion Protocol, Streaming Consumption |
| Knowledge Sources | Source code analysis of tests/test_openai_api.py, fastchat/protocol/openai_api_protocol.py |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_OpenAI_Client_Interaction |
Overview
This is a Pattern Doc that documents the client-side interface for interacting with FastChat's OpenAI-compatible API using the OpenAI Python SDK and cURL. Because FastChat implements the OpenAI REST API specification, clients use the standard openai Python package with a custom base_url pointing to the FastChat server. This page provides concrete examples for chat completions (streaming and non-streaming), text completions, embeddings, and model listing.
Description
The OpenAI Chat Completion Client pattern demonstrates how to interact with FastChat as a drop-in replacement for the OpenAI API. The key configuration change is setting base_url to the FastChat server's address (e.g., http://localhost:8000/v1/) and api_key to any string (or a valid key if authentication is configured).
The pattern covers:
- Model listing -- Enumerate available models via openai.models.list()
- Chat completions -- Send conversation messages and receive assistant responses
- Streaming chat completions -- Receive tokens incrementally via SSE
- Text completions -- Prompt-based text generation with logprobs support
- Embeddings -- Compute vector representations of text
- cURL equivalents -- Raw HTTP requests for non-Python clients
All request and response formats match the OpenAI API specification exactly.
Usage
Install the required package:
pip install openai
Configure the client:
import openai
openai.api_key = "EMPTY" # Or a configured API key
openai.base_url = "http://localhost:8000/v1/"
Code Reference
Source Location
| Component | File | Lines |
|---|---|---|
| Test examples (all client patterns) | tests/test_openai_api.py | L1-149 |
| ChatCompletionRequest schema | fastchat/protocol/openai_api_protocol.py | L58-74 |
| ChatCompletionResponse schema | fastchat/protocol/openai_api_protocol.py | L88-94 |
| ChatMessage schema | fastchat/protocol/openai_api_protocol.py | L77-79 |
| UsageInfo schema | fastchat/protocol/openai_api_protocol.py | L45-48 |
| CompletionRequest schema | fastchat/protocol/openai_api_protocol.py | L151-168 |
| EmbeddingsRequest schema | fastchat/protocol/openai_api_protocol.py | L136-141 |
Signature
The client interface is provided by the openai Python package. Key methods:
# Chat completions
openai.chat.completions.create(
model: str,
messages: List[Dict[str, str]], # [{"role": "user", "content": "..."}]
temperature: float = 0.7,
top_p: float = 1.0,
max_tokens: Optional[int] = None,
stream: bool = False,
stop: Optional[Union[str, List[str]]] = None,
n: int = 1,
presence_penalty: float = 0.0,
frequency_penalty: float = 0.0,
) -> ChatCompletion | Stream[ChatCompletionChunk]
# Text completions
openai.completions.create(
model: str,
prompt: str,
max_tokens: int = 16,
temperature: float = 0.7,
top_p: float = 1.0,
logprobs: Optional[int] = None,
echo: bool = False,
stream: bool = False,
stop: Optional[Union[str, List[str]]] = None,
) -> Completion | Stream[Completion]
# Embeddings
openai.embeddings.create(
model: str,
input: Union[str, List[str]],
) -> CreateEmbeddingResponse
# Model listing
openai.models.list() -> SyncPage[Model]
Import
import openai
I/O Contract
Client Configuration
| Parameter | Value | Description |
|---|---|---|
| openai.api_key | "EMPTY" or valid key | API key for authentication (required by the SDK; use any string if auth is disabled) |
| openai.base_url | "http://localhost:8000/v1/" | Base URL pointing to the FastChat API server |
Chat Completion Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | (required) | Model identifier (e.g., "vicuna-7b-v1.5") |
| messages | List[Dict] | (required) | Conversation messages, each with role and content |
| temperature | float | 0.7 | Sampling temperature (0 = greedy, higher = more random) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| max_tokens | int | None | Maximum tokens to generate |
| stream | bool | False | Enable streaming SSE response |
| stop | str or List[str] | None | Stop sequence(s) |
| n | int | 1 | Number of completions to generate |
| presence_penalty | float | 0.0 | Penalize tokens based on presence in the text so far |
| frequency_penalty | float | 0.0 | Penalize tokens based on frequency in the text so far |
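These parameters serialize to the JSON body of a POST to /v1/chat/completions. A stdlib-only sketch of assembling that body with the defaults from the table above (build_chat_request is an illustrative helper, not part of the SDK; the authoritative wire format is ChatCompletionRequest in fastchat/protocol/openai_api_protocol.py):

```python
import json

def build_chat_request(model, messages, **overrides):
    """Assemble a chat completion request body using the table's defaults."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": 0.7,
        "top_p": 1.0,
        "max_tokens": None,
        "stream": False,
        "stop": None,
        "n": 1,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
    }
    payload.update(overrides)
    # Drop unset optional fields so the server applies its own defaults.
    return {k: v for k, v in payload.items() if v is not None}

body = build_chat_request(
    "vicuna-7b-v1.5",
    [{"role": "user", "content": "Hello!"}],
    temperature=0,
)
print(json.dumps(body, indent=2))
```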
Chat Completion Response Structure
| Field | Type | Description |
|---|---|---|
| id | str | Unique ID (e.g., "chatcmpl-abc123") |
| object | str | "chat.completion" |
| created | int | Unix timestamp of creation |
| model | str | Model used for generation |
| choices | List | Each choice has: index (int), message ({"role": "assistant", "content": str}), finish_reason ("stop" or "length") |
| usage | Dict | {"prompt_tokens": int, "completion_tokens": int, "total_tokens": int} |
Streaming Chunk Structure
| Field | Type | Description |
|---|---|---|
| id | str | Same ID across all chunks in a stream |
| object | str | "chat.completion.chunk" |
| choices | List | Each has: index, delta ({"role": "assistant"} or {"content": str}), finish_reason |
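For non-Python clients, the stream arrives as raw SSE lines of the form data: {...}, terminated by data: [DONE]. A minimal stdlib-only sketch of reassembling the assistant message from such lines (the sample payloads below are illustrative, not captured server output):

```python
import json

# Illustrative SSE lines; each chunk follows the structure in the table above.
sse_lines = [
    'data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk", '
    '"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}',
    'data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk", '
    '"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": null}]}',
    'data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk", '
    '"choices": [{"index": 0, "delta": {"content": " there!"}, "finish_reason": "stop"}]}',
    "data: [DONE]",
]

def reassemble(lines):
    """Concatenate the content deltas from a chat.completion.chunk stream."""
    text = []
    for line in lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

print(reassemble(sse_lines))  # Hello there!
```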
Usage Examples
List Available Models
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
model_list = openai.models.list()
names = [x.id for x in model_list.data]
print(f"Available models: {names}")
Chat Completion (Non-Streaming)
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
completion = openai.chat.completions.create(
model="vicuna-7b-v1.5",
messages=[{"role": "user", "content": "Hello! What is your name?"}],
temperature=0,
)
print(completion.choices[0].message.content)
# Output: "Hello! I am Vicuna, a language model..."
# Access usage information
print(f"Prompt tokens: {completion.usage.prompt_tokens}")
print(f"Completion tokens: {completion.usage.completion_tokens}")
print(f"Total tokens: {completion.usage.total_tokens}")
Chat Completion (Streaming)
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
messages = [{"role": "user", "content": "Hello! What is your name?"}]
response = openai.chat.completions.create(
model="vicuna-7b-v1.5",
messages=messages,
stream=True,
temperature=0,
)
for chunk in response:
content = chunk.choices[0].delta.content
if content is not None:
print(content, end="", flush=True)
print()
Text Completion with Logprobs
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
completion = openai.completions.create(
model="vicuna-7b-v1.5",
prompt="Once upon a time",
logprobs=1,
max_tokens=64,
temperature=0,
)
print(f"Generated text: Once upon a time{completion.choices[0].text}")
if completion.choices[0].logprobs is not None:
print(f"Token logprobs: {completion.choices[0].logprobs.token_logprobs[:10]}")
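The token_logprobs list also lets you score the generated sequence. A stdlib sketch computing the average log-probability and perplexity (the values here are made up for illustration, not real model output):

```python
import math

# Hypothetical per-token log-probabilities, as returned in
# completion.choices[0].logprobs.token_logprobs.
token_logprobs = [-0.10, -1.20, -0.35, -0.05]

avg_logprob = sum(token_logprobs) / len(token_logprobs)
perplexity = math.exp(-avg_logprob)  # lower = model was more confident
print(f"avg logprob: {avg_logprob:.3f}, perplexity: {perplexity:.3f}")
```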
Streaming Text Completion
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
response = openai.completions.create(
model="vicuna-7b-v1.5",
prompt="Once upon a time",
max_tokens=64,
stream=True,
temperature=0,
)
print("Once upon a time", end="")
for chunk in response:
content = chunk.choices[0].text
print(content, end="", flush=True)
print()
Embeddings
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
embedding = openai.embeddings.create(
model="vicuna-7b-v1.5",
input="Hello world!",
)
print(f"Embedding dimension: {len(embedding.data[0].embedding)}")
print(f"First 5 values: {embedding.data[0].embedding[:5]}")
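A common follow-up once you have embeddings is comparing them with cosine similarity. A stdlib-only sketch (the toy vectors stand in for embedding.data[i].embedding values from real requests):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [0.1, 0.3, 0.5]
v2 = [0.1, 0.3, 0.5]
v3 = [0.5, -0.3, 0.1]
print(cosine_similarity(v1, v2))  # ~1.0 (identical vectors)
print(cosine_similarity(v1, v3))  # lower (dissimilar vectors)
```

Note that the input parameter also accepts a list of strings, so a batch of texts can be embedded in a single request.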
cURL Examples
List models:
curl http://localhost:8000/v1/models
Chat completion:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"messages": [{"role": "user", "content": "Hello! What is your name?"}]
}'
Text completion:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"prompt": "Once upon a time",
"max_tokens": 41,
"temperature": 0.5
}'
Embeddings:
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"input": "Hello world!"
}'
With API key authentication:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-your-api-key" \
-d '{
"model": "vicuna-7b-v1.5",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Multi-Turn Conversation
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
]
# First turn
response = openai.chat.completions.create(
model="vicuna-7b-v1.5",
messages=messages,
temperature=0,
)
assistant_reply = response.choices[0].message.content
print(f"Assistant: {assistant_reply}")
# Second turn -- include history
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "And what is its population?"})
response = openai.chat.completions.create(
model="vicuna-7b-v1.5",
messages=messages,
temperature=0,
)
print(f"Assistant: {response.choices[0].message.content}")
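The append-reply-then-ask pattern above generalizes to a small history helper. A sketch (chat_turn and the send callable are illustrative names, not part of the OpenAI SDK; in real use send would wrap openai.chat.completions.create):

```python
def chat_turn(history, user_text, send):
    """Append a user message, fetch the reply via `send`, record it, return it."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Stub transport so the sketch runs without a server.
def fake_send(history):
    return f"(reply to: {history[-1]['content']})"

history = [{"role": "system", "content": "You are a helpful assistant."}]
print(chat_turn(history, "What is the capital of France?", fake_send))
print(chat_turn(history, "And what is its population?", fake_send))
print(len(history))  # 5 messages: system + 2 user + 2 assistant
```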
Related Pages
- Principle:Lm_sys_FastChat_OpenAI_Client_Interaction -- The principle this pattern document illustrates
- Implementation:Lm_sys_FastChat_OpenAI_API_Server -- The server that handles these client requests
- Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving -- Server-side API compatibility principle
- Environment:Lm_sys_FastChat_API_Keys_And_Credentials