Implementation:Lm_sys_FastChat_OpenAI_Chat_Completion_Client
| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Repository | lm-sys/FastChat |
| Domain | API Client Design, Chat Completion Protocol, Streaming Consumption |
| Knowledge Sources | Source code analysis of tests/test_openai_api.py, fastchat/protocol/openai_api_protocol.py |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_OpenAI_Client_Interaction |
Overview
This is a Pattern Doc that documents the client-side interface for interacting with FastChat's OpenAI-compatible API using the OpenAI Python SDK and cURL. Because FastChat implements the OpenAI REST API specification, clients use the standard openai Python package with a custom base_url pointing to the FastChat server. This page provides concrete examples for chat completions (streaming and non-streaming), text completions, embeddings, and model listing.
Description
The OpenAI Chat Completion Client pattern demonstrates how to interact with FastChat as a drop-in replacement for the OpenAI API. The key configuration change is setting base_url to the FastChat server's address (e.g., http://localhost:8000/v1/) and api_key to any string (or a valid key if authentication is configured).
The pattern covers:
- Model listing -- Enumerate available models via openai.models.list()
- Chat completions -- Send conversation messages and receive assistant responses
- Streaming chat completions -- Receive tokens incrementally via SSE
- Text completions -- Prompt-based text generation with logprobs support
- Embeddings -- Compute vector representations of text
- cURL equivalents -- Raw HTTP requests for non-Python clients
All request and response formats match the OpenAI API specification exactly.
Usage
Install the required package:
pip install openai
Configure the client:
import openai
openai.api_key = "EMPTY" # Or a configured API key
openai.base_url = "http://localhost:8000/v1/"
Code Reference
Source Location
| Component | File | Lines |
|---|---|---|
| Test examples (all client patterns) | tests/test_openai_api.py | L1-149 |
| ChatCompletionRequest schema | fastchat/protocol/openai_api_protocol.py | L58-74 |
| ChatCompletionResponse schema | fastchat/protocol/openai_api_protocol.py | L88-94 |
| ChatMessage schema | fastchat/protocol/openai_api_protocol.py | L77-79 |
| UsageInfo schema | fastchat/protocol/openai_api_protocol.py | L45-48 |
| CompletionRequest schema | fastchat/protocol/openai_api_protocol.py | L151-168 |
| EmbeddingsRequest schema | fastchat/protocol/openai_api_protocol.py | L136-141 |
Signature
The client interface is provided by the openai Python package. Key methods:
# Chat completions
openai.chat.completions.create(
model: str,
messages: List[Dict[str, str]], # [{"role": "user", "content": "..."}]
temperature: float = 0.7,
top_p: float = 1.0,
max_tokens: Optional[int] = None,
stream: bool = False,
stop: Optional[Union[str, List[str]]] = None,
n: int = 1,
presence_penalty: float = 0.0,
frequency_penalty: float = 0.0,
) -> ChatCompletion | Stream[ChatCompletionChunk]
# Text completions
openai.completions.create(
model: str,
prompt: str,
max_tokens: int = 16,
temperature: float = 0.7,
top_p: float = 1.0,
logprobs: Optional[int] = None,
echo: bool = False,
stream: bool = False,
stop: Optional[Union[str, List[str]]] = None,
) -> Completion | Stream[Completion]
# Embeddings
openai.embeddings.create(
model: str,
input: Union[str, List[str]],
) -> CreateEmbeddingResponse
# Model listing
openai.models.list() -> SyncPage[Model]
Import
import openai
I/O Contract
Client Configuration
| Parameter | Value | Description |
|---|---|---|
| openai.api_key | "EMPTY" or valid key | API key for authentication (required by the SDK; use any string if auth is disabled) |
| openai.base_url | "http://localhost:8000/v1/" | Base URL pointing to the FastChat API server |
Chat Completion Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | (required) | Model identifier (e.g., "vicuna-7b-v1.5") |
| messages | List[Dict] | (required) | Conversation messages, each with role and content |
| temperature | float | 0.7 | Sampling temperature (0 = greedy, higher = more random) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| max_tokens | int | None | Maximum tokens to generate |
| stream | bool | False | Enable streaming SSE response |
| stop | str or List[str] | None | Stop sequence(s) |
| n | int | 1 | Number of completions to generate |
| presence_penalty | float | 0.0 | Penalize tokens based on presence in the text so far |
| frequency_penalty | float | 0.0 | Penalize tokens based on frequency in the text so far |
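These parameters serialize to the JSON body of a POST to /v1/chat/completions. A stdlib-only sketch of assembling that body with the defaults from the table above (build_chat_request is an illustrative helper, not part of the SDK; the authoritative wire format is ChatCompletionRequest in fastchat/protocol/openai_api_protocol.py):

```python
import json

def build_chat_request(model, messages, **overrides):
    """Assemble a chat completion request body using the table's defaults."""
    payload = {
        "model": model,
        "messages": messages,
        "temperature": 0.7,
        "top_p": 1.0,
        "max_tokens": None,
        "stream": False,
        "stop": None,
        "n": 1,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
    }
    payload.update(overrides)
    # Drop unset optional fields so the server applies its own defaults.
    return {k: v for k, v in payload.items() if v is not None}

body = build_chat_request(
    "vicuna-7b-v1.5",
    [{"role": "user", "content": "Hello!"}],
    temperature=0,
)
print(json.dumps(body, indent=2))
```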
Chat Completion Response Structure
| Field | Type | Description |
|---|---|---|
| id | str | Unique ID (e.g., "chatcmpl-abc123") |
| object | str | "chat.completion" |
| created | int | Unix timestamp of creation |
| model | str | Model used for generation |
| choices | List | Each choice has: index (int), message ({"role": "assistant", "content": str}), finish_reason ("stop" or "length") |
| usage | Dict | {"prompt_tokens": int, "completion_tokens": int, "total_tokens": int} |
Streaming Chunk Structure
| Field | Type | Description |
|---|---|---|
| id | str | Same ID across all chunks in a stream |
| object | str | "chat.completion.chunk" |
| choices | List | Each has: index, delta ({"role": "assistant"} or {"content": str}), finish_reason |
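For non-Python clients, the stream arrives as raw SSE lines of the form data: {...}, terminated by data: [DONE]. A minimal stdlib-only sketch of reassembling the assistant message from such lines (the sample payloads below are illustrative, not captured server output):

```python
import json

# Illustrative SSE lines; each chunk follows the structure in the table above.
sse_lines = [
    'data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk", '
    '"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}',
    'data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk", '
    '"choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": null}]}',
    'data: {"id": "chatcmpl-abc123", "object": "chat.completion.chunk", '
    '"choices": [{"index": 0, "delta": {"content": " there!"}, "finish_reason": "stop"}]}',
    "data: [DONE]",
]

def reassemble(lines):
    """Concatenate the content deltas from a chat.completion.chunk stream."""
    text = []
    for line in lines:
        payload = line.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

print(reassemble(sse_lines))  # Hello there!
```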
Usage Examples
List Available Models
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
model_list = openai.models.list()
names = [x.id for x in model_list.data]
print(f"Available models: {names}")
Chat Completion (Non-Streaming)
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
completion = openai.chat.completions.create(
model="vicuna-7b-v1.5",
messages=[{"role": "user", "content": "Hello! What is your name?"}],
temperature=0,
)
print(completion.choices[0].message.content)
# Output: "Hello! I am Vicuna, a language model..."
# Access usage information
print(f"Prompt tokens: {completion.usage.prompt_tokens}")
print(f"Completion tokens: {completion.usage.completion_tokens}")
print(f"Total tokens: {completion.usage.total_tokens}")
Chat Completion (Streaming)
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
messages = [{"role": "user", "content": "Hello! What is your name?"}]
response = openai.chat.completions.create(
model="vicuna-7b-v1.5",
messages=messages,
stream=True,
temperature=0,
)
for chunk in response:
content = chunk.choices[0].delta.content
if content is not None:
print(content, end="", flush=True)
print()
Text Completion with Logprobs
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
completion = openai.completions.create(
model="vicuna-7b-v1.5",
prompt="Once upon a time",
logprobs=1,
max_tokens=64,
temperature=0,
)
print(f"Generated text: Once upon a time{completion.choices[0].text}")
if completion.choices[0].logprobs is not None:
print(f"Token logprobs: {completion.choices[0].logprobs.token_logprobs[:10]}")
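The token_logprobs list also lets you score the generated sequence. A stdlib sketch computing the average log-probability and perplexity (the values here are made up for illustration, not real model output):

```python
import math

# Hypothetical per-token log-probabilities, as returned in
# completion.choices[0].logprobs.token_logprobs.
token_logprobs = [-0.10, -1.20, -0.35, -0.05]

avg_logprob = sum(token_logprobs) / len(token_logprobs)
perplexity = math.exp(-avg_logprob)  # lower = model was more confident
print(f"avg logprob: {avg_logprob:.3f}, perplexity: {perplexity:.3f}")
```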
Streaming Text Completion
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
response = openai.completions.create(
model="vicuna-7b-v1.5",
prompt="Once upon a time",
max_tokens=64,
stream=True,
temperature=0,
)
print("Once upon a time", end="")
for chunk in response:
content = chunk.choices[0].text
print(content, end="", flush=True)
print()
Embeddings
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
embedding = openai.embeddings.create(
model="vicuna-7b-v1.5",
input="Hello world!",
)
print(f"Embedding dimension: {len(embedding.data[0].embedding)}")
print(f"First 5 values: {embedding.data[0].embedding[:5]}")
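A common follow-up once you have embeddings is comparing them with cosine similarity. A stdlib-only sketch (the toy vectors stand in for embedding.data[i].embedding values from real requests):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [0.1, 0.3, 0.5]
v2 = [0.1, 0.3, 0.5]
v3 = [0.5, -0.3, 0.1]
print(cosine_similarity(v1, v2))  # ~1.0 (identical vectors)
print(cosine_similarity(v1, v3))  # lower (dissimilar vectors)
```

Note that the input parameter also accepts a list of strings, so a batch of texts can be embedded in a single request.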
cURL Examples
List models:
curl http://localhost:8000/v1/models
Chat completion:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"messages": [{"role": "user", "content": "Hello! What is your name?"}]
}'
Text completion:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"prompt": "Once upon a time",
"max_tokens": 41,
"temperature": 0.5
}'
Embeddings:
curl http://localhost:8000/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-7b-v1.5",
"input": "Hello world!"
}'
With API key authentication:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-your-api-key" \
-d '{
"model": "vicuna-7b-v1.5",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Multi-Turn Conversation
import openai
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
]
# First turn
response = openai.chat.completions.create(
model="vicuna-7b-v1.5",
messages=messages,
temperature=0,
)
assistant_reply = response.choices[0].message.content
print(f"Assistant: {assistant_reply}")
# Second turn -- include history
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "And what is its population?"})
response = openai.chat.completions.create(
model="vicuna-7b-v1.5",
messages=messages,
temperature=0,
)
print(f"Assistant: {response.choices[0].message.content}")
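The append-reply-then-ask pattern above generalizes to a small history helper. A sketch (chat_turn and the send callable are illustrative names, not part of the OpenAI SDK; in real use send would wrap openai.chat.completions.create):

```python
def chat_turn(history, user_text, send):
    """Append a user message, fetch the reply via `send`, record it, return it."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Stub transport so the sketch runs without a server.
def fake_send(history):
    return f"(reply to: {history[-1]['content']})"

history = [{"role": "system", "content": "You are a helpful assistant."}]
print(chat_turn(history, "What is the capital of France?", fake_send))
print(chat_turn(history, "And what is its population?", fake_send))
print(len(history))  # 5 messages: system + 2 user + 2 assistant
```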
Related Pages
- Principle:Lm_sys_FastChat_OpenAI_Client_Interaction -- The principle this pattern document illustrates
- Implementation:Lm_sys_FastChat_OpenAI_API_Server -- The server that handles these client requests
- Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving -- Server-side API compatibility principle
- Environment:Lm_sys_FastChat_API_Keys_And_Credentials