
Implementation:Lm sys FastChat OpenAI Chat Completion Client

From Leeroopedia


Field Value
Page Type Implementation (Pattern Doc)
Repository lm-sys/FastChat
Domain API Client Design, Chat Completion Protocol, Streaming Consumption
Knowledge Sources Source code analysis of tests/test_openai_api.py, fastchat/protocol/openai_api_protocol.py
Last Updated 2026-02-07 14:00 GMT
Implements Principle:Lm_sys_FastChat_OpenAI_Client_Interaction

Overview

This Pattern Doc documents the client-facing interface for interacting with FastChat's OpenAI-compatible API using the OpenAI Python SDK and cURL. Because FastChat implements the OpenAI REST API specification, clients can use the standard openai Python package with a custom base_url pointing at the FastChat server. This page provides concrete examples for chat completions (streaming and non-streaming), text completions, embeddings, and model listing.

Description

The OpenAI Chat Completion Client pattern demonstrates how to interact with FastChat as a drop-in replacement for the OpenAI API. The key configuration change is setting base_url to the FastChat server's address (e.g., http://localhost:8000/v1/) and api_key to any string (or a valid key if authentication is configured).

The pattern covers:

  • Model listing -- Enumerate available models via openai.models.list()
  • Chat completions -- Send conversation messages and receive assistant responses
  • Streaming chat completions -- Receive tokens incrementally via SSE
  • Text completions -- Prompt-based text generation with logprobs support
  • Embeddings -- Compute vector representations of text
  • cURL equivalents -- Raw HTTP requests for non-Python clients

All request and response formats match the OpenAI API specification exactly.

Usage

Install the required package:

pip install openai

Configure the client:

import openai

openai.api_key = "EMPTY"  # Or a configured API key
openai.base_url = "http://localhost:8000/v1/"

Code Reference

Source Location

Component File Lines
Test examples (all client patterns) tests/test_openai_api.py L1-149
ChatCompletionRequest schema fastchat/protocol/openai_api_protocol.py L58-74
ChatCompletionResponse schema fastchat/protocol/openai_api_protocol.py L88-94
ChatMessage schema fastchat/protocol/openai_api_protocol.py L77-79
UsageInfo schema fastchat/protocol/openai_api_protocol.py L45-48
CompletionRequest schema fastchat/protocol/openai_api_protocol.py L151-168
EmbeddingsRequest schema fastchat/protocol/openai_api_protocol.py L136-141

Signature

The client interface is provided by the openai Python package. Key methods:

# Chat completions
openai.chat.completions.create(
    model: str,
    messages: List[Dict[str, str]],  # [{"role": "user", "content": "..."}]
    temperature: float = 0.7,
    top_p: float = 1.0,
    max_tokens: Optional[int] = None,
    stream: bool = False,
    stop: Optional[Union[str, List[str]]] = None,
    n: int = 1,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
) -> ChatCompletion | Stream[ChatCompletionChunk]

# Text completions
openai.completions.create(
    model: str,
    prompt: str,
    max_tokens: int = 16,
    temperature: float = 0.7,
    top_p: float = 1.0,
    logprobs: Optional[int] = None,
    echo: bool = False,
    stream: bool = False,
    stop: Optional[Union[str, List[str]]] = None,
) -> Completion | Stream[Completion]

# Embeddings
openai.embeddings.create(
    model: str,
    input: Union[str, List[str]],
) -> CreateEmbeddingResponse

# Model listing
openai.models.list() -> SyncPage[Model]

Import

import openai

I/O Contract

Client Configuration

Parameter Value Description
openai.api_key "EMPTY" or valid key API key for authentication (required by SDK, use any string if auth is disabled)
openai.base_url "http://localhost:8000/v1/" Base URL pointing to FastChat API server

Chat Completion Request Parameters

Parameter Type Default Description
model str (required) Model identifier (e.g., "vicuna-7b-v1.5")
messages List[Dict] (required) Conversation messages, each with role and content
temperature float 0.7 Sampling temperature (0 = greedy, higher = more random)
top_p float 1.0 Nucleus sampling threshold
max_tokens int or None None Maximum number of tokens to generate
stream bool False Enable streaming SSE response
stop str or List[str] None Stop sequence(s)
n int 1 Number of completions to generate
presence_penalty float 0.0 Penalize tokens based on presence in text so far
frequency_penalty float 0.0 Penalize tokens based on frequency in text so far
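The parameters above map one-to-one onto the JSON request body. A minimal sketch of assembling such a payload (the model name and message content are illustrative placeholders):

```python
import json

# Build a chat completion request body from the parameters in the table above.
payload = {
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,   # default sampling temperature
    "top_p": 1.0,         # nucleus sampling threshold
    "n": 1,               # number of completions to generate
    "stream": False,      # set True for SSE streaming
}

body = json.dumps(payload)
print(body)
```

Omitted optional fields (max_tokens, stop, the penalties) fall back to the server-side defaults listed in the table.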

Chat Completion Response Structure

Field Type Description
id str Unique ID (e.g., "chatcmpl-abc123")
object str "chat.completion"
created int Unix timestamp of creation
model str Model used for generation
choices List Each choice has: index (int), message ({"role": "assistant", "content": str}), finish_reason ("stop" or "length")
usage Dict {"prompt_tokens": int, "completion_tokens": int, "total_tokens": int}
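To make the schema concrete, here is a sketch that walks a hand-written sample response (all field values are made up) using plain dict access; the SDK exposes the same structure as attribute access:

```python
# Hypothetical sample response shaped like the table above.
sample = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "vicuna-7b-v1.5",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}

# Extract the assistant reply and the token accounting.
reply = sample["choices"][0]["message"]["content"]
total = sample["usage"]["total_tokens"]
print(reply, total)
```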

Streaming Chunk Structure

Field Type Description
id str Same ID across all chunks in a stream
object str "chat.completion.chunk"
choices List Each has: index, delta ({"role": "assistant"} or {"content": str}), finish_reason
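A consumer typically reassembles the full assistant message by concatenating the content deltas. A minimal sketch over hand-written chunk dicts (the chunk contents are made up; a real stream yields SDK objects with the same shape):

```python
# Hypothetical chunk sequence per the table above: the first delta carries the
# role, subsequent deltas carry content fragments, the last sets finish_reason.
chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Hel"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "lo!"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]

parts = []
for chunk in chunks:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:  # role-only and empty final deltas carry no text
        parts.append(delta["content"])

message = "".join(parts)
print(message)
```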

Usage Examples

List Available Models

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

model_list = openai.models.list()
names = [x.id for x in model_list.data]
print(f"Available models: {names}")

Chat Completion (Non-Streaming)

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

completion = openai.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
    temperature=0,
)
print(completion.choices[0].message.content)
# Example output: "Hello! I am Vicuna, a language model..."

# Access usage information
print(f"Prompt tokens: {completion.usage.prompt_tokens}")
print(f"Completion tokens: {completion.usage.completion_tokens}")
print(f"Total tokens: {completion.usage.total_tokens}")

Chat Completion (Streaming)

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

messages = [{"role": "user", "content": "Hello! What is your name?"}]
response = openai.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=messages,
    stream=True,
    temperature=0,
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
print()

Text Completion with Logprobs

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

completion = openai.completions.create(
    model="vicuna-7b-v1.5",
    prompt="Once upon a time",
    logprobs=1,
    max_tokens=64,
    temperature=0,
)

print(f"Generated text: Once upon a time{completion.choices[0].text}")
if completion.choices[0].logprobs is not None:
    print(f"Token logprobs: {completion.choices[0].logprobs.token_logprobs[:10]}")

Streaming Text Completion

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

response = openai.completions.create(
    model="vicuna-7b-v1.5",
    prompt="Once upon a time",
    max_tokens=64,
    stream=True,
    temperature=0,
)

print("Once upon a time", end="")
for chunk in response:
    content = chunk.choices[0].text
    print(content, end="", flush=True)
print()

Embeddings

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

embedding = openai.embeddings.create(
    model="vicuna-7b-v1.5",
    input="Hello world!",
)

print(f"Embedding dimension: {len(embedding.data[0].embedding)}")
print(f"First 5 values: {embedding.data[0].embedding[:5]}")
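Embedding vectors are commonly compared with cosine similarity. A stdlib-only sketch (the vectors below are made up; in practice they come from embedding.data[i].embedding):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors standing in for embedding.data[i].embedding lists.
v1 = [0.1, 0.3, -0.2, 0.5]
v2 = [0.1, 0.3, -0.2, 0.5]
v3 = [-0.5, 0.2, 0.1, -0.3]

print(cosine_similarity(v1, v2))  # identical vectors score 1.0
print(cosine_similarity(v1, v3))
```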

cURL Examples

List models:

curl http://localhost:8000/v1/models

Chat completion:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}]
  }'

Text completion:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "prompt": "Once upon a time",
    "max_tokens": 41,
    "temperature": 0.5
  }'

Embeddings:

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "input": "Hello world!"
  }'

With API key authentication:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
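For Python environments without the openai package, the same raw HTTP request shown in the cURL examples can be built with the standard library. This sketch only constructs the request; actually sending it requires a FastChat server running at localhost:8000:

```python
import json
import urllib.request

payload = {
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# To send (with the server running):
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(result["choices"][0]["message"]["content"])
print(req.full_url, req.get_method())
```

Add an `Authorization: Bearer ...` header to the headers dict if API key authentication is configured.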

Multi-Turn Conversation

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:8000/v1/"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# First turn
response = openai.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=messages,
    temperature=0,
)

assistant_reply = response.choices[0].message.content
print(f"Assistant: {assistant_reply}")

# Second turn -- include history
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "And what is its population?"})

response = openai.chat.completions.create(
    model="vicuna-7b-v1.5",
    messages=messages,
    temperature=0,
)

print(f"Assistant: {response.choices[0].message.content}")
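The append-and-resend pattern above can be wrapped in a small helper that owns the history. This is a hypothetical sketch, not part of FastChat: send_fn stands in for any callable mapping a message list to a reply string (e.g. a wrapper around openai.chat.completions.create); here a stub backend is used so the demo runs offline.

```python
class ChatSession:
    """Maintains message history for multi-turn chat.

    send_fn: callable taking a message list and returning the assistant's
    reply as a string (e.g. wrapping openai.chat.completions.create).
    """

    def __init__(self, send_fn, system_prompt=None):
        self.send_fn = send_fn
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text):
        # Append the user turn, get a reply, record it, and return it.
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send_fn(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Demo with a stub backend that reports the number of user turns so far.
session = ChatSession(
    lambda msgs: f"reply #{sum(m['role'] == 'user' for m in msgs)}",
    system_prompt="You are a helpful assistant.",
)
print(session.ask("What is the capital of France?"))
print(session.ask("And what is its population?"))
```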

Related Pages

Principle:Lm_sys_FastChat_OpenAI_Client_Interaction