Principle:Lm_sys_FastChat_OpenAI_Client_Interaction
| Field | Value |
|---|---|
| Page Type | Principle |
| Repository | lm-sys/FastChat |
| Domain | API Client Design, Chat Completion Protocol, Streaming Consumption |
| Knowledge Sources | Source code analysis of tests/test_openai_api.py, fastchat/protocol/openai_api_protocol.py |
| Last Updated | 2026-02-07 14:00 GMT |
| Implemented By | Implementation:Lm_sys_FastChat_OpenAI_Chat_Completion_Client |
Overview
OpenAI Client Interaction is the principle governing how client applications interact with FastChat's OpenAI-compatible API server. Because FastChat faithfully implements the OpenAI REST API specification, clients can use the official OpenAI Python SDK (or any HTTP client) to communicate with FastChat, treating it as a drop-in replacement for the OpenAI service. This principle covers the chat completion message format, streaming consumption via Server-Sent Events, error handling, and token usage tracking.
Description
Drop-In Replacement Concept
The fundamental idea behind OpenAI Client Interaction in FastChat is that any application using the OpenAI API can switch to a self-hosted FastChat backend by changing only two configuration values:
- `base_url` -- Point to the FastChat API server instead of https://api.openai.com/v1
- `api_key` -- Set to any string (or a configured key if API key authentication is enabled)
No other code changes are required. The request and response schemas, streaming protocol, and error formats are identical. This makes FastChat suitable for:
- Local development and testing without API costs
- Air-gapped deployments where cloud API access is not available
- Privacy-sensitive use cases where data cannot leave the organization
- Research and experimentation with open-source models
Chat Completion Format
The chat completion interface uses a message-based format where each message has:
- `role` -- One of "system", "user", or "assistant"
- `content` -- The text content of the message
Messages are provided as an ordered list representing the conversation history. The system message (optional) sets the assistant's behavior. User messages contain the human input. Assistant messages represent previous model outputs (for multi-turn conversations).
The server applies the model's conversation template to these messages, converting the uniform format into model-specific prompt formatting (e.g., Vicuna's USER: ... ASSISTANT: ... format, or Llama-2's [INST] ... [/INST] format).
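As an illustration of that conversion, here is a stripped-down sketch of a Vicuna-style template; FastChat's real conversation templates (in fastchat/conversation.py) handle many more details such as separators and stop strings:

```python
def to_vicuna_prompt(messages):
    """Sketch: flatten role-tagged messages into Vicuna's USER/ASSISTANT format.

    Illustration only -- not FastChat's actual template code.
    """
    system = ""
    turns = []
    for msg in messages:
        if msg["role"] == "system":
            system = msg["content"] + " "
        elif msg["role"] == "user":
            turns.append("USER: " + msg["content"])
        else:  # assistant (previous model output in a multi-turn conversation)
            turns.append("ASSISTANT: " + msg["content"])
    # The trailing "ASSISTANT:" cues the model to generate the next reply.
    return system + " ".join(turns) + " ASSISTANT:"

prompt = to_vicuna_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
# prompt == "You are a helpful assistant. USER: Hello! ASSISTANT:"
```

The key point is that the client always sends the same uniform message list; the server owns the model-specific prompt engineering.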
Streaming with SSE
When stream=true is set, the response arrives as a sequence of Server-Sent Events. Each event contains a JSON chunk with a `delta` field instead of the `message` field:
- The first chunk contains `delta: {"role": "assistant"}` (no content)
- Subsequent chunks contain `delta: {"content": "token text"}`
- The final chunk has a non-null `finish_reason` ("stop" or "length")
- The stream ends with `data: [DONE]`
Clients iterate over these chunks to display tokens as they arrive. The OpenAI Python SDK handles SSE parsing automatically, exposing chunks as iterable objects.
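The chunk sequence above can also be consumed with plain SSE parsing. A minimal sketch, assuming each input line is a raw `data:` event line from the stream:

```python
import json

def collect_stream(sse_lines):
    """Assemble the full reply text from OpenAI-style SSE 'data:' lines."""
    pieces = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # stream-termination sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        pieces.append(delta.get("content", ""))  # first chunk has no content
    return "".join(pieces)

reply = collect_stream([
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    'data: [DONE]',
])
# reply == "Hello"
```

In practice a real client would display each piece as it arrives rather than joining at the end; the SDK's iterable chunks make that loop equally simple.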
Error Handling
The API returns structured error responses that match the OpenAI error format:
- Invalid model -- 400 status with message indicating which models are available
- Parameter out of range -- 400 status with specific parameter validation error
- Context overflow -- 400 status when the prompt exceeds the model's context length
- Invalid API key -- 401 status with the `invalid_api_key` error code
- No available worker -- Internal error when no worker can serve the requested model
Clients should handle these errors gracefully, typically by checking the response status code and parsing the error body.
Token Usage Tracking
Every non-streaming response includes a usage object with:
- `prompt_tokens` -- Number of tokens in the input prompt
- `completion_tokens` -- Number of tokens generated
- `total_tokens` -- Sum of prompt and completion tokens
This enables clients to track token consumption for cost estimation, context window management, and performance monitoring. Note that in streaming mode, usage information is not included in individual chunks (consistent with OpenAI's behavior).
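For example, a client-side running tally across calls (field names as in the `usage` object above) might look like:

```python
def add_usage(totals, usage):
    """Accumulate one response's usage into a running total (simple sketch)."""
    for field in ("prompt_tokens", "completion_tokens", "total_tokens"):
        totals[field] = totals.get(field, 0) + usage.get(field, 0)
    return totals

totals = {}
add_usage(totals, {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42})
add_usage(totals, {"prompt_tokens": 8, "completion_tokens": 10, "total_tokens": 18})
# totals == {"prompt_tokens": 20, "completion_tokens": 40, "total_tokens": 60}
```

With the SDK, the per-response values come from `response.usage.prompt_tokens` and friends; the same tally works either way.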
Usage
Client interaction with FastChat follows the standard OpenAI SDK patterns:
- Install the OpenAI Python package: `pip install openai`
- Configure the client to point to the FastChat server
- Use the same API calls as for OpenAI (chat completions, completions, embeddings, model listing)
Both synchronous and asynchronous clients are supported. The cURL command-line tool can also be used for testing and simple integrations.
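Whichever client is used, the request body is the same. A sketch of the minimal chat-completion payload (the model name here is an assumption; query GET /v1/models on the server for the real list):

```python
import json

# Minimal chat-completion request body -- identical whether sent with the
# OpenAI SDK, cURL, or any other HTTP client, POSTed to /v1/chat/completions.
payload = {
    "model": "vicuna-7b-v1.5",  # assumed model name
    "messages": [
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "stream": False,  # set True to receive SSE chunks instead
}
body = json.dumps(payload)
```

With cURL, the same `body` would be sent via `-d` along with a `Content-Type: application/json` header.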
Theoretical Basis
- API Compatibility as Migration Strategy -- By providing API-level compatibility, FastChat eliminates the switching cost for applications migrating from cloud to self-hosted inference. This follows the industry pattern of "wire-compatible" alternatives (e.g., MinIO for S3, CockroachDB for PostgreSQL).
- Message-Based Chat Protocol -- The role-tagged message format originated with OpenAI's ChatGPT API and has become the de facto standard for conversational AI interfaces. The format cleanly separates conversation structure from model-specific prompt engineering.
- Server-Sent Events for Streaming -- SSE provides a simple, HTTP-native streaming mechanism. Unlike WebSockets, SSE requires no connection upgrade, works through HTTP proxies, and supports automatic browser reconnection. The `data: [DONE]` sentinel follows the convention established by OpenAI.
- Structured Error Responses -- Returning machine-readable error objects (with code, message, and type fields) enables programmatic error handling in client applications, following REST API best practices.
Related Pages
- Implementation:Lm_sys_FastChat_OpenAI_Chat_Completion_Client -- Concrete code examples for interacting with the API
- Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving -- The server-side principle that makes this client interaction possible
- Implementation:Lm_sys_FastChat_OpenAI_API_Server -- The server implementation that handles these client requests