Principle: lm-sys/FastChat OpenAI-Compatible API Serving
| Field | Value |
|---|---|
| Page Type | Principle |
| Repository | lm-sys/FastChat |
| Domain | REST API Design, API Compatibility, Streaming Protocols |
| Knowledge Sources | Source code analysis of fastchat/serve/openai_api_server.py, fastchat/protocol/openai_api_protocol.py |
| Last Updated | 2026-02-07 14:00 GMT |
| Implemented By | Implementation:Lm_sys_FastChat_OpenAI_API_Server |
Overview
OpenAI-Compatible API Serving is the principle of providing a self-hosted REST API that mirrors the OpenAI API specification, enabling existing applications built for OpenAI's services to work with locally hosted language models with minimal or no code changes. FastChat implements this compatibility layer as a FastAPI server that translates OpenAI-format requests into internal generation parameters, routes them through the controller to model workers, and formats responses to match the OpenAI response schema. This principle makes self-hosted LLM inference a drop-in replacement for cloud-based API services.
Description
API Compatibility Layer
The OpenAI-compatible API server exposes endpoints that mirror the structure and semantics of the OpenAI REST API:
- `/v1/chat/completions` -- Chat-style completions with message history (system, user, assistant roles). This is the primary endpoint for conversational AI applications.
- `/v1/completions` -- Text completion from a prompt string. Supports echo, logprobs, and best-of parameters.
- `/v1/models` -- List available models. Returns model cards in OpenAI format.
- `/v1/embeddings` -- Compute text embeddings for semantic similarity, search, and clustering tasks.
Each endpoint accepts request bodies that conform to the OpenAI API schema (e.g., ChatCompletionRequest with model, messages, temperature, top_p, max_tokens, stream, etc.) and returns responses in the corresponding OpenAI format (e.g., ChatCompletionResponse with id, choices, usage).
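The request/response shapes can be sketched with plain dataclasses. This is an illustrative subset of the schema only: the actual server defines richer pydantic models (with validation) in fastchat/protocol/openai_api_protocol.py, and the field defaults shown here are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ChatMessage:
    role: str          # "system", "user", or "assistant"
    content: str

@dataclass
class ChatCompletionRequest:
    # Subset of OpenAI-compatible request fields
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 1.0
    max_tokens: Optional[int] = None
    stream: bool = False

@dataclass
class UsageInfo:
    # Token accounting returned in the "usage" field of responses
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

# An OpenAI-style request body maps directly onto this shape:
req = ChatCompletionRequest(
    model="vicuna-7b-v1.5",
    messages=[ChatMessage(role="user", content="Hello")],
)
```

Because the field names match the OpenAI schema, a JSON request body deserializes into this structure without any renaming.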
SSE Streaming
For streaming responses, the server uses Server-Sent Events (SSE) following the same protocol as OpenAI's streaming API. When stream=true is set in the request:
- The server returns a `text/event-stream` response
- Each event is a line prefixed with `data: ` containing a JSON chunk
- Chat completion chunks use the `delta` field (with role in the first chunk, content in subsequent chunks)
- The stream terminates with `data: [DONE]`
This streaming protocol enables real-time token display in client applications, providing a responsive user experience even for long generations.
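The chunk sequence described above can be sketched as a generator that yields SSE events in the OpenAI streaming shape. The `id` value and chunking granularity below are illustrative assumptions, not the server's actual behavior.

```python
import json

def sse_chunks(model, text, chunk_size=4):
    """Yield Server-Sent Events mirroring OpenAI's streaming format:
    a role-only delta first, then content deltas, then [DONE]."""
    def event(delta, finish_reason=None):
        payload = {
            "id": "chatcmpl-demo",  # illustrative id
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": delta,
                         "finish_reason": finish_reason}],
        }
        return f"data: {json.dumps(payload)}\n\n"

    yield event({"role": "assistant"})             # first chunk carries the role
    for i in range(0, len(text), chunk_size):      # subsequent chunks carry content
        yield event({"content": text[i:i + chunk_size]})
    yield event({}, finish_reason="stop")          # final chunk sets finish_reason
    yield "data: [DONE]\n\n"                       # stream terminator
```

A client reassembles the completion by concatenating the `content` fragments from each `delta` until it sees the `[DONE]` sentinel.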
CORS Configuration
The server supports configurable Cross-Origin Resource Sharing (CORS) middleware, enabling browser-based JavaScript applications to directly call the API. CORS settings are specified at startup:
- Allowed origins -- Which domains can make requests (default: all)
- Allowed methods -- Which HTTP methods are permitted (default: all)
- Allowed headers -- Which headers are accepted (default: all)
- Allow credentials -- Whether cookies and auth headers are forwarded
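The startup-flag-to-middleware mapping can be sketched as below. The flag names and the keyword-argument names (which follow Starlette's CORSMiddleware convention) are assumptions for illustration.

```python
import argparse

def build_cors_settings(argv=None):
    """Parse CORS-related startup flags (names illustrative) into
    keyword arguments for a CORS middleware."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--allowed-origins", nargs="+", default=["*"])
    parser.add_argument("--allowed-methods", nargs="+", default=["*"])
    parser.add_argument("--allowed-headers", nargs="+", default=["*"])
    parser.add_argument("--allow-credentials", action="store_true")
    args = parser.parse_args(argv)
    return {
        "allow_origins": args.allowed_origins,
        "allow_methods": args.allowed_methods,
        "allow_headers": args.allowed_headers,
        "allow_credentials": args.allow_credentials,
    }

# With no flags, everything is allowed and credentials are off,
# matching the defaults listed above.
settings = build_cors_settings([])
```

In a FastAPI app these settings would be passed straight to `app.add_middleware(CORSMiddleware, **settings)`.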
API Key Authentication
The server supports optional API key authentication via Bearer tokens. When API keys are configured:
- All protected endpoints require an `Authorization: Bearer <key>` header
- Requests without a valid key receive a 401 Unauthorized response with an OpenAI-format error body
- When no API keys are configured, all requests are allowed (open access)
This enables basic access control for deployments exposed to untrusted networks.
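The check described above reduces to a small pure function. The error-body fields below follow the general OpenAI error shape but are illustrative, not the server's exact strings.

```python
def check_api_key(auth_header, configured_keys):
    """Return (ok, error_body) following the policy above: open access
    when no keys are configured, otherwise require a known Bearer token."""
    if not configured_keys:
        return True, None                       # open access
    if auth_header and auth_header.startswith("Bearer "):
        token = auth_header[len("Bearer "):]
        if token in configured_keys:
            return True, None                   # valid key
    # Body for the 401 Unauthorized response (illustrative wording)
    return False, {
        "error": {
            "message": "Invalid API key",
            "type": "invalid_request_error",
            "code": 401,
        }
    }
```

In FastAPI this logic would typically live in a dependency so every protected route shares it.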
Request Routing Through Controller to Workers
The API server does not perform inference directly. Instead, it follows a multi-step routing process for each request:
- Model validation -- Check with the controller that the requested model exists
- Worker address resolution -- Query the controller for an available worker address via `/get_worker_address`
- Parameter construction -- Translate OpenAI request parameters into FastChat's internal generation parameters, including conversation template application
- Context length validation -- Check with the worker that the prompt fits within the model's context window
- Request forwarding -- Send the generation request to the worker and stream or collect the response
- Response formatting -- Wrap the worker's output in OpenAI-format response objects
This separation of concerns allows the API server to be stateless and horizontally scalable, while the controller handles load balancing and the workers handle computation.
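The six routing steps can be walked through with stub controller and worker objects standing in for the HTTP services. Everything here is a sketch: the real server makes async HTTP calls, applies the model's conversation template in step 3, and measures context length in tokens rather than characters.

```python
class StubController:
    """Stand-in for the controller service (real calls are HTTP)."""
    def list_models(self):
        return ["vicuna-7b-v1.5"]
    def get_worker_address(self, model):
        return "http://worker:21002"

class StubWorker:
    """Stand-in for a model worker (real calls are HTTP)."""
    def context_length(self, addr):
        return 4096
    def generate(self, addr, gen_params):
        return "Hello!"

def handle_chat_completion(req, controller, worker):
    # 1. Model validation
    if req["model"] not in controller.list_models():
        return {"error": {"message": f"model {req['model']} not found"}}
    # 2. Worker address resolution (the controller's /get_worker_address)
    addr = controller.get_worker_address(req["model"])
    # 3. Parameter construction (the real server applies the model's
    #    conversation template here instead of joining raw contents)
    prompt = "\n".join(m["content"] for m in req["messages"])
    gen_params = {
        "prompt": prompt,
        "temperature": req.get("temperature", 0.7),
        "max_new_tokens": req.get("max_tokens", 256),
    }
    # 4. Context length validation (real code counts tokens, not chars)
    if len(prompt) > worker.context_length(addr):
        return {"error": {"message": "context length exceeded"}}
    # 5. Request forwarding
    text = worker.generate(addr, gen_params)
    # 6. Response formatting in the OpenAI shape
    return {
        "object": "chat.completion",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }

resp = handle_chat_completion(
    {"model": "vicuna-7b-v1.5",
     "messages": [{"role": "user", "content": "Hi"}]},
    StubController(), StubWorker(),
)
```

Injecting the controller and worker as interfaces is what keeps the API server itself stateless.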
Conversation Template Application
A key aspect of the compatibility layer is translating OpenAI-style message arrays into model-specific prompt formats. The API server:
- Retrieves the conversation template from the worker for the requested model
- Applies system messages, user messages, and assistant messages to the template
- Generates the final prompt string with appropriate separators and role markers
- Handles multimodal content (images) by extracting image URLs and formatting them for vision models
This ensures that each model receives input in its expected format, regardless of the uniform OpenAI request structure.
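A minimal sketch of this translation is shown below. The separators and role markers are invented for illustration; each real model's template (as defined in FastChat's conversation module) differs.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ConversationTemplate:
    """Minimal stand-in for a model's conversation template;
    the markers below are illustrative, not any real model's format."""
    system_prefix: str = "SYSTEM: "
    roles: Tuple[str, str] = ("USER", "ASSISTANT")
    sep: str = "\n"

def apply_template(tmpl, openai_messages):
    """Map an OpenAI-style message array onto the template and
    build the final prompt string."""
    parts = []
    for m in openai_messages:
        if m["role"] == "system":
            parts.append(tmpl.system_prefix + m["content"])
        elif m["role"] == "user":
            parts.append(f"{tmpl.roles[0]}: {m['content']}")
        else:  # assistant
            parts.append(f"{tmpl.roles[1]}: {m['content']}")
    # Leave an open assistant turn so the model continues from here
    parts.append(f"{tmpl.roles[1]}:")
    return tmpl.sep.join(parts)

prompt = apply_template(ConversationTemplate(), [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi"},
])
```

The trailing open assistant marker is the key trick: the model's generation is a natural continuation of the formatted conversation.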
Usage
The OpenAI-compatible API server is the primary interface for programmatic access to FastChat models. It is deployed as part of the standard three-process architecture:
- Start the controller: `python3 -m fastchat.serve.controller`
- Start model worker(s): `python3 -m fastchat.serve.model_worker --model-path <path>`
- Start the API server: `python3 -m fastchat.serve.openai_api_server --port 8000`
Applications then interact with the API server exactly as they would with OpenAI's API, simply by changing the base URL.
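For example, a client built against the OpenAI API only needs its base URL pointed at the local server. The sketch below uses the standard library rather than the `openai` client package; the model name and `EMPTY` placeholder key are assumptions.

```python
import json
import urllib.request

# The only client-side change: point at the local server instead of
# https://api.openai.com/v1 (port 8000 matches the startup command above).
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(messages, model="vicuna-7b-v1.5", api_key="EMPTY"):
    """Build an OpenAI-style chat completion request for the local server."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

def chat(messages, **kw):
    req = build_chat_request(messages, **kw)
    with urllib.request.urlopen(req) as resp:  # requires a running server
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

The same swap works with the official `openai` client by setting its `base_url` (and a dummy API key) at construction time.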
Theoretical Basis
- API Compatibility / Facade Pattern -- The server acts as a facade that presents a standardized interface (OpenAI API) while delegating to a different internal implementation. This pattern enables ecosystem reuse: any tool, library, or application built for the OpenAI API can work with FastChat without modification.
- Server-Sent Events (SSE) -- SSE is a web standard (now part of the WHATWG HTML Living Standard) for server-to-client streaming over HTTP. Unlike WebSockets, SSE is unidirectional (server to client), uses standard HTTP, works through proxies, and supports automatic reconnection. OpenAI chose SSE for streaming completions, and FastChat mirrors this choice.
- Separation of Concerns -- The three-process architecture (API server, controller, worker) separates the API interface, routing logic, and computation into independent services. This enables independent scaling and failure isolation.
- Stateless API Server -- The API server holds no inference state between requests. All model state resides in workers, and all routing state resides in the controller. This makes the API server trivially scalable behind a load balancer.
Related Pages
- Implementation:Lm_sys_FastChat_OpenAI_API_Server -- API documentation for the OpenAI API server implementation
- Principle:Lm_sys_FastChat_Worker_Dispatch_Control -- Controller dispatch that the API server relies on for worker routing
- Principle:Lm_sys_FastChat_OpenAI_Client_Interaction -- How clients interact with this API
- Principle:Lm_sys_FastChat_Model_Worker_Inference -- The worker-side inference that processes forwarded requests