Principle: lm-sys/FastChat OpenAI-Compatible API Serving
| Field | Value |
|---|---|
| Page Type | Principle |
| Repository | lm-sys/FastChat |
| Domain | REST API Design, API Compatibility, Streaming Protocols |
| Knowledge Sources | Source code analysis of fastchat/serve/openai_api_server.py, fastchat/protocol/openai_api_protocol.py |
| Last Updated | 2026-02-07 14:00 GMT |
| Implemented By | Implementation:Lm_sys_FastChat_OpenAI_API_Server |
Overview
OpenAI-Compatible API Serving is the principle of providing a self-hosted REST API that mirrors the OpenAI API specification, enabling existing applications built for OpenAI's services to work with locally hosted language models with minimal or no code changes. FastChat implements this compatibility layer as a FastAPI server that translates OpenAI-format requests into internal generation parameters, routes them through the controller to model workers, and formats responses to match the OpenAI response schema. This principle makes self-hosted LLM inference a drop-in replacement for cloud-based API services.
Description
API Compatibility Layer
The OpenAI-compatible API server exposes endpoints that mirror the structure and semantics of the OpenAI REST API:
- `/v1/chat/completions` -- Chat-style completions with message history (system, user, assistant roles). This is the primary endpoint for conversational AI applications.
- `/v1/completions` -- Text completion from a prompt string. Supports echo, logprobs, and best-of parameters.
- `/v1/models` -- List available models. Returns model cards in OpenAI format.
- `/v1/embeddings` -- Compute text embeddings for semantic similarity, search, and clustering tasks.
Each endpoint accepts request bodies that conform to the OpenAI API schema (e.g., ChatCompletionRequest with model, messages, temperature, top_p, max_tokens, stream, etc.) and returns responses in the corresponding OpenAI format (e.g., ChatCompletionResponse with id, choices, usage).
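The request/response shapes can be sketched with plain dataclasses. This is an illustrative subset of the schema only: the actual server defines richer pydantic models (with validation) in fastchat/protocol/openai_api_protocol.py, and the field defaults shown here are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ChatMessage:
    role: str          # "system", "user", or "assistant"
    content: str

@dataclass
class ChatCompletionRequest:
    # Subset of OpenAI-compatible request fields
    model: str
    messages: List[ChatMessage]
    temperature: Optional[float] = 0.7
    top_p: Optional[float] = 1.0
    max_tokens: Optional[int] = None
    stream: bool = False

@dataclass
class UsageInfo:
    # Token accounting returned in the "usage" field of responses
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

# An OpenAI-style request body maps directly onto this shape:
req = ChatCompletionRequest(
    model="vicuna-7b-v1.5",
    messages=[ChatMessage(role="user", content="Hello")],
)
```

Because the field names match the OpenAI schema, a JSON request body deserializes into this structure without any renaming.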
SSE Streaming
For streaming responses, the server uses Server-Sent Events (SSE) following the same protocol as OpenAI's streaming API. When stream=true is set in the request:
- The server returns a `text/event-stream` response
- Each event is a line prefixed with `data: ` containing a JSON chunk
- Chat completion chunks use the `delta` field (with role in the first chunk, content in subsequent chunks)
- The stream terminates with `data: [DONE]`
This streaming protocol enables real-time token display in client applications, providing a responsive user experience even for long generations.
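The chunk sequence described above can be sketched as a generator that yields SSE events in the OpenAI streaming shape. The `id` value and chunking granularity below are illustrative assumptions, not the server's actual behavior.

```python
import json

def sse_chunks(model, text, chunk_size=4):
    """Yield Server-Sent Events mirroring OpenAI's streaming format:
    a role-only delta first, then content deltas, then [DONE]."""
    def event(delta, finish_reason=None):
        payload = {
            "id": "chatcmpl-demo",  # illustrative id
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": delta,
                         "finish_reason": finish_reason}],
        }
        return f"data: {json.dumps(payload)}\n\n"

    yield event({"role": "assistant"})             # first chunk carries the role
    for i in range(0, len(text), chunk_size):      # subsequent chunks carry content
        yield event({"content": text[i:i + chunk_size]})
    yield event({}, finish_reason="stop")          # final chunk sets finish_reason
    yield "data: [DONE]\n\n"                       # stream terminator
```

A client reassembles the completion by concatenating the `content` fragments from each `delta` until it sees the `[DONE]` sentinel.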
CORS Configuration
The server supports configurable Cross-Origin Resource Sharing (CORS) middleware, enabling browser-based JavaScript applications to directly call the API. CORS settings are specified at startup:
- Allowed origins -- Which domains can make requests (default: all)
- Allowed methods -- Which HTTP methods are permitted (default: all)
- Allowed headers -- Which headers are accepted (default: all)
- Allow credentials -- Whether cookies and auth headers are forwarded
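The startup-flag-to-middleware mapping can be sketched as below. The flag names and the keyword-argument names (which follow Starlette's CORSMiddleware convention) are assumptions for illustration.

```python
import argparse

def build_cors_settings(argv=None):
    """Parse CORS-related startup flags (names illustrative) into
    keyword arguments for a CORS middleware."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--allowed-origins", nargs="+", default=["*"])
    parser.add_argument("--allowed-methods", nargs="+", default=["*"])
    parser.add_argument("--allowed-headers", nargs="+", default=["*"])
    parser.add_argument("--allow-credentials", action="store_true")
    args = parser.parse_args(argv)
    return {
        "allow_origins": args.allowed_origins,
        "allow_methods": args.allowed_methods,
        "allow_headers": args.allowed_headers,
        "allow_credentials": args.allow_credentials,
    }

# With no flags, everything is allowed and credentials are off,
# matching the defaults listed above.
settings = build_cors_settings([])
```

In a FastAPI app these settings would be passed straight to `app.add_middleware(CORSMiddleware, **settings)`.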
API Key Authentication
The server supports optional API key authentication via Bearer tokens. When API keys are configured:
- All protected endpoints require an `Authorization: Bearer <key>` header
- Requests without a valid key receive a 401 Unauthorized response with an OpenAI-format error body
- When no API keys are configured, all requests are allowed (open access)
This enables basic access control for deployments exposed to untrusted networks.
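The check described above reduces to a small pure function. The error-body fields below follow the general OpenAI error shape but are illustrative, not the server's exact strings.

```python
def check_api_key(auth_header, configured_keys):
    """Return (ok, error_body) following the policy above: open access
    when no keys are configured, otherwise require a known Bearer token."""
    if not configured_keys:
        return True, None                       # open access
    if auth_header and auth_header.startswith("Bearer "):
        token = auth_header[len("Bearer "):]
        if token in configured_keys:
            return True, None                   # valid key
    # Body for the 401 Unauthorized response (illustrative wording)
    return False, {
        "error": {
            "message": "Invalid API key",
            "type": "invalid_request_error",
            "code": 401,
        }
    }
```

In FastAPI this logic would typically live in a dependency so every protected route shares it.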
Request Routing Through Controller to Workers
The API server does not perform inference directly. Instead, it follows a multi-step routing process for each request:
- Model validation -- Check with the controller that the requested model exists
- Worker address resolution -- Query the controller for an available worker address via `/get_worker_address`
- Parameter construction -- Translate OpenAI request parameters into FastChat's internal generation parameters, including conversation template application
- Context length validation -- Check with the worker that the prompt fits within the model's context window
- Request forwarding -- Send the generation request to the worker and stream or collect the response
- Response formatting -- Wrap the worker's output in OpenAI-format response objects
This separation of concerns allows the API server to be stateless and horizontally scalable, while the controller handles load balancing and the workers handle computation.
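The six routing steps can be walked through with stub controller and worker objects standing in for the HTTP services. Everything here is a sketch: the real server makes async HTTP calls, applies the model's conversation template in step 3, and measures context length in tokens rather than characters.

```python
class StubController:
    """Stand-in for the controller service (real calls are HTTP)."""
    def list_models(self):
        return ["vicuna-7b-v1.5"]
    def get_worker_address(self, model):
        return "http://worker:21002"

class StubWorker:
    """Stand-in for a model worker (real calls are HTTP)."""
    def context_length(self, addr):
        return 4096
    def generate(self, addr, gen_params):
        return "Hello!"

def handle_chat_completion(req, controller, worker):
    # 1. Model validation
    if req["model"] not in controller.list_models():
        return {"error": {"message": f"model {req['model']} not found"}}
    # 2. Worker address resolution (the controller's /get_worker_address)
    addr = controller.get_worker_address(req["model"])
    # 3. Parameter construction (the real server applies the model's
    #    conversation template here instead of joining raw contents)
    prompt = "\n".join(m["content"] for m in req["messages"])
    gen_params = {
        "prompt": prompt,
        "temperature": req.get("temperature", 0.7),
        "max_new_tokens": req.get("max_tokens", 256),
    }
    # 4. Context length validation (real code counts tokens, not chars)
    if len(prompt) > worker.context_length(addr):
        return {"error": {"message": "context length exceeded"}}
    # 5. Request forwarding
    text = worker.generate(addr, gen_params)
    # 6. Response formatting in the OpenAI shape
    return {
        "object": "chat.completion",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    }

resp = handle_chat_completion(
    {"model": "vicuna-7b-v1.5",
     "messages": [{"role": "user", "content": "Hi"}]},
    StubController(), StubWorker(),
)
```

Injecting the controller and worker as interfaces is what keeps the API server itself stateless.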
Conversation Template Application
A key aspect of the compatibility layer is translating OpenAI-style message arrays into model-specific prompt formats. The API server:
- Retrieves the conversation template from the worker for the requested model
- Applies system messages, user messages, and assistant messages to the template
- Generates the final prompt string with appropriate separators and role markers
- Handles multimodal content (images) by extracting image URLs and formatting them for vision models
This ensures that each model receives input in its expected format, regardless of the uniform OpenAI request structure.
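A minimal sketch of this translation is shown below. The separators and role markers are invented for illustration; each real model's template (as defined in FastChat's conversation module) differs.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ConversationTemplate:
    """Minimal stand-in for a model's conversation template;
    the markers below are illustrative, not any real model's format."""
    system_prefix: str = "SYSTEM: "
    roles: Tuple[str, str] = ("USER", "ASSISTANT")
    sep: str = "\n"

def apply_template(tmpl, openai_messages):
    """Map an OpenAI-style message array onto the template and
    build the final prompt string."""
    parts = []
    for m in openai_messages:
        if m["role"] == "system":
            parts.append(tmpl.system_prefix + m["content"])
        elif m["role"] == "user":
            parts.append(f"{tmpl.roles[0]}: {m['content']}")
        else:  # assistant
            parts.append(f"{tmpl.roles[1]}: {m['content']}")
    # Leave an open assistant turn so the model continues from here
    parts.append(f"{tmpl.roles[1]}:")
    return tmpl.sep.join(parts)

prompt = apply_template(ConversationTemplate(), [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi"},
])
```

The trailing open assistant marker is the key trick: the model's generation is a natural continuation of the formatted conversation.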
Usage
The OpenAI-compatible API server is the primary interface for programmatic access to FastChat models. It is deployed as part of the standard three-process architecture:
- Start the controller: `python3 -m fastchat.serve.controller`
- Start model worker(s): `python3 -m fastchat.serve.model_worker --model-path <path>`
- Start the API server: `python3 -m fastchat.serve.openai_api_server --port 8000`
Applications then interact with the API server exactly as they would with OpenAI's API, simply by changing the base URL.
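For example, a client built against the OpenAI API only needs its base URL pointed at the local server. The sketch below uses the standard library rather than the `openai` client package; the model name and `EMPTY` placeholder key are assumptions.

```python
import json
import urllib.request

# The only client-side change: point at the local server instead of
# https://api.openai.com/v1 (port 8000 matches the startup command above).
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(messages, model="vicuna-7b-v1.5", api_key="EMPTY"):
    """Build an OpenAI-style chat completion request for the local server."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

def chat(messages, **kw):
    req = build_chat_request(messages, **kw)
    with urllib.request.urlopen(req) as resp:  # requires a running server
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

The same swap works with the official `openai` client by setting its `base_url` (and a dummy API key) at construction time.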
Theoretical Basis
- API Compatibility / Facade Pattern -- The server acts as a facade that presents a standardized interface (OpenAI API) while delegating to a different internal implementation. This pattern enables ecosystem reuse: any tool, library, or application built for the OpenAI API can work with FastChat without modification.
- Server-Sent Events (SSE) -- SSE is a web standard (now part of the WHATWG HTML Living Standard) for server-to-client streaming over HTTP. Unlike WebSockets, SSE is unidirectional (server to client), uses standard HTTP, works through proxies, and supports automatic reconnection. OpenAI chose SSE for streaming completions, and FastChat mirrors this choice.
- Separation of Concerns -- The three-process architecture (API server, controller, worker) separates the API interface, routing logic, and computation into independent services. This enables independent scaling and failure isolation.
- Stateless API Server -- The API server holds no inference state between requests. All model state resides in workers, and all routing state resides in the controller. This makes the API server trivially scalable behind a load balancer.
Related Pages
- Implementation:Lm_sys_FastChat_OpenAI_API_Server -- API documentation for the OpenAI API server implementation
- Principle:Lm_sys_FastChat_Worker_Dispatch_Control -- Controller dispatch that the API server relies on for worker routing
- Principle:Lm_sys_FastChat_OpenAI_Client_Interaction -- How clients interact with this API
- Principle:Lm_sys_FastChat_Model_Worker_Inference -- The worker-side inference that processes forwarded requests