Implementation:Lm_sys_FastChat_OpenAI_API_Server
| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Repository | lm-sys/FastChat |
| Domain | REST API Design, API Compatibility, Streaming Protocols |
| Knowledge Sources | Source code analysis of fastchat/serve/openai_api_server.py, fastchat/protocol/openai_api_protocol.py |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving |
Overview
This page documents the OpenAI-compatible API server implemented in FastChat. The server provides REST endpoints that mirror the OpenAI API specification, allowing existing OpenAI client applications to interact with self-hosted language models without code changes. It handles request validation, worker routing via the controller, conversation template application, streaming and non-streaming response generation, embeddings, and API key authentication.
Description
The OpenAI API server is a FastAPI application that translates between the OpenAI REST API protocol and FastChat's internal worker-based inference system. For each request, it validates the model and parameters, obtains a worker address from the controller, constructs generation parameters (including applying the model-specific conversation template), forwards the request to the worker, and formats the response in OpenAI-compatible JSON.
The server uses aiohttp for async communication with the controller and httpx for streaming responses from workers. The AppSettings configuration holds the controller address and optional API keys. CORS middleware is added based on CLI parameters.
Key protocol models are defined in fastchat.protocol.openai_api_protocol, including ChatCompletionRequest, ChatCompletionResponse, CompletionRequest, CompletionResponse, EmbeddingsRequest, and EmbeddingsResponse.
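Because the wire format matches OpenAI's, a request body can be built as plain JSON and POSTed to the server with any HTTP client. The sketch below uses only the standard library; the helper names `build_chat_request` and `send_chat_request` are illustrative, not part of FastChat:

```python
import json
import urllib.request


def build_chat_request(model, content, temperature=0.7, stream=False):
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
        "stream": stream,
    }


def send_chat_request(base_url, payload):
    """POST the payload to a running FastChat API server."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    payload = build_chat_request("vicuna-7b-v1.5", "Hello!")
    # Sending requires a server running at http://localhost:8000:
    # print(send_chat_request("http://localhost:8000", payload))
```

The same payload works unchanged with the official `openai` client pointed at the server's base URL, which is the point of the compatibility layer.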
Usage
Start the API server from the command line:
```shell
python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --controller-address http://localhost:21001
```
Use programmatically:
```python
from fastchat.serve.openai_api_server import create_openai_api_server

# Parses CLI arguments, configures the module-level FastAPI app,
# and returns the parsed argparse.Namespace.
args = create_openai_api_server()
```
Code Reference
Source Location
| Component | File | Lines |
|---|---|---|
| create_chat_completion endpoint | fastchat/serve/openai_api_server.py | L411-483 |
| create_openai_api_server factory | fastchat/serve/openai_api_server.py | L878-924 |
| chat_completion_stream_generator | fastchat/serve/openai_api_server.py | L486-539 |
| create_completion endpoint | fastchat/serve/openai_api_server.py | L542-618 |
| create_embeddings endpoint | fastchat/serve/openai_api_server.py | L706-751 |
| show_available_models endpoint | fastchat/serve/openai_api_server.py | L397-408 |
| get_gen_params helper | fastchat/serve/openai_api_server.py | L266-364 |
| check_api_key dependency | fastchat/serve/openai_api_server.py | L109-128 |
| AppSettings config | fastchat/serve/openai_api_server.py | L97-100 |
| ChatCompletionRequest model | fastchat/protocol/openai_api_protocol.py | L58-74 |
| ChatCompletionResponse model | fastchat/protocol/openai_api_protocol.py | L88-94 |
Signature
```python
def create_openai_api_server() -> argparse.Namespace: ...

# Key endpoint handlers
async def create_chat_completion(
    request: ChatCompletionRequest,
) -> Union[ChatCompletionResponse, StreamingResponse, JSONResponse]: ...
async def create_completion(
    request: CompletionRequest,
) -> Union[CompletionResponse, StreamingResponse, JSONResponse]: ...
async def create_embeddings(
    request: EmbeddingsRequest, model_name: str = None
) -> Union[dict, JSONResponse]: ...
async def show_available_models() -> ModelList: ...

# Internal helpers
async def get_gen_params(
    model_name: str,
    worker_addr: str,
    messages: Union[str, List[Dict[str, str]]],
    *,
    temperature: float,
    top_p: float,
    top_k: Optional[int],
    presence_penalty: Optional[float],
    frequency_penalty: Optional[float],
    max_tokens: Optional[int],
    echo: Optional[bool],
    logprobs: Optional[int] = None,
    stop: Optional[Union[str, List[str]]],
    best_of: Optional[int] = None,
    use_beam_search: Optional[bool] = None,
) -> Dict[str, Any]: ...
async def get_worker_address(model_name: str) -> str: ...
async def check_model(request) -> Optional[JSONResponse]: ...
async def check_length(
    request, prompt, max_tokens, worker_addr
) -> Tuple[int, Optional[JSONResponse]]: ...
```
Import
```python
from fastchat.serve.openai_api_server import create_openai_api_server
from fastchat.protocol.openai_api_protocol import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    CompletionRequest,
    CompletionResponse,
    EmbeddingsRequest,
    EmbeddingsResponse,
    UsageInfo,
)
```
I/O Contract
CLI Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--host` | str | `"localhost"` | Host address to bind the API server |
| `--port` | int | `8000` | Port number for the API server |
| `--controller-address` | str | `"http://localhost:21001"` | Address of the FastChat controller |
| `--api-keys` | str | `None` | Comma-separated list of valid API keys |
| `--allow-credentials` | flag | `False` | Allow CORS credentials |
| `--allowed-origins` | JSON list | `["*"]` | Allowed CORS origins |
| `--allowed-methods` | JSON list | `["*"]` | Allowed CORS methods |
| `--allowed-headers` | JSON list | `["*"]` | Allowed CORS headers |
| `--ssl` | flag | `False` | Enable SSL (requires SSL_KEYFILE and SSL_CERTFILE env vars) |
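When `--api-keys` is set, every authenticated route requires a matching Bearer token; with no keys configured, all requests pass. A simplified, framework-free sketch of that decision (the real check_api_key is a FastAPI dependency built on HTTPBearer, and `is_authorized` here is an illustrative name):

```python
from typing import List, Optional


def is_authorized(api_keys: Optional[List[str]], auth_header: Optional[str]) -> bool:
    """Return True if the request may proceed.

    api_keys is None/empty when --api-keys was not given: auth is disabled.
    auth_header is the raw Authorization header, e.g. "Bearer sk-key1".
    """
    if not api_keys:
        return True  # no keys configured -> open access
    if not auth_header or not auth_header.startswith("Bearer "):
        return False  # server answers 401 Unauthorized
    token = auth_header[len("Bearer "):]
    return token in api_keys
```

A failed check corresponds to the 401 Unauthorized row in the Error Handling table below.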
REST API Routes
| Method | Route | Auth | Request Body | Response |
|---|---|---|---|---|
| GET | `/v1/models` | API key | None | ModelList with `data: List[ModelCard]` |
| POST | `/v1/chat/completions` | API key | ChatCompletionRequest | ChatCompletionResponse or SSE stream |
| POST | `/v1/completions` | API key | CompletionRequest | CompletionResponse or SSE stream |
| POST | `/v1/embeddings` | API key | EmbeddingsRequest | EmbeddingsResponse |
| POST | `/v1/engines/{model_name}/embeddings` | API key | EmbeddingsRequest | EmbeddingsResponse |
| POST | `/api/v1/token_check` | None | APITokenCheckRequest | APITokenCheckResponse |
| POST | `/api/v1/chat/completions` | None | APIChatCompletionRequest | ChatCompletionResponse or SSE stream |
ChatCompletionRequest Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `model` | str | (required) | Model identifier |
| `messages` | List[Dict] | (required) | Conversation messages with role and content |
| `temperature` | float | 0.7 | Sampling temperature |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `top_k` | int | -1 | Top-k sampling (-1 to disable) |
| `n` | int | 1 | Number of completions to generate |
| `max_tokens` | int | None | Maximum tokens to generate |
| `stop` | str or List[str] | None | Stop sequence(s) |
| `stream` | bool | False | Enable SSE streaming |
| `presence_penalty` | float | 0.0 | Presence penalty |
| `frequency_penalty` | float | 0.0 | Frequency penalty |
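The fields and defaults above can be mirrored in a small dataclass. This is only a sketch of the shape; the real model is a pydantic class in fastchat.protocol.openai_api_protocol, and `ChatRequestSketch` is an illustrative name:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class ChatRequestSketch:
    """Shape of ChatCompletionRequest with its documented defaults."""
    model: str
    messages: List[Dict[str, str]]
    temperature: float = 0.7
    top_p: float = 1.0
    top_k: int = -1            # -1 disables top-k sampling
    n: int = 1
    max_tokens: Optional[int] = None
    stop: Optional[Union[str, List[str]]] = None
    stream: bool = False
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
```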
ChatCompletionResponse Fields
| Field | Type | Description |
|---|---|---|
| `id` | str | Unique completion ID (format: `chatcmpl-{shortuuid}`) |
| `object` | str | Always `"chat.completion"` |
| `created` | int | Unix timestamp |
| `model` | str | Model identifier used |
| `choices` | List[ChatCompletionResponseChoice] | Each has index, message (role, content), finish_reason |
| `usage` | UsageInfo | prompt_tokens, completion_tokens, total_tokens |
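A response of this shape can be assembled as follows. This is a hedged sketch: FastChat generates IDs with the shortuuid package, so uuid4 hex is used here only as a stand-in, and `make_chat_completion_response` is an illustrative name:

```python
import time
import uuid


def make_chat_completion_response(model, content, prompt_tokens, completion_tokens):
    """Build a dict matching the ChatCompletionResponse schema above."""
    return {
        "id": "chatcmpl-" + uuid.uuid4().hex[:22],  # stand-in for shortuuid
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": "stop",
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```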
Request Processing Flow (create_chat_completion)
- `check_model` -- Verify the requested model exists via controller's `/list_models`
- `check_requests` -- Validate parameter ranges (`max_tokens > 0`, `0 <= temperature <= 2`, `0 <= top_p <= 1`, etc.)
- `get_worker_address` -- Obtain a worker address from the controller via `/get_worker_address`
- `get_gen_params` -- Fetch conversation template from the worker, apply messages to template, construct generation parameters dict
- `check_length` -- Verify prompt + max_tokens fits within the model's context window via worker's `/count_token` and `/model_details`
- Dispatch -- If `stream=true`, return `StreamingResponse` from `chat_completion_stream_generator`; otherwise, gather `n` async completions and return `ChatCompletionResponse`
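The check_requests parameter validation can be sketched as a pure function that reports the first violated constraint. The real implementation returns a JSONResponse carrying ErrorCode.PARAM_OUT_OF_RANGE; `find_param_error` is an illustrative name:

```python
from typing import Optional


def find_param_error(max_tokens: Optional[int], temperature: float,
                     top_p: float, n: int) -> Optional[str]:
    """Return a description of the first out-of-range parameter, or None."""
    if max_tokens is not None and max_tokens <= 0:
        return "max_tokens must be at least 1"
    if not 0 <= temperature <= 2:
        return "temperature must be in [0, 2]"
    if not 0 <= top_p <= 1:
        return "top_p must be in [0, 1]"
    if n <= 0:
        return "n must be at least 1"
    return None
```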
Error Handling
| Error Code | Constant | Condition |
|---|---|---|
| INVALID_MODEL | `ErrorCode.INVALID_MODEL` | Requested model not in controller's model list |
| PARAM_OUT_OF_RANGE | `ErrorCode.PARAM_OUT_OF_RANGE` | Parameter validation failure (temperature, top_p, max_tokens, etc.) |
| CONTEXT_OVERFLOW | `ErrorCode.CONTEXT_OVERFLOW` | Prompt tokens exceed model's context length |
| INTERNAL_ERROR | `ErrorCode.INTERNAL_ERROR` | Worker-side error or async task failure |
| 401 Unauthorized | HTTP status | Invalid or missing API key when keys are configured |
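These errors reach the client as a JSON body carrying the message and numeric code. A sketch of that shape, assuming the ErrorResponse model's fields (`object`, `message`, `code`); FastChat's create_error_response wraps such a body in a fastapi JSONResponse, and `error_body` here is an illustrative name:

```python
def error_body(code: int, message: str) -> dict:
    """Error payload returned in place of a completion on failure."""
    return {
        "object": "error",   # assumed discriminator, mirroring ErrorResponse
        "message": message,
        "code": code,
    }
```

Clients should therefore check for `"object": "error"` (or a non-2xx status) before parsing a completion.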
Usage Examples
Starting the API Server
```shell
# Basic startup
python3 -m fastchat.serve.openai_api_server

# With API key authentication
python3 -m fastchat.serve.openai_api_server \
    --api-keys "sk-key1,sk-key2"

# With custom CORS and SSL
SSL_KEYFILE=/path/to/key.pem SSL_CERTFILE=/path/to/cert.pem \
python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 \
    --port 443 \
    --ssl \
    --allowed-origins '["https://myapp.example.com"]'
```
Chat Completion (Non-Streaming)
```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Hello! What is your name?"}],
        "temperature": 0.7
    }'
```
Example response:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1707307200,
  "model": "vicuna-7b-v1.5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I am Vicuna, a language model trained by researchers from LMSYS."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 22,
    "total_tokens": 34
  }
}
```
Chat Completion (Streaming)
```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Tell me a joke."}],
        "stream": true
    }'
```
Example streamed response:
```
data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}

data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}

data: [DONE]
```
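A client reassembles the reply by parsing each `data:` line, concatenating `delta.content`, and stopping at the `[DONE]` sentinel. A minimal stdlib parser (the function name `collect_stream` is illustrative):

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Concatenate delta.content from OpenAI-style SSE chunk lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break     # end-of-stream sentinel, not JSON
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))  # first chunk carries only the role
    return "".join(parts)
```

Fed the three chunks shown above, this yields the partial string "Why did".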
Related Pages
- Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving -- The principle this implementation realizes
- Implementation:Lm_sys_FastChat_Controller_Dispatch -- The controller used for model validation and worker routing
- Implementation:Lm_sys_FastChat_ModelWorker_Load_And_Generate -- The workers that process forwarded requests
- Implementation:Lm_sys_FastChat_OpenAI_Chat_Completion_Client -- Client-side usage patterns for this API
- Environment:Lm_sys_FastChat_GPU_CUDA_Inference
- Environment:Lm_sys_FastChat_API_Keys_And_Credentials