Implementation:Lm_sys_FastChat_OpenAI_API_Server
| Field | Value |
|---|---|
| Page Type | Implementation (API Doc) |
| Repository | lm-sys/FastChat |
| Domain | REST API Design, API Compatibility, Streaming Protocols |
| Knowledge Sources | Source code analysis of fastchat/serve/openai_api_server.py, fastchat/protocol/openai_api_protocol.py |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving |
Overview
This page documents the OpenAI-compatible API server implemented in FastChat. The server provides REST endpoints that mirror the OpenAI API specification, allowing existing OpenAI client applications to interact with self-hosted language models without code changes. It handles request validation, worker routing via the controller, conversation template application, streaming and non-streaming response generation, embeddings, and API key authentication.
Description
The OpenAI API server is a FastAPI application that translates between the OpenAI REST API protocol and FastChat's internal worker-based inference system. For each request, it validates the model and parameters, obtains a worker address from the controller, constructs generation parameters (including applying the model-specific conversation template), forwards the request to the worker, and formats the response in OpenAI-compatible JSON.
The server uses aiohttp for async communication with the controller and httpx for streaming responses from workers. The AppSettings configuration holds the controller address and optional API keys. CORS middleware is added based on CLI parameters.
Key protocol models are defined in fastchat.protocol.openai_api_protocol, including ChatCompletionRequest, ChatCompletionResponse, CompletionRequest, CompletionResponse, EmbeddingsRequest, and EmbeddingsResponse.
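Because the wire format matches OpenAI's, a request body can be built as plain JSON and POSTed to the server with any HTTP client. The sketch below uses only the standard library; the helper names `build_chat_request` and `send_chat_request` are illustrative, not part of FastChat:

```python
import json
import urllib.request


def build_chat_request(model, content, temperature=0.7, stream=False):
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": temperature,
        "stream": stream,
    }


def send_chat_request(base_url, payload):
    """POST the payload to a running FastChat API server."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    payload = build_chat_request("vicuna-7b-v1.5", "Hello!")
    # Sending requires a server running at http://localhost:8000:
    # print(send_chat_request("http://localhost:8000", payload))
```

The same payload works unchanged with the official `openai` client pointed at the server's base URL, which is the point of the compatibility layer.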
Usage
Start the API server from the command line:
```shell
python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --controller-address http://localhost:21001
```
Use programmatically:
```python
from fastchat.serve.openai_api_server import create_openai_api_server

# Parses CLI arguments, configures the module-level FastAPI app,
# and returns the parsed argparse.Namespace.
args = create_openai_api_server()
```
Code Reference
Source Location
| Component | File | Lines |
|---|---|---|
| create_chat_completion endpoint | fastchat/serve/openai_api_server.py | L411-483 |
| create_openai_api_server factory | fastchat/serve/openai_api_server.py | L878-924 |
| chat_completion_stream_generator | fastchat/serve/openai_api_server.py | L486-539 |
| create_completion endpoint | fastchat/serve/openai_api_server.py | L542-618 |
| create_embeddings endpoint | fastchat/serve/openai_api_server.py | L706-751 |
| show_available_models endpoint | fastchat/serve/openai_api_server.py | L397-408 |
| get_gen_params helper | fastchat/serve/openai_api_server.py | L266-364 |
| check_api_key dependency | fastchat/serve/openai_api_server.py | L109-128 |
| AppSettings config | fastchat/serve/openai_api_server.py | L97-100 |
| ChatCompletionRequest model | fastchat/protocol/openai_api_protocol.py | L58-74 |
| ChatCompletionResponse model | fastchat/protocol/openai_api_protocol.py | L88-94 |
Signature
```python
def create_openai_api_server() -> argparse.Namespace: ...

# Key endpoint handlers
async def create_chat_completion(
    request: ChatCompletionRequest,
) -> Union[ChatCompletionResponse, StreamingResponse, JSONResponse]: ...
async def create_completion(
    request: CompletionRequest,
) -> Union[CompletionResponse, StreamingResponse, JSONResponse]: ...
async def create_embeddings(
    request: EmbeddingsRequest, model_name: str = None
) -> Union[dict, JSONResponse]: ...
async def show_available_models() -> ModelList: ...

# Internal helpers
async def get_gen_params(
    model_name: str,
    worker_addr: str,
    messages: Union[str, List[Dict[str, str]]],
    *,
    temperature: float,
    top_p: float,
    top_k: Optional[int],
    presence_penalty: Optional[float],
    frequency_penalty: Optional[float],
    max_tokens: Optional[int],
    echo: Optional[bool],
    logprobs: Optional[int] = None,
    stop: Optional[Union[str, List[str]]],
    best_of: Optional[int] = None,
    use_beam_search: Optional[bool] = None,
) -> Dict[str, Any]: ...
async def get_worker_address(model_name: str) -> str: ...
async def check_model(request) -> Optional[JSONResponse]: ...
async def check_length(
    request, prompt, max_tokens, worker_addr
) -> Tuple[int, Optional[JSONResponse]]: ...
```
Import
```python
from fastchat.serve.openai_api_server import create_openai_api_server
from fastchat.protocol.openai_api_protocol import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    CompletionRequest,
    CompletionResponse,
    EmbeddingsRequest,
    EmbeddingsResponse,
    UsageInfo,
)
```
I/O Contract
CLI Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--host` | str | `"localhost"` | Host address to bind the API server |
| `--port` | int | `8000` | Port number for the API server |
| `--controller-address` | str | `"http://localhost:21001"` | Address of the FastChat controller |
| `--api-keys` | str | `None` | Comma-separated list of valid API keys |
| `--allow-credentials` | flag | `False` | Allow CORS credentials |
| `--allowed-origins` | JSON list | `["*"]` | Allowed CORS origins |
| `--allowed-methods` | JSON list | `["*"]` | Allowed CORS methods |
| `--allowed-headers` | JSON list | `["*"]` | Allowed CORS headers |
| `--ssl` | flag | `False` | Enable SSL (requires SSL_KEYFILE and SSL_CERTFILE env vars) |
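When `--api-keys` is set, every authenticated route requires a matching Bearer token; with no keys configured, all requests pass. A simplified, framework-free sketch of that decision (the real check_api_key is a FastAPI dependency built on HTTPBearer, and `is_authorized` here is an illustrative name):

```python
from typing import List, Optional


def is_authorized(api_keys: Optional[List[str]], auth_header: Optional[str]) -> bool:
    """Return True if the request may proceed.

    api_keys is None/empty when --api-keys was not given: auth is disabled.
    auth_header is the raw Authorization header, e.g. "Bearer sk-key1".
    """
    if not api_keys:
        return True  # no keys configured -> open access
    if not auth_header or not auth_header.startswith("Bearer "):
        return False  # server answers 401 Unauthorized
    token = auth_header[len("Bearer "):]
    return token in api_keys
```

A failed check corresponds to the 401 Unauthorized row in the Error Handling table below.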
REST API Routes
| Method | Route | Auth | Request Body | Response |
|---|---|---|---|---|
| GET | `/v1/models` | API key | None | ModelList with `data: List[ModelCard]` |
| POST | `/v1/chat/completions` | API key | ChatCompletionRequest | ChatCompletionResponse or SSE stream |
| POST | `/v1/completions` | API key | CompletionRequest | CompletionResponse or SSE stream |
| POST | `/v1/embeddings` | API key | EmbeddingsRequest | EmbeddingsResponse |
| POST | `/v1/engines/{model_name}/embeddings` | API key | EmbeddingsRequest | EmbeddingsResponse |
| POST | `/api/v1/token_check` | None | APITokenCheckRequest | APITokenCheckResponse |
| POST | `/api/v1/chat/completions` | None | APIChatCompletionRequest | ChatCompletionResponse or SSE stream |
ChatCompletionRequest Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `model` | str | (required) | Model identifier |
| `messages` | List[Dict] | (required) | Conversation messages with role and content |
| `temperature` | float | 0.7 | Sampling temperature |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `top_k` | int | -1 | Top-k sampling (-1 to disable) |
| `n` | int | 1 | Number of completions to generate |
| `max_tokens` | int | None | Maximum tokens to generate |
| `stop` | str or List[str] | None | Stop sequence(s) |
| `stream` | bool | False | Enable SSE streaming |
| `presence_penalty` | float | 0.0 | Presence penalty |
| `frequency_penalty` | float | 0.0 | Frequency penalty |
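The fields and defaults above can be mirrored in a small dataclass. This is only a sketch of the shape; the real model is a pydantic class in fastchat.protocol.openai_api_protocol, and `ChatRequestSketch` is an illustrative name:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class ChatRequestSketch:
    """Shape of ChatCompletionRequest with its documented defaults."""
    model: str
    messages: List[Dict[str, str]]
    temperature: float = 0.7
    top_p: float = 1.0
    top_k: int = -1            # -1 disables top-k sampling
    n: int = 1
    max_tokens: Optional[int] = None
    stop: Optional[Union[str, List[str]]] = None
    stream: bool = False
    presence_penalty: float = 0.0
    frequency_penalty: float = 0.0
```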
ChatCompletionResponse Fields
| Field | Type | Description |
|---|---|---|
| `id` | str | Unique completion ID (format: `chatcmpl-{shortuuid}`) |
| `object` | str | Always `"chat.completion"` |
| `created` | int | Unix timestamp |
| `model` | str | Model identifier used |
| `choices` | List[ChatCompletionResponseChoice] | Each has index, message (role, content), finish_reason |
| `usage` | UsageInfo | prompt_tokens, completion_tokens, total_tokens |
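A response of this shape can be assembled as follows. This is a hedged sketch: FastChat generates IDs with the shortuuid package, so uuid4 hex is used here only as a stand-in, and `make_chat_completion_response` is an illustrative name:

```python
import time
import uuid


def make_chat_completion_response(model, content, prompt_tokens, completion_tokens):
    """Build a dict matching the ChatCompletionResponse schema above."""
    return {
        "id": "chatcmpl-" + uuid.uuid4().hex[:22],  # stand-in for shortuuid
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": "stop",
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```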
Request Processing Flow (create_chat_completion)
- `check_model` -- Verify the requested model exists via controller's `/list_models`
- `check_requests` -- Validate parameter ranges (`max_tokens > 0`, `0 <= temperature <= 2`, `0 <= top_p <= 1`, etc.)
- `get_worker_address` -- Obtain a worker address from the controller via `/get_worker_address`
- `get_gen_params` -- Fetch conversation template from the worker, apply messages to template, construct generation parameters dict
- `check_length` -- Verify prompt + max_tokens fits within the model's context window via worker's `/count_token` and `/model_details`
- Dispatch -- If `stream=true`, return `StreamingResponse` from `chat_completion_stream_generator`; otherwise, gather `n` async completions and return `ChatCompletionResponse`
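The check_requests parameter validation can be sketched as a pure function that reports the first violated constraint. The real implementation returns a JSONResponse carrying ErrorCode.PARAM_OUT_OF_RANGE; `find_param_error` is an illustrative name:

```python
from typing import Optional


def find_param_error(max_tokens: Optional[int], temperature: float,
                     top_p: float, n: int) -> Optional[str]:
    """Return a description of the first out-of-range parameter, or None."""
    if max_tokens is not None and max_tokens <= 0:
        return "max_tokens must be at least 1"
    if not 0 <= temperature <= 2:
        return "temperature must be in [0, 2]"
    if not 0 <= top_p <= 1:
        return "top_p must be in [0, 1]"
    if n <= 0:
        return "n must be at least 1"
    return None
```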
Error Handling
| Error Code | Constant | Condition |
|---|---|---|
| INVALID_MODEL | `ErrorCode.INVALID_MODEL` | Requested model not in controller's model list |
| PARAM_OUT_OF_RANGE | `ErrorCode.PARAM_OUT_OF_RANGE` | Parameter validation failure (temperature, top_p, max_tokens, etc.) |
| CONTEXT_OVERFLOW | `ErrorCode.CONTEXT_OVERFLOW` | Prompt tokens exceed model's context length |
| INTERNAL_ERROR | `ErrorCode.INTERNAL_ERROR` | Worker-side error or async task failure |
| 401 Unauthorized | HTTP status | Invalid or missing API key when keys are configured |
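These errors reach the client as a JSON body carrying the message and numeric code. A sketch of that shape, assuming the ErrorResponse model's fields (`object`, `message`, `code`); FastChat's create_error_response wraps such a body in a fastapi JSONResponse, and `error_body` here is an illustrative name:

```python
def error_body(code: int, message: str) -> dict:
    """Error payload returned in place of a completion on failure."""
    return {
        "object": "error",   # assumed discriminator, mirroring ErrorResponse
        "message": message,
        "code": code,
    }
```

Clients should therefore check for `"object": "error"` (or a non-2xx status) before parsing a completion.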
Usage Examples
Starting the API Server
```shell
# Basic startup
python3 -m fastchat.serve.openai_api_server

# With API key authentication
python3 -m fastchat.serve.openai_api_server \
    --api-keys "sk-key1,sk-key2"

# With custom CORS and SSL
SSL_KEYFILE=/path/to/key.pem SSL_CERTFILE=/path/to/cert.pem \
python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 \
    --port 443 \
    --ssl \
    --allowed-origins '["https://myapp.example.com"]'
```
Chat Completion (Non-Streaming)
```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Hello! What is your name?"}],
        "temperature": 0.7
    }'
```
Example response:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1707307200,
  "model": "vicuna-7b-v1.5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I am Vicuna, a language model trained by researchers from LMSYS."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 22,
    "total_tokens": 34
  }
}
```
Chat Completion (Streaming)
```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Tell me a joke."}],
        "stream": true
    }'
```
Example streamed response:
```
data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}

data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}

data: [DONE]
```
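A client reassembles the reply by parsing each `data:` line, concatenating `delta.content`, and stopping at the `[DONE]` sentinel. A minimal stdlib parser (the function name `collect_stream` is illustrative):

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Concatenate delta.content from OpenAI-style SSE chunk lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break     # end-of-stream sentinel, not JSON
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))  # first chunk carries only the role
    return "".join(parts)
```

Fed the three chunks shown above, this yields the partial string "Why did".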
Related Pages
- Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving -- The principle this implementation realizes
- Implementation:Lm_sys_FastChat_Controller_Dispatch -- The controller used for model validation and worker routing
- Implementation:Lm_sys_FastChat_ModelWorker_Load_And_Generate -- The workers that process forwarded requests
- Implementation:Lm_sys_FastChat_OpenAI_Chat_Completion_Client -- Client-side usage patterns for this API
- Environment:Lm_sys_FastChat_GPU_CUDA_Inference
- Environment:Lm_sys_FastChat_API_Keys_And_Credentials