Implementation:Lm sys FastChat OpenAI API Server

From Leeroopedia


Field Value
Page Type Implementation (API Doc)
Repository lm-sys/FastChat
Domain REST API Design, API Compatibility, Streaming Protocols
Knowledge Sources Source code analysis of fastchat/serve/openai_api_server.py, fastchat/protocol/openai_api_protocol.py
Last Updated 2026-02-07 14:00 GMT
Implements Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving

Overview

This page documents the OpenAI-compatible API server implemented in FastChat. The server provides REST endpoints that mirror the OpenAI API specification, allowing existing OpenAI client applications to interact with self-hosted language models without code changes. It handles request validation, worker routing via the controller, conversation template application, streaming and non-streaming response generation, embeddings, and API key authentication.

Description

The OpenAI API server is a FastAPI application that translates between the OpenAI REST API protocol and FastChat's internal worker-based inference system. For each request, it validates the model and parameters, obtains a worker address from the controller, constructs generation parameters (including applying the model-specific conversation template), forwards the request to the worker, and formats the response in OpenAI-compatible JSON.

The server uses aiohttp for async communication with the controller and httpx for streaming responses from workers. The AppSettings configuration holds the controller address and optional API keys. CORS middleware is added based on CLI parameters.

Key protocol models are defined in fastchat.protocol.openai_api_protocol, including ChatCompletionRequest, ChatCompletionResponse, CompletionRequest, CompletionResponse, EmbeddingsRequest, and EmbeddingsResponse.
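The shape of these models can be illustrated with a plain dictionary that mirrors the documented ChatCompletionRequest fields and defaults (a sketch only: the real classes are pydantic models in fastchat.protocol.openai_api_protocol, and build_chat_request is a hypothetical helper, not FastChat code):

```python
import json

def build_chat_request(model, messages, **overrides):
    """Build an OpenAI-compatible chat payload with the documented defaults."""
    payload = {
        "model": model,                 # required
        "messages": messages,           # required: [{"role": ..., "content": ...}]
        "temperature": 0.7,             # documented defaults below
        "top_p": 1.0,
        "top_k": -1,                    # -1 disables top-k sampling
        "n": 1,
        "max_tokens": None,
        "stop": None,
        "stream": False,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
    }
    payload.update(overrides)
    return payload

req = build_chat_request(
    "vicuna-7b-v1.5",
    [{"role": "user", "content": "Hello!"}],
    stream=True,
)
print(json.dumps(req, indent=2))
```

Any payload of this shape can be posted to /v1/chat/completions; fields left at their defaults may simply be omitted.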

Usage

Start the API server from the command line:

python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --controller-address http://localhost:21001

Use programmatically (note that the factory parses CLI arguments, configures the app settings and CORS middleware, and returns the parsed argparse.Namespace):

from fastchat.serve.openai_api_server import create_openai_api_server

args = create_openai_api_server()

Code Reference

Source Location

Component File Lines
create_chat_completion endpoint fastchat/serve/openai_api_server.py L411-483
create_openai_api_server factory fastchat/serve/openai_api_server.py L878-924
chat_completion_stream_generator fastchat/serve/openai_api_server.py L486-539
create_completion endpoint fastchat/serve/openai_api_server.py L542-618
create_embeddings endpoint fastchat/serve/openai_api_server.py L706-751
show_available_models endpoint fastchat/serve/openai_api_server.py L397-408
get_gen_params helper fastchat/serve/openai_api_server.py L266-364
check_api_key dependency fastchat/serve/openai_api_server.py L109-128
AppSettings config fastchat/serve/openai_api_server.py L97-100
ChatCompletionRequest model fastchat/protocol/openai_api_protocol.py L58-74
ChatCompletionResponse model fastchat/protocol/openai_api_protocol.py L88-94

Signature

def create_openai_api_server() -> argparse.Namespace: ...

# Key endpoint handlers
async def create_chat_completion(request: ChatCompletionRequest) -> Union[ChatCompletionResponse, StreamingResponse, JSONResponse]: ...
async def create_completion(request: CompletionRequest) -> Union[CompletionResponse, StreamingResponse, JSONResponse]: ...
async def create_embeddings(request: EmbeddingsRequest, model_name: str = None) -> Union[dict, JSONResponse]: ...
async def show_available_models() -> ModelList: ...

# Internal helpers
async def get_gen_params(
    model_name: str,
    worker_addr: str,
    messages: Union[str, List[Dict[str, str]]],
    *,
    temperature: float,
    top_p: float,
    top_k: Optional[int],
    presence_penalty: Optional[float],
    frequency_penalty: Optional[float],
    max_tokens: Optional[int],
    echo: Optional[bool],
    logprobs: Optional[int] = None,
    stop: Optional[Union[str, List[str]]],
    best_of: Optional[int] = None,
    use_beam_search: Optional[bool] = None,
) -> Dict[str, Any]: ...

async def get_worker_address(model_name: str) -> str: ...
async def check_model(request) -> Optional[JSONResponse]: ...
async def check_length(request, prompt, max_tokens, worker_addr) -> Tuple[int, Optional[JSONResponse]]: ...

Import

from fastchat.serve.openai_api_server import create_openai_api_server
from fastchat.protocol.openai_api_protocol import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    CompletionRequest,
    CompletionResponse,
    EmbeddingsRequest,
    EmbeddingsResponse,
    UsageInfo,
)

I/O Contract

CLI Parameters

Parameter Type Default Description
--host str "localhost" Host address to bind the API server
--port int 8000 Port number for the API server
--controller-address str "http://localhost:21001" Address of the FastChat controller
--api-keys str None Comma-separated list of valid API keys
--allow-credentials flag False Allow CORS credentials
--allowed-origins JSON list ["*"] Allowed CORS origins
--allowed-methods JSON list ["*"] Allowed CORS methods
--allowed-headers JSON list ["*"] Allowed CORS headers
--ssl flag False Enable SSL (requires SSL_KEYFILE and SSL_CERTFILE env vars)
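These flags can be reproduced with a small argparse sketch. The json.loads type for the list-valued flags and the comma-split for --api-keys are assumptions inferred from the table above, not a copy of FastChat's parser:

```python
import argparse
import json

# Sketch of a parser matching the documented CLI flags.
parser = argparse.ArgumentParser(description="OpenAI-compatible API server")
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument("--controller-address", type=str,
                    default="http://localhost:21001")
parser.add_argument("--api-keys", type=lambda s: s.split(","), default=None)
parser.add_argument("--allow-credentials", action="store_true")
# "JSON list" flags accept literal JSON on the command line.
parser.add_argument("--allowed-origins", type=json.loads, default=["*"])
parser.add_argument("--allowed-methods", type=json.loads, default=["*"])
parser.add_argument("--allowed-headers", type=json.loads, default=["*"])
parser.add_argument("--ssl", action="store_true")

args = parser.parse_args([
    "--port", "9000",
    "--api-keys", "sk-key1,sk-key2",
    "--allowed-origins", '["https://myapp.example.com"]',
])
```

Unspecified flags keep the defaults from the table, so args.host is still "localhost" here.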

REST API Routes

Method Route Auth Request Body Response
GET /v1/models API key None ModelList with data: List[ModelCard]
POST /v1/chat/completions API key ChatCompletionRequest ChatCompletionResponse or SSE stream
POST /v1/completions API key CompletionRequest CompletionResponse or SSE stream
POST /v1/embeddings API key EmbeddingsRequest EmbeddingsResponse
POST /v1/engines/{model_name}/embeddings API key EmbeddingsRequest EmbeddingsResponse
POST /api/v1/token_check None APITokenCheckRequest APITokenCheckResponse
POST /api/v1/chat/completions None APIChatCompletionRequest ChatCompletionResponse or SSE stream

ChatCompletionRequest Fields

Field Type Default Description
model str (required) Model identifier
messages List[Dict] (required) Conversation messages with role and content
temperature float 0.7 Sampling temperature
top_p float 1.0 Nucleus sampling threshold
top_k int -1 Top-k sampling (-1 to disable)
n int 1 Number of completions to generate
max_tokens int None Maximum tokens to generate
stop str or List[str] None Stop sequence(s)
stream bool False Enable SSE streaming
presence_penalty float 0.0 Presence penalty
frequency_penalty float 0.0 Frequency penalty

ChatCompletionResponse Fields

Field Type Description
id str Unique completion ID (format: chatcmpl-{shortuuid})
object str Always "chat.completion"
created int Unix timestamp
model str Model identifier used
choices List[ChatCompletionResponseChoice] Each has index, message (role, content), finish_reason
usage UsageInfo prompt_tokens, completion_tokens, total_tokens

Request Processing Flow (create_chat_completion)

  1. check_model -- Verify the requested model exists via controller's /list_models
  2. check_requests -- Validate parameter ranges (max_tokens > 0, 0 <= temperature <= 2, 0 <= top_p <= 1, etc.)
  3. get_worker_address -- Obtain a worker address from the controller via /get_worker_address
  4. get_gen_params -- Fetch conversation template from the worker, apply messages to template, construct generation parameters dict
  5. check_length -- Verify prompt + max_tokens fits within the model's context window via worker's /count_token and /model_details
  6. Dispatch -- If stream=true, return StreamingResponse from chat_completion_stream_generator; otherwise, gather n async completions and return ChatCompletionResponse
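Step 2 can be sketched as a standalone function that mirrors the documented parameter bounds. The real check_requests returns a JSONResponse carrying ErrorCode.PARAM_OUT_OF_RANGE; this sketch returns an error message or None:

```python
def check_request_params(req: dict):
    """Validate generation parameters against the documented ranges."""
    if req.get("max_tokens") is not None and req["max_tokens"] <= 0:
        return "max_tokens must be greater than 0"
    if not 0 <= req.get("temperature", 0.7) <= 2:
        return "temperature must be in [0, 2]"
    if not 0 <= req.get("top_p", 1.0) <= 1:
        return "top_p must be in [0, 1]"
    if req.get("n", 1) <= 0:
        return "n must be at least 1"
    return None  # all parameters in range

err = check_request_params({"model": "vicuna-7b-v1.5", "temperature": 3.0})
# err holds a PARAM_OUT_OF_RANGE-style message; a valid request yields None
```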

Error Handling

Error Code Constant Condition
INVALID_MODEL ErrorCode.INVALID_MODEL Requested model not in controller's model list
PARAM_OUT_OF_RANGE ErrorCode.PARAM_OUT_OF_RANGE Parameter validation failure (temperature, top_p, max_tokens, etc.)
CONTEXT_OVERFLOW ErrorCode.CONTEXT_OVERFLOW Prompt tokens exceed model's context length
INTERNAL_ERROR ErrorCode.INTERNAL_ERROR Worker-side error or async task failure
401 Unauthorized HTTP status Invalid or missing API key when keys are configured
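All of these errors share one envelope: the protocol's ErrorResponse carries object="error" plus a message and a numeric code. The sketch below assumes that shape; the code value 40301 is a placeholder, not a real ErrorCode constant:

```python
import json

def make_error(code: int, message: str) -> str:
    """Serialize an error body in the ErrorResponse envelope (a sketch)."""
    return json.dumps({"object": "error", "message": message, "code": code})

body = json.loads(make_error(40301, "model not found in controller's model list"))
```

Clients can therefore distinguish errors from completions by checking the "object" field of the response body.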

Usage Examples

Starting the API Server

# Basic startup
python3 -m fastchat.serve.openai_api_server

# With API key authentication
python3 -m fastchat.serve.openai_api_server \
    --api-keys "sk-key1,sk-key2"

# With custom CORS and SSL
SSL_KEYFILE=/path/to/key.pem SSL_CERTFILE=/path/to/cert.pem \
    python3 -m fastchat.serve.openai_api_server \
    --host 0.0.0.0 \
    --port 443 \
    --ssl \
    --allowed-origins '["https://myapp.example.com"]'

Chat Completion (Non-Streaming)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Hello! What is your name?"}],
    "temperature": 0.7
  }'

Example response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1707307200,
  "model": "vicuna-7b-v1.5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I am Vicuna, a language model trained by researchers from LMSYS."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 22,
    "total_tokens": 34
  }
}
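A client-side sketch (not FastChat code) for pulling the assistant reply and token usage out of the non-streaming response above:

```python
import json

# The non-streaming response from the example, inlined for illustration.
sample = """
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1707307200,
  "model": "vicuna-7b-v1.5",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I am Vicuna, a language model trained by researchers from LMSYS."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 22,
    "total_tokens": 34
  }
}
"""

def parse_chat_completion(raw: str):
    """Extract the first choice's content, finish reason, and usage info."""
    resp = json.loads(raw)
    choice = resp["choices"][0]
    return choice["message"]["content"], choice["finish_reason"], resp["usage"]

content, reason, usage = parse_chat_completion(sample)
```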

Chat Completion (Streaming)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vicuna-7b-v1.5",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "stream": true
  }'

Example streamed response:

data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}

data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}

data: [DONE]
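The streamed chunks above can be folded back into the full assistant message with a small client-side sketch. Real clients read the "data:" lines off the HTTP response; here they are inlined from the example:

```python
import json

chunks = [
    'data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"content":"Why"},"finish_reason":null}]}',
    'data: {"id":"chatcmpl-xyz789","object":"chat.completion.chunk","created":1707307200,"model":"vicuna-7b-v1.5","choices":[{"index":0,"delta":{"content":" did"},"finish_reason":null}]}',
    "data: [DONE]",
]

def accumulate_sse(lines):
    """Concatenate delta contents from SSE chunks, stopping at [DONE]."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):].strip()
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        if "content" in delta:  # the first chunk carries only the role
            parts.append(delta["content"])
    return "".join(parts)

message = accumulate_sse(chunks)
```

The first chunk announces the assistant role with no content, so only the later deltas contribute text.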

Related Pages

Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving