
Principle:Lm_sys_FastChat_OpenAI_Client_Interaction

From Leeroopedia


Field Value
Page Type Principle
Repository lm-sys/FastChat
Domain API Client Design, Chat Completion Protocol, Streaming Consumption
Knowledge Sources Source code analysis of tests/test_openai_api.py, fastchat/protocol/openai_api_protocol.py
Last Updated 2026-02-07 14:00 GMT
Implemented By Implementation:Lm_sys_FastChat_OpenAI_Chat_Completion_Client

Overview

OpenAI Client Interaction is the principle governing how client applications interact with FastChat's OpenAI-compatible API server. Because FastChat faithfully implements the OpenAI REST API specification, clients can use the official OpenAI Python SDK (or any HTTP client) to communicate with FastChat, treating it as a drop-in replacement for the OpenAI service. This principle covers the chat completion message format, streaming consumption via Server-Sent Events, error handling, and token usage tracking.

Description

Drop-In Replacement Concept

The fundamental idea behind OpenAI Client Interaction in FastChat is that any application using the OpenAI API can switch to a self-hosted FastChat backend by changing only two configuration values:

  • base_url -- Point to the FastChat API server instead of https://api.openai.com/v1
  • api_key -- Set to any string (or a configured key if API key authentication is enabled)

No other code changes are required. The request and response schemas, streaming protocol, and error formats are identical. This makes FastChat suitable for:

  • Local development and testing without API costs
  • Air-gapped deployments where cloud API access is not available
  • Privacy-sensitive use cases where data cannot leave the organization
  • Research and experimentation with open-source models
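The two-value switch can be sketched as follows. This is a minimal sketch assuming the OpenAI Python SDK v1+ and a FastChat API server on localhost port 8000; the URL, port, and "EMPTY" key are illustrative assumptions, not values guaranteed by this page.

```python
# Sketch of the drop-in switch between OpenAI and a self-hosted FastChat
# backend. Only base_url and api_key differ; everything else is unchanged.

OPENAI_CONFIG = {
    "base_url": "https://api.openai.com/v1",
    "api_key": "sk-your-real-key",  # real OpenAI key
}

FASTCHAT_CONFIG = {
    "base_url": "http://localhost:8000/v1",  # assumed local FastChat server
    "api_key": "EMPTY",  # any string, unless API-key auth is configured
}

def make_client_kwargs(config):
    """Return the only keyword arguments that differ between backends.

    With the real SDK this becomes:
        from openai import OpenAI
        client = OpenAI(**make_client_kwargs(FASTCHAT_CONFIG))
    """
    return {"base_url": config["base_url"], "api_key": config["api_key"]}
```

All subsequent calls (`client.chat.completions.create(...)`, etc.) are identical for both configurations.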

Chat Completion Format

The chat completion interface uses a message-based format where each message has:

  • role -- One of "system", "user", or "assistant"
  • content -- The text content of the message

Messages are provided as an ordered list representing the conversation history. The system message (optional) sets the assistant's behavior. User messages contain the human input. Assistant messages represent previous model outputs (for multi-turn conversations).

The server applies the model's conversation template to these messages, converting the uniform format into model-specific prompt formatting (e.g., Vicuna's USER: ... ASSISTANT: ... format, or Llama-2's [INST] ... [/INST] format).
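The message structure above can be illustrated with a small helper. `build_messages` is a hypothetical convenience function (not part of FastChat or the SDK) that assembles the ordered list; the server, not the client, handles the model-specific template conversion.

```python
VALID_ROLES = {"system", "user", "assistant"}

def build_messages(system_prompt, turns):
    """Assemble a chat-completion message list (illustrative helper).

    `turns` alternates user/assistant strings and ends with the new user
    input. The server applies the model's conversation template to the
    result, so the client never deals with model-specific prompt formats.
    """
    messages = []
    if system_prompt is not None:
        # Optional system message: sets the assistant's behavior.
        messages.append({"role": "system", "content": system_prompt})
    for i, text in enumerate(turns):
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": text})
    assert all(m["role"] in VALID_ROLES for m in messages)
    return messages
```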

Streaming with SSE

When stream=true is set, the response arrives as a sequence of Server-Sent Events. Each event contains a JSON chunk with the delta field instead of the message field:

  • The first chunk contains delta: {"role": "assistant"} (no content)
  • Subsequent chunks contain delta: {"content": "token text"}
  • The final chunk has a non-null finish_reason ("stop" or "length")
  • The stream ends with data: [DONE]

Clients iterate over these chunks to display tokens as they arrive. The OpenAI Python SDK handles SSE parsing automatically, exposing chunks as iterable objects.
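The chunk sequence above can be sketched with a minimal SSE consumer. This mirrors what the OpenAI SDK does internally when `stream=True`; the parsing logic is a simplified assumption, not the SDK's actual implementation.

```python
import json

def collect_stream(sse_lines):
    """Accumulate assistant text from SSE 'data:' lines until [DONE].

    Each chunk carries a `delta` instead of a full `message`; content
    fragments are concatenated in order, and the final non-null
    finish_reason ("stop" or "length") is captured.
    """
    text_parts, finish_reason = [], None
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # ignore comments / blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel: stream is complete
        chunk = json.loads(payload)
        choice = chunk["choices"][0]
        delta = choice["delta"]
        if "content" in delta:
            text_parts.append(delta["content"])
        if choice.get("finish_reason"):
            finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason
```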

Error Handling

The API returns structured error responses that match the OpenAI error format:

  • Invalid model -- 400 status with message indicating which models are available
  • Parameter out of range -- 400 status with specific parameter validation error
  • Context overflow -- 400 status when the prompt exceeds the model's context length
  • Invalid API key -- 401 status with invalid_api_key error code
  • No available worker -- Internal error when no worker can serve the requested model

Clients should handle these errors gracefully, typically by checking the response status code and parsing the error body.
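One possible triage of these errors is sketched below. The status-to-action mapping follows the error cases listed above; `should_retry` is a hypothetical helper, and exact error codes may vary by server version.

```python
def should_retry(status_code, error_body):
    """Decide whether a failed request is worth retrying (illustrative).

    400-class errors (invalid model, bad parameter, context overflow) are
    client mistakes: fix the request instead of retrying. 401 means the
    API key is wrong and retrying cannot help. 5xx (e.g. no available
    worker) may be transient and is worth retrying with backoff.
    """
    err = error_body.get("error", error_body)
    if status_code == 401:
        raise PermissionError(err.get("message", "invalid API key"))
    if status_code == 400:
        return False
    return status_code >= 500
```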

Token Usage Tracking

Every non-streaming response includes a usage object with:

  • prompt_tokens -- Number of tokens in the input prompt
  • completion_tokens -- Number of tokens generated
  • total_tokens -- Sum of prompt and completion tokens

This enables clients to track token consumption for cost estimation, context window management, and performance monitoring. Note that in streaming mode, usage information is not included in individual chunks (consistent with OpenAI's behavior).
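The usage object can be consumed as sketched below; `check_usage` is an illustrative helper, and the context window size is a caller-supplied assumption about the model, not something the response reports.

```python
def check_usage(usage, context_window):
    """Validate a usage object and return remaining context budget.

    The invariant total = prompt + completion always holds for
    OpenAI-format responses; the remaining budget helps decide whether
    the next turn of a multi-turn conversation will still fit.
    """
    assert usage["total_tokens"] == (
        usage["prompt_tokens"] + usage["completion_tokens"]
    )
    return context_window - usage["total_tokens"]
```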

Usage

Client interaction with FastChat follows the standard OpenAI SDK patterns:

  1. Install the OpenAI Python package: pip install openai
  2. Configure the client to point to the FastChat server
  3. Use the same API calls as for OpenAI (chat completions, completions, embeddings, model listing)

Both synchronous and asynchronous clients are supported. The cURL command-line tool can also be used for testing and simple integrations.

Theoretical Basis

  • API Compatibility as Migration Strategy -- By providing API-level compatibility, FastChat eliminates the switching cost for applications migrating from cloud to self-hosted inference. This follows the industry pattern of "wire-compatible" alternatives (e.g., MinIO for S3, CockroachDB for PostgreSQL).
  • Message-Based Chat Protocol -- The role-tagged message format originated with OpenAI's ChatGPT API and has become the de facto standard for conversational AI interfaces. The format cleanly separates conversation structure from model-specific prompt engineering.
  • Server-Sent Events for Streaming -- SSE provides a simple, HTTP-native streaming mechanism. Unlike WebSockets, SSE requires no connection upgrade, works through HTTP proxies, and supports automatic browser reconnection. The data: [DONE] sentinel follows the convention established by OpenAI.
  • Structured Error Responses -- Returning machine-readable error objects (with code, message, and type fields) enables programmatic error handling in client applications, following REST API best practices.
