Workflow: LMSYS FastChat Distributed Model Serving
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Serving, Distributed_Systems |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
End-to-end process for deploying LLM inference as a distributed service with an OpenAI-compatible REST API, using FastChat's controller-worker architecture.
Description
This workflow covers the deployment of language models using FastChat's three-tier distributed serving architecture. A central Controller manages worker registration, heartbeats, and request routing. One or more Model Workers load models and handle inference (supporting 8+ backends including HuggingFace, vLLM, SGLang, and MLX). An OpenAI-compatible API Server exposes chat completions, text completions, and embeddings endpoints. The system supports streaming responses, multi-model serving, model parallelism across GPUs, and various quantization strategies (8-bit, GPTQ, AWQ, ExLlama). Alternatively, a Gradio Web UI can serve as the frontend.
Usage
Execute this workflow when you need to deploy one or more language models as an API service accessible via the OpenAI Python client or standard HTTP requests. This is the standard approach for making FastChat-trained models (or any HuggingFace-compatible model) available for applications, testing, or as a drop-in replacement for OpenAI APIs.
Execution Steps
Step 1: Environment Setup
Install FastChat with the model_worker and webui extras. For high-throughput serving, optionally install a specialized inference backend such as vLLM or SGLang. The base installation includes FastAPI, Uvicorn, and all required serving dependencies.
Key considerations:
- Base install: pip3 install "fschat[model_worker,webui]"
- For vLLM backend: install vllm separately
- For SGLang backend: install sglang separately
- The openai Python package (>=1.0) is needed on the client side
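The installation steps above can be sketched as the following commands; the SGLang `[all]` extra and the exact backend package names reflect common upstream packaging and should be checked against the current FastChat and backend documentation:

```shell
# Base install with serving and web UI extras
pip3 install "fschat[model_worker,webui]"

# Optional high-throughput backends (install only the one you plan to use)
pip3 install vllm             # enables fastchat.serve.vllm_worker
pip3 install "sglang[all]"    # enables fastchat.serve.sglang_worker

# Client-side OpenAI SDK for talking to the API server
pip3 install "openai>=1.0"
```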
Step 2: Launch Controller
Start the central controller process that manages worker registration and request dispatch. The controller runs as a FastAPI service that maintains a registry of active model workers, monitors heartbeats, and routes incoming requests to available workers using configurable dispatch strategies.
Key considerations:
- Default address: http://localhost:21001
- Dispatch methods: lottery (random weighted) or shortest_queue
- Workers are automatically removed when heartbeats expire (default: 90 seconds)
- The controller must be started before any workers
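A minimal controller launch, matching the defaults above; the `--host` and `--dispatch-method` values shown are illustrative choices, not requirements:

```shell
# Start the controller first; workers register against this address.
# Default listen address is http://localhost:21001.
python3 -m fastchat.serve.controller \
    --host 0.0.0.0 \
    --port 21001 \
    --dispatch-method shortest_queue   # or: lottery
```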
Step 3: Launch Model Workers
Start one or more model worker processes, each loading a model for inference. Workers register themselves with the controller and begin sending periodic heartbeats. Each worker exposes a local FastAPI endpoint for receiving generation requests. Multiple workers can serve the same model for throughput scaling or different models for multi-model deployments.
Key considerations:
- Each worker loads one model specified by --model-path
- Use --num-gpus for model parallelism across multiple GPUs
- Use --load-8bit for 8-bit quantization to reduce memory
- GPTQ, AWQ, and ExLlama quantization are supported via additional flags
- Multiple workers can run on different ports with different models
- The vLLM worker (vllm_worker.py) provides batched, high-throughput serving
- Verify connectivity with: python3 -m fastchat.serve.test_message --model-name MODEL
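A sketch of a two-worker, two-model deployment on separate GPUs and ports; the model paths (`lmsys/vicuna-7b-v1.5`, `lmsys/fastchat-t5-3b-v1.0`) are illustrative assumptions:

```shell
# Worker 1: a 7B model on GPU 0, default worker port 21002
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --port 21002 --worker-address http://localhost:21002

# Worker 2: a second model on GPU 1, different port
# (add --num-gpus 2 for model parallelism, or --load-8bit to halve memory)
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.model_worker \
    --model-path lmsys/fastchat-t5-3b-v1.0 \
    --controller-address http://localhost:21001 \
    --port 21003 --worker-address http://localhost:21003

# Verify the controller can reach a worker
python3 -m fastchat.serve.test_message --model-name vicuna-7b-v1.5
```

For batched, high-throughput serving, substitute `fastchat.serve.vllm_worker` for `fastchat.serve.model_worker` with the same `--model-path` and controller flags.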
Step 4: Launch API Server
Start the OpenAI-compatible REST API server that accepts client requests and proxies them to the controller for worker dispatch. The API server implements the OpenAI Chat Completions, Completions, and Embeddings endpoints, supporting both streaming and non-streaming responses.
Key considerations:
- Default address: http://localhost:8000
- Supports all OpenAI API endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models
- Compatible with the openai Python client library (set api_key="EMPTY", base_url to server address)
- Supports API key authentication via --api-keys flag
- Worker timeout is configurable via FASTCHAT_WORKER_API_TIMEOUT environment variable (default: 100s)
- Alternatively, launch a Gradio web server for a browser-based chat UI
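The API server (or the Gradio alternative) can be launched as follows; the example API keys are placeholders:

```shell
# OpenAI-compatible API server on the default address (http://localhost:8000)
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

# Same, but exposed externally with API-key authentication
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 \
    --api-keys sk-key1 sk-key2

# Or, instead of the API server, a browser-based chat UI
python3 -m fastchat.serve.gradio_web_server
```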
Step 5: Client Interaction
Send requests to the API server using the OpenAI Python client, cURL, or any HTTP client. The request flows from client to API server, which queries the controller for an available worker address, then streams the response back from the selected worker.
Key considerations:
- The API is fully compatible with the openai-python library
- Streaming is supported for real-time token generation
- Multiple models can be queried by specifying different model names
- The /v1/models endpoint lists all available models
- LangChain integration is supported for agent-based applications
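The request flow above can be exercised with plain cURL; the model name `vicuna-7b-v1.5` is an illustrative assumption and should match whatever `--model-path` the worker loaded:

```shell
# Discover which models are registered with the controller
curl http://localhost:8000/v1/models

# Non-streaming chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Hello! Who are you?"}]
      }'
```

The same requests work from the openai Python client by constructing it with `api_key="EMPTY"` and `base_url="http://localhost:8000/v1"`.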