Workflow: LMSYS FastChat Distributed Model Serving
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Serving, Distributed_Systems |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
End-to-end process for deploying LLM inference as a distributed service with an OpenAI-compatible REST API, using FastChat's controller-worker architecture.
Description
This workflow covers the deployment of language models using FastChat's three-tier distributed serving architecture. A central Controller manages worker registration, heartbeats, and request routing. One or more Model Workers load models and handle inference (supporting 8+ backends including HuggingFace, vLLM, SGLang, and MLX). An OpenAI-compatible API Server exposes chat completions, text completions, and embeddings endpoints. The system supports streaming responses, multi-model serving, model parallelism across GPUs, and various quantization strategies (8-bit, GPTQ, AWQ, ExLlama). Alternatively, a Gradio Web UI can serve as the frontend.
Usage
Execute this workflow when you need to deploy one or more language models as an API service accessible via the OpenAI Python client or standard HTTP requests. This is the standard approach for making FastChat-trained models (or any HuggingFace-compatible model) available for applications, testing, or as a drop-in replacement for OpenAI APIs.
Execution Steps
Step 1: Environment Setup
Install FastChat with the model_worker and webui extras. For high-throughput serving, optionally install a specialized inference backend such as vLLM or SGLang. The base installation includes FastAPI, Uvicorn, and all required serving dependencies.
Key considerations:
- Base install: pip3 install "fschat[model_worker,webui]"
- For vLLM backend: install vllm separately
- For SGLang backend: install sglang separately
- The openai Python package (>=1.0) is needed on the client side
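The installation steps above can be sketched as the following commands; the SGLang `[all]` extra and the exact backend package names reflect common upstream packaging and should be checked against the current FastChat and backend documentation:

```shell
# Base install with serving and web UI extras
pip3 install "fschat[model_worker,webui]"

# Optional high-throughput backends (install only the one you plan to use)
pip3 install vllm             # enables fastchat.serve.vllm_worker
pip3 install "sglang[all]"    # enables fastchat.serve.sglang_worker

# Client-side OpenAI SDK for talking to the API server
pip3 install "openai>=1.0"
```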
Step 2: Launch Controller
Start the central controller process that manages worker registration and request dispatch. The controller runs as a FastAPI service that maintains a registry of active model workers, monitors heartbeats, and routes incoming requests to available workers using configurable dispatch strategies.
Key considerations:
- Default address: http://localhost:21001
- Dispatch methods: lottery (random weighted) or shortest_queue
- Workers are automatically removed when heartbeats expire (default: 90 seconds)
- The controller must be started before any workers
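A minimal controller launch, matching the defaults above; the `--host` and `--dispatch-method` values shown are illustrative choices, not requirements:

```shell
# Start the controller first; workers register against this address.
# Default listen address is http://localhost:21001.
python3 -m fastchat.serve.controller \
    --host 0.0.0.0 \
    --port 21001 \
    --dispatch-method shortest_queue   # or: lottery
```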
Step 3: Launch Model Workers
Start one or more model worker processes, each loading a model for inference. Workers register themselves with the controller and begin sending periodic heartbeats. Each worker exposes a local FastAPI endpoint for receiving generation requests. Multiple workers can serve the same model for throughput scaling or different models for multi-model deployments.
Key considerations:
- Each worker loads one model specified by --model-path
- Use --num-gpus for model parallelism across multiple GPUs
- Use --load-8bit for 8-bit quantization to reduce memory
- GPTQ, AWQ, and ExLlama quantization are supported via additional flags
- Multiple workers can run on different ports with different models
- The vLLM worker (vllm_worker.py) provides batched, high-throughput serving
- Verify connectivity with: python3 -m fastchat.serve.test_message --model-name MODEL
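A sketch of a two-worker, two-model deployment on separate GPUs and ports; the model paths (`lmsys/vicuna-7b-v1.5`, `lmsys/fastchat-t5-3b-v1.0`) are illustrative assumptions:

```shell
# Worker 1: a 7B model on GPU 0, default worker port 21002
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --port 21002 --worker-address http://localhost:21002

# Worker 2: a second model on GPU 1, different port
# (add --num-gpus 2 for model parallelism, or --load-8bit to halve memory)
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.model_worker \
    --model-path lmsys/fastchat-t5-3b-v1.0 \
    --controller-address http://localhost:21001 \
    --port 21003 --worker-address http://localhost:21003

# Verify the controller can reach a worker
python3 -m fastchat.serve.test_message --model-name vicuna-7b-v1.5
```

For batched, high-throughput serving, substitute `fastchat.serve.vllm_worker` for `fastchat.serve.model_worker` with the same `--model-path` and controller flags.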
Step 4: Launch API Server
Start the OpenAI-compatible REST API server that accepts client requests and proxies them to the controller for worker dispatch. The API server implements the OpenAI Chat Completions, Completions, and Embeddings endpoints, supporting both streaming and non-streaming responses.
Key considerations:
- Default address: http://localhost:8000
- Supports all OpenAI API endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models
- Compatible with the openai Python client library (set api_key="EMPTY", base_url to server address)
- Supports API key authentication via --api-keys flag
- Worker timeout is configurable via FASTCHAT_WORKER_API_TIMEOUT environment variable (default: 100s)
- Alternatively, launch a Gradio web server for a browser-based chat UI
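The API server (or the Gradio alternative) can be launched as follows; the example API keys are placeholders:

```shell
# OpenAI-compatible API server on the default address (http://localhost:8000)
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000

# Same, but exposed externally with API-key authentication
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 \
    --api-keys sk-key1 sk-key2

# Or, instead of the API server, a browser-based chat UI
python3 -m fastchat.serve.gradio_web_server
```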
Step 5: Client Interaction
Send requests to the API server using the OpenAI Python client, cURL, or any HTTP client. The request flows from client to API server, which queries the controller for an available worker address, then streams the response back from the selected worker.
Key considerations:
- The API is fully compatible with the openai-python library
- Streaming is supported for real-time token generation
- Multiple models can be queried by specifying different model names
- The /v1/models endpoint lists all available models
- LangChain integration is supported for agent-based applications
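The request flow above can be exercised with plain cURL; the model name `vicuna-7b-v1.5` is an illustrative assumption and should match whatever `--model-path` the worker loaded:

```shell
# Discover which models are registered with the controller
curl http://localhost:8000/v1/models

# Non-streaming chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Hello! Who are you?"}]
      }'
```

The same requests work from the openai Python client by constructing it with `api_key="EMPTY"` and `base_url="http://localhost:8000/v1"`.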