Workflow:LMSYS FastChat Distributed Model Serving

From Leeroopedia


Knowledge Sources
Domains: LLMs, Model_Serving, Distributed_Systems
Last Updated: 2026-02-07 04:00 GMT

Overview

An end-to-end process for deploying LLM inference as a distributed service with an OpenAI-compatible REST API, using FastChat's controller-worker architecture.

Description

This workflow covers the deployment of language models using FastChat's three-tier distributed serving architecture. A central Controller manages worker registration, heartbeats, and request routing. One or more Model Workers load models and handle inference (supporting 8+ backends including HuggingFace, vLLM, SGLang, and MLX). An OpenAI-compatible API Server exposes chat completions, text completions, and embeddings endpoints. The system supports streaming responses, multi-model serving, model parallelism across GPUs, and various quantization strategies (8-bit, GPTQ, AWQ, ExLlama). Alternatively, a Gradio Web UI can serve as the frontend.

Usage

Execute this workflow when you need to deploy one or more language models as an API service accessible via the OpenAI Python client or standard HTTP requests. This is the standard approach for making FastChat-trained models (or any HuggingFace-compatible model) available for applications, testing, or as a drop-in replacement for OpenAI APIs.

Execution Steps

Step 1: Environment Setup

Install FastChat with the model_worker and webui extras. For high-throughput serving, optionally install a specialized inference backend such as vLLM or SGLang. The base installation includes FastAPI, Uvicorn, and all required serving dependencies.

Key considerations:

  • Base install: pip3 install "fschat[model_worker,webui]"
  • For vLLM backend: install vllm separately
  • For SGLang backend: install sglang separately
  • The openai Python package (>=1.0) is needed on the client side
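The installation above can be run as follows. The backend packages are optional and only needed for the corresponding worker type; exact package extras (e.g. for SGLang) may differ by version, so check each project's install instructions:

```shell
# Base FastChat install with model worker and web UI dependencies
pip3 install "fschat[model_worker,webui]"

# Optional: specialized high-throughput backends (install only what you need)
pip3 install vllm       # for the vLLM worker
pip3 install sglang     # for the SGLang worker; extras may be required

# Client side: the OpenAI Python SDK (>=1.0)
pip3 install "openai>=1.0"
```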

Step 2: Launch Controller

Start the central controller process that manages worker registration and request dispatch. The controller runs as a FastAPI service that maintains a registry of active model workers, monitors heartbeats, and routes incoming requests to available workers using configurable dispatch strategies.

Key considerations:

  • Default address: http://localhost:21001
  • Dispatch methods: lottery (random weighted) or shortest_queue
  • Workers are automatically removed when heartbeats expire (default: 90 seconds)
  • The controller must be started before any workers
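A typical controller launch, reflecting the defaults above (the host/port values shown are the documented defaults, not requirements):

```shell
# Start the central controller first; workers register against it
python3 -m fastchat.serve.controller \
    --host 0.0.0.0 \
    --port 21001 \
    --dispatch-method shortest_queue   # choices: lottery, shortest_queue
```

The controller process stays in the foreground; run it under a process manager (or in a separate terminal) before launching any workers.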

Step 3: Launch Model Workers

Start one or more model worker processes, each loading a model for inference. Workers register themselves with the controller and begin sending periodic heartbeats. Each worker exposes a local FastAPI endpoint for receiving generation requests. Multiple workers can serve the same model for throughput scaling or different models for multi-model deployments.

Key considerations:

  • Each worker loads one model specified by --model-path
  • Use --num-gpus for model parallelism across multiple GPUs
  • Use --load-8bit for 8-bit quantization to reduce memory
  • GPTQ, AWQ, and ExLlama quantization are supported via additional flags
  • Multiple workers can run on different ports with different models
  • The vLLM worker (vllm_worker.py) provides batched, high-throughput serving
  • Verify connectivity with: python3 -m fastchat.serve.test_message --model-name MODEL
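The worker launches above can be sketched as follows. The model paths are illustrative examples; substitute any HuggingFace-compatible model you intend to serve:

```shell
# Worker 1: default HuggingFace backend on port 21002
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001 \
    --host 0.0.0.0 --port 21002 \
    --worker-address http://localhost:21002

# Worker 2: a second model on a different port, with 8-bit quantization
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/fastchat-t5-3b-v1.0 \
    --controller-address http://localhost:21001 \
    --host 0.0.0.0 --port 21003 \
    --worker-address http://localhost:21003 \
    --load-8bit

# Sanity check: route a test message through the controller
python3 -m fastchat.serve.test_message --model-name vicuna-7b-v1.5
```

For high-throughput serving, the same pattern applies with `fastchat.serve.vllm_worker` in place of `fastchat.serve.model_worker`.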

Step 4: Launch API Server

Start the OpenAI-compatible REST API server that accepts client requests and proxies them to the controller for worker dispatch. The API server implements the OpenAI Chat Completions, Completions, and Embeddings endpoints, supporting both streaming and non-streaming responses.

Key considerations:

  • Default address: http://localhost:8000
  • Supports all OpenAI API endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models
  • Compatible with the openai Python client library (set api_key="EMPTY", base_url to server address)
  • Supports API key authentication via --api-keys flag
  • Worker timeout is configurable via FASTCHAT_WORKER_API_TIMEOUT environment variable (default: 100s)
  • Alternatively, launch a Gradio web server for browser-based chat UI
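A minimal launch of the API server, assuming the controller from Step 2 is running at its default address:

```shell
# OpenAI-compatible REST API server in front of the controller
python3 -m fastchat.serve.openai_api_server \
    --controller-address http://localhost:21001 \
    --host 0.0.0.0 --port 8000

# Optional: require API keys on incoming requests
# (add --api-keys with one or more accepted keys)
#   --api-keys sk-mykey1 sk-mykey2

# Alternative frontend: browser-based chat UI instead of the REST API
# python3 -m fastchat.serve.gradio_web_server
```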

Step 5: Client Interaction

Send requests to the API server using the OpenAI Python client, cURL, or any HTTP client. The request flows from client to API server, which queries the controller for an available worker address, then streams the response back from the selected worker.

Key considerations:

  • The API is fully compatible with the openai-python library
  • Streaming is supported for real-time token generation
  • Multiple models can be queried by specifying different model names
  • The /v1/models endpoint lists all available models
  • LangChain integration is supported for agent-based applications
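As a sketch of the client side, the endpoints can be exercised with curl; the model name must match one registered by a worker (the name below assumes the example worker from Step 3):

```shell
# List all models currently registered with the controller
curl http://localhost:8000/v1/models

# Non-streaming chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "vicuna-7b-v1.5",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```

The same requests work through the openai Python client by pointing `base_url` at `http://localhost:8000/v1` and passing any placeholder `api_key` (e.g. "EMPTY") when authentication is disabled.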

Execution Diagram

GitHub URL

Workflow Repository