Principle:Lm_sys_FastChat_Worker_Dispatch_Control
| Field | Value |
|---|---|
| Page Type | Principle |
| Repository | lm-sys/FastChat |
| Domain | Distributed Systems, Load Balancing, Service Orchestration |
| Knowledge Sources | Source code analysis of fastchat/serve/controller.py |
| Last Updated | 2026-02-07 14:00 GMT |
| Implemented By | Implementation:Lm_sys_FastChat_Controller_Dispatch |
Overview
Worker Dispatch Control is the principle governing how a centralized controller manages the routing of inference requests to distributed model workers in a large-scale language model serving system. The controller acts as a single coordination point that maintains a live registry of available workers, monitors their health through periodic heartbeats, and selects the optimal worker for each incoming request according to a configurable dispatch strategy. This principle is foundational to the FastChat distributed serving architecture, enabling horizontal scaling of model inference across multiple GPU-equipped machines.
Description
In a distributed model serving environment, multiple model worker processes may be running the same or different models across a cluster of machines. The core challenge is to efficiently route each client request to an appropriate worker that (a) hosts the requested model and (b) has sufficient capacity to handle the request promptly. Worker Dispatch Control addresses this through four key sub-principles:
Centralized Worker Registry
All model workers register with a single controller upon startup. The controller maintains an in-memory dictionary mapping worker addresses to their metadata, including:
- Model names -- the list of model identifiers each worker can serve
- Speed -- a relative throughput indicator for the worker
- Queue length -- the current number of pending requests
- Heartbeat status -- whether the worker is alive and responsive
- Multimodal capability -- whether the worker supports vision/multimodal models
Registration is idempotent: a worker that re-registers simply updates its entry. Workers that become unresponsive are automatically pruned from the registry.
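The registry described above can be sketched as a small in-memory structure. This is an illustrative sketch, not FastChat's exact code: the names `WorkerInfo`, `Registry`, and `register_worker` are assumptions chosen to mirror the metadata fields listed here.

```python
import time
from dataclasses import dataclass

@dataclass
class WorkerInfo:
    # Metadata tracked per worker (hypothetical field names)
    model_names: list      # model identifiers this worker can serve
    speed: float           # relative throughput indicator
    queue_length: int      # pending requests, refreshed by heartbeats
    last_heartbeat: float  # timestamp of the most recent heartbeat
    multimodal: bool       # vision/multimodal capability

class Registry:
    def __init__(self):
        # In-memory dictionary: worker address -> WorkerInfo
        self.workers = {}

    def register_worker(self, addr, model_names, speed, multimodal=False):
        # Idempotent: re-registration simply overwrites the existing entry.
        self.workers[addr] = WorkerInfo(
            model_names=model_names,
            speed=speed,
            queue_length=0,
            last_heartbeat=time.time(),
            multimodal=multimodal,
        )
```

Because registration overwrites rather than appends, a worker that restarts or reconnects after a network blip converges to a single, current entry.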
Dispatch Strategies
The controller supports two dispatch methods for selecting a worker to handle a given request:
- Lottery (weighted random) -- Workers are selected randomly with probability proportional to their reported speed. Faster workers receive more requests on average, but load distribution is stochastic. This strategy is simple and avoids the overhead of queue-length tracking but may lead to temporary imbalances.
- Shortest Queue -- The controller selects the worker with the smallest ratio of queue length to speed (queue_length / speed). After dispatching, the controller optimistically increments the selected worker's queue length by one. This strategy provides more deterministic load balancing and is the default in FastChat.
The dispatch method is specified at controller startup and applies uniformly to all requests.
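The two strategies can be sketched as follows. This is a simplified illustration under the assumption that `workers` maps each address to a `(speed, queue_length)` pair; the function names are not FastChat's.

```python
import random

def lottery_dispatch(workers):
    # Weighted random: selection probability proportional to reported speed.
    addrs = list(workers)
    speeds = [workers[a][0] for a in addrs]
    return random.choices(addrs, weights=speeds, k=1)[0]

def shortest_queue_dispatch(workers):
    # Pick the worker minimizing queue_length / speed, then optimistically
    # bump its queue length before the worker confirms receipt.
    addr = min(workers, key=lambda a: workers[a][1] / workers[a][0])
    speed, qlen = workers[addr]
    workers[addr] = (speed, qlen + 1)
    return addr
```

Note the trade-off visible in the sketch: lottery dispatch is stateless between requests, while shortest-queue dispatch mutates its view of worker load on every decision, which is what makes the optimistic increment necessary.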
Worker Health Monitoring via Heartbeats
Workers periodically send heartbeat signals to the controller. Each heartbeat includes the worker's current queue length, which the controller uses to update its routing decisions. The heartbeat mechanism serves two purposes:
- Liveness detection -- If a worker fails to send a heartbeat within the configured expiration window (CONTROLLER_HEART_BEAT_EXPIRATION), the controller removes it from the registry. A background thread periodically scans for and removes stale workers.
- Load tracking -- The queue length reported in each heartbeat gives the controller a near-real-time view of worker utilization, which is critical for the shortest-queue dispatch strategy.
If a heartbeat arrives from a worker not in the registry (e.g., after the worker was pruned), the controller signals the worker to re-register.
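The heartbeat flow above can be sketched as two small functions: one handling an incoming heartbeat (including the re-register signal for unknown workers), and one implementing the background pruning scan. The registry shape `{addr: (last_heartbeat, queue_length)}` and the 90-second window are assumptions for this example; the real constant is CONTROLLER_HEART_BEAT_EXPIRATION.

```python
import time

HEARTBEAT_EXPIRATION = 90  # seconds; illustrative stand-in for the real constant

def receive_heartbeat(registry, addr, queue_length, now=None):
    # Returns True if accepted, False to signal the worker to re-register.
    now = time.time() if now is None else now
    if addr not in registry:
        return False  # worker was pruned or never registered
    registry[addr] = (now, queue_length)  # refresh liveness and load view
    return True

def remove_stale_workers(registry, now=None):
    # Background scan: drop workers whose last heartbeat has expired.
    now = time.time() if now is None else now
    stale = [addr for addr, (last, _) in registry.items()
             if now - last > HEARTBEAT_EXPIRATION]
    for addr in stale:
        del registry[addr]
    return stale
```

A pruned worker that sends a late heartbeat gets a re-register signal rather than silently re-entering the registry, which keeps registration the single path by which worker metadata is established.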
Worker Deregistration and Recovery
Workers can be explicitly removed from the registry, or they can be implicitly removed when their heartbeats expire. The controller also supports a bulk refresh operation (refresh_all_workers) that re-probes all known workers and removes any that do not respond. This is useful for recovering from transient network issues or for administrative maintenance.
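A bulk refresh in the spirit of refresh_all_workers can be sketched as re-probing every known worker and dropping non-responders. The `probe` callable here is a stand-in for an HTTP status check against the worker.

```python
def refresh_all_workers(registry, probe):
    # Re-probe each known worker; remove any that fail to respond.
    # `probe(addr)` is an assumed callable returning True on success.
    removed = []
    for addr in list(registry):  # copy keys: we mutate while iterating
        if not probe(addr):
            del registry[addr]
            removed.append(addr)
    return removed
```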
Hierarchical Management
The controller can itself act as a worker by exposing worker-compatible API endpoints. This enables hierarchical topologies where a top-level controller aggregates sub-controllers, each managing its own set of workers. This pattern is useful for connecting isolated sub-networks or geographic regions.
Usage
Worker Dispatch Control is used whenever FastChat operates in distributed mode, which is the recommended deployment for production use. The typical deployment involves:
- Start a controller process on a known host and port
- Start one or more model worker processes, each pointing to the controller's address
- Workers automatically register and begin sending heartbeats
- The OpenAI-compatible API server or Gradio web server queries the controller to obtain a worker address for each inference request
This principle applies to all model types supported by FastChat (language models, multimodal models, embedding models) and all inference backends (HuggingFace, vLLM, SGLang, MLX).
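The deployment steps above map to FastChat's module entry points roughly as follows; host, port, and model-path values are placeholders to adapt to your environment.

```shell
# 1. Start the controller on a known host and port
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001

# 2. Start one or more model workers, each pointing at the controller
python3 -m fastchat.serve.model_worker \
    --model-path lmsys/vicuna-7b-v1.5 \
    --controller-address http://localhost:21001

# 3. Start the OpenAI-compatible API server, which queries the
#    controller for a worker address on each request
python3 -m fastchat.serve.openai_api_server \
    --controller-address http://localhost:21001 --port 8000
```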
Theoretical Basis
Worker Dispatch Control draws on several established distributed systems concepts:
- Service Discovery and Registration -- The controller implements a service registry pattern, analogous to systems like Consul, etcd, or ZooKeeper, but specialized for model serving. Workers self-register and the controller maintains an eventually-consistent view of available workers.
- Load Balancing Algorithms -- The lottery dispatch method is a variant of weighted random load balancing, while the shortest-queue method is a variant of least-connections load balancing. Both are well-studied approaches in distributed systems literature with different trade-offs between simplicity, fairness, and responsiveness.
- Failure Detection via Heartbeats -- The heartbeat-based liveness detection follows the phi-accrual failure detector concept in simplified form. A fixed expiration window is used rather than adaptive thresholds, trading precision for implementation simplicity.
- Optimistic Concurrency Control -- In shortest-queue dispatch, the controller increments the queue length optimistically (before the worker confirms receipt) to reduce the likelihood of overloading a single worker with concurrent requests between heartbeat updates.
Related Pages
- Implementation:Lm_sys_FastChat_Controller_Dispatch -- API documentation for the Controller class that implements this principle
- Principle:Lm_sys_FastChat_Model_Worker_Inference -- The model worker inference principle, covering worker-side behavior
- Principle:Lm_sys_FastChat_OpenAI_Compatible_API_Serving -- The API server that relies on the controller for request routing