
Principle:Lm sys FastChat Model Worker Inference

From Leeroopedia


Page Type: Principle
Repository: lm-sys/FastChat
Domain: Machine Learning Inference, Distributed Systems, Streaming Generation
Knowledge Sources: Source code analysis of fastchat/serve/model_worker.py, fastchat/serve/base_model_worker.py, fastchat/serve/inference.py
Last Updated: 2026-02-07 14:00 GMT
Implemented By: Implementation:Lm_sys_FastChat_ModelWorker_Load_And_Generate

Overview

Model Worker Inference is the principle governing how individual model worker processes load language models, register with a central controller, maintain liveness via heartbeats, and execute streaming token generation in the FastChat distributed serving system. Each model worker is a self-contained inference server that hosts a specific model and responds to generation requests forwarded by the controller or API server. This principle covers the entire lifecycle of a worker: model loading, controller registration, heartbeat maintenance, request handling, streaming output, and resource cleanup.

Description

Worker-Controller Architecture with Heartbeat

Each model worker operates as an independent FastAPI HTTP server. Upon initialization, the worker:

  1. Loads the specified model into GPU (or CPU/other device) memory
  2. Registers itself with the central controller by sending its address, model names, and status
  3. Starts a background heartbeat thread that periodically sends keep-alive signals to the controller

The heartbeat serves a dual purpose: it confirms the worker is still alive, and it reports the worker's current queue length so the controller can make informed dispatch decisions. If the controller responds that the worker is unknown (e.g., after a controller restart), the worker automatically re-registers.
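The register-then-heartbeat pattern can be sketched as follows. This is an illustrative class, not FastChat's actual implementation; the endpoint paths and payload fields mirror the controller protocol described above, and `send` stands in for the HTTP POST to the controller.

```python
import threading
import time


class HeartbeatingWorker:
    """Sketch of the worker-side heartbeat protocol (illustrative).

    `send` abstracts the HTTP POST to the controller: a callable taking
    (path, payload) and returning the controller's JSON response as a dict.
    """

    def __init__(self, send, interval=45):
        self.send = send          # callable(path, payload) -> dict
        self.interval = interval  # seconds between heartbeats
        self.queue_length = 0     # reported so the controller can load-balance

    def register(self):
        # Announce our address, served model names, and current status.
        self.send("/register_worker", {
            "worker_name": "http://localhost:21002",
            "check_heart_beat": True,
            "worker_status": {"model_names": ["vicuna-7b"],
                              "queue_length": self.queue_length},
        })

    def heartbeat_once(self):
        # Confirm liveness and report load; re-register if the controller
        # no longer knows us (e.g. after a controller restart).
        resp = self.send("/receive_heart_beat", {
            "worker_name": "http://localhost:21002",
            "queue_length": self.queue_length,
        })
        if not resp.get("exist", False):
            self.register()

    def start(self):
        def loop():
            while True:
                time.sleep(self.interval)
                self.heartbeat_once()
        threading.Thread(target=loop, daemon=True).start()
```

The daemon thread keeps heartbeating in the background for the life of the process, so a forgotten worker silently rejoins the pool on the next beat.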

The worker uses an asyncio semaphore to limit concurrency. The limit_worker_concurrency parameter controls how many requests can be processed simultaneously, preventing GPU out-of-memory errors from excessive parallel inference.

Model Loading with Adapter Pattern

FastChat uses a model adapter pattern to support a wide range of model architectures through a unified interface. The load_model function from fastchat.model.model_adapter handles:

  • Automatic model detection -- Identifies the model type from the model path and selects the appropriate loading strategy
  • Quantization support -- Supports 8-bit loading, GPTQ, AWQ, ExLlama, and xFasterTransformer quantization configurations
  • Multi-GPU distribution -- Distributes model layers across multiple GPUs when num_gpus > 1
  • Device flexibility -- Supports CUDA, CPU, MPS (Apple Silicon), XPU (Intel), and NPU (Ascend) devices

After loading, the worker obtains a generate_stream_func via get_generate_stream_function, which returns the appropriate streaming generation function for the model's backend (HuggingFace Transformers, vLLM, SGLang, MLX, etc.).
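The adapter dispatch idea can be illustrated with a toy registry. These classes are not FastChat's real adapters; they only show the first-match-wins pattern where specific adapters precede a catch-all base adapter.

```python
class BaseAdapter:
    """Fallback adapter: matches any model path."""

    def match(self, model_path: str) -> bool:
        return True

    def load_model(self, model_path: str):
        return f"generic model from {model_path}"


class VicunaAdapter(BaseAdapter):
    """Toy specific adapter selected by a substring of the model path."""

    def match(self, model_path: str) -> bool:
        return "vicuna" in model_path.lower()

    def load_model(self, model_path: str):
        return f"vicuna model from {model_path}"


# Order matters: specific adapters first, the catch-all last.
ADAPTERS = [VicunaAdapter(), BaseAdapter()]


def get_adapter(model_path: str) -> BaseAdapter:
    for adapter in ADAPTERS:
        if adapter.match(model_path):
            return adapter
    raise ValueError(f"no adapter for {model_path}")
```

Because the base adapter matches everything, new architectures only need a new subclass registered ahead of it; callers keep the single `get_adapter(...).load_model(...)` interface.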

Streaming Token Generation with Logits Processing

The core inference logic produces tokens one at a time (autoregressive generation) and periodically yields partial results to enable streaming output. The generation process involves:

Logits Processing Pipeline: Before sampling each token, the raw logits from the model pass through a configurable chain of processors:

  • TemperatureLogitsWarper -- Scales logits by 1/temperature. Lower temperatures make the distribution sharper (more deterministic); higher temperatures make it flatter (more random). Skipped when temperature is 0 (greedy) or 1.0 (no-op).
  • RepetitionPenaltyLogitsProcessor -- Penalizes tokens that have already appeared in the output, reducing repetition. Only applied when repetition_penalty > 1.0.
  • TopPLogitsWarper (nucleus sampling) -- Keeps only the smallest set of tokens whose cumulative probability exceeds top_p, then renormalizes. Applied when 0 < top_p < 1.0.
  • TopKLogitsWarper -- Keeps only the top_k highest-probability tokens. Applied when top_k > 0.

Sampling: After logits processing, token selection is either greedy (when temperature < 1e-5 or top_p < 1e-8) via torch.topk, or stochastic via torch.multinomial sampling from the softmax distribution.
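The warper chain and the greedy/stochastic split can be sketched in NumPy. FastChat itself uses HuggingFace's `LogitsProcessorList` over torch tensors; the function names here are illustrative, and the repetition penalty is omitted for brevity.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def process_logits(logits, temperature=1.0, top_p=1.0, top_k=0):
    # Order follows the pipeline above: temperature, then top-p, then top-k.
    logits = logits.astype(float).copy()
    if temperature >= 1e-5 and temperature != 1.0:
        logits = logits / temperature          # sharpen (<1) or flatten (>1)
    if 0.0 < top_p < 1.0:
        order = np.argsort(logits)[::-1]       # descending by probability
        cum = np.cumsum(softmax(logits[order]))
        remove = np.zeros(len(logits), dtype=bool)
        # Keep the smallest prefix whose cumulative probability exceeds top_p.
        remove[1:] = cum[:-1] >= top_p
        logits[order[remove]] = -np.inf
    if top_k > 0:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf         # drop everything below top-k
    return logits

def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    # Greedy when effectively deterministic (same 1e-5 / 1e-8 thresholds
    # described above), multinomial sampling otherwise.
    if temperature < 1e-5 or top_p < 1e-8:
        return int(np.argmax(logits))
    probs = softmax(logits)
    rng = rng or np.random.default_rng(0)
    return int(rng.choice(len(probs), p=probs))
```

Masking with `-inf` mirrors how the HuggingFace warpers exclude tokens: the subsequent softmax assigns them exactly zero probability.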

Streaming Output: The generator yields partial results at intervals controlled by stream_interval (default: every 2 tokens). Each yield includes:

  • The decoded text so far
  • Optional log probabilities for each token
  • Token usage statistics (prompt_tokens, completion_tokens, total_tokens)
  • A finish reason (None while generating, "stop" on stop token/string, "length" on max tokens)

Stop Conditions: Generation stops when:

  • A stop token ID is encountered (including the model's EOS token)
  • A stop string is found in the decoded output
  • The maximum number of new tokens is reached
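The streaming loop's contract — periodic partial yields, then a final yield carrying the finish reason — can be sketched as a generator. This is illustrative: FastChat's real `generate_stream` runs the model forward pass inside the loop, whereas `step_token` here is a stand-in callable returning the next token string.

```python
def generate_stream(step_token, prompt_tokens, max_new_tokens=16,
                    stop_str=None, stop_token_ids=(), stream_interval=2):
    out, text, finish_reason = [], "", None
    for i in range(max_new_tokens):
        tok = step_token(i)
        if tok in stop_token_ids:            # stop token (e.g. EOS) reached
            finish_reason = "stop"
            break
        out.append(tok)
        text = "".join(out)
        if stop_str and stop_str in text:    # stop string found in output
            text = text[: text.index(stop_str)]   # trim it from the result
            finish_reason = "stop"
            break
        if (i + 1) % stream_interval == 0:   # periodic partial yield
            yield {"text": text, "finish_reason": None,
                   "usage": {"prompt_tokens": prompt_tokens,
                             "completion_tokens": len(out),
                             "total_tokens": prompt_tokens + len(out)}}
    else:
        finish_reason = "length"             # exhausted max_new_tokens
    # Final yield always reports the finish reason.
    yield {"text": text, "finish_reason": finish_reason,
           "usage": {"prompt_tokens": prompt_tokens,
                     "completion_tokens": len(out),
                     "total_tokens": prompt_tokens + len(out)}}
```

Consumers simply iterate: every intermediate chunk has `finish_reason=None`, and iteration ends with exactly one chunk whose finish reason is `"stop"` or `"length"`.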

Multiple Backend Support

The model worker is designed as a base class (BaseModelWorker) with concrete implementations for different inference backends. The standard ModelWorker uses HuggingFace Transformers directly, but the same base class is extended for:

  • vLLM -- High-throughput serving with PagedAttention
  • SGLang -- Optimized serving with RadixAttention
  • MLX -- Apple Silicon optimized inference
  • LightLLM -- Lightweight inference engine

Each backend provides its own generate_stream_func that conforms to the same interface, yielding dictionaries with "text", "logprobs", "usage", and "finish_reason" keys.
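The shared output schema can be demonstrated with a toy backend. These classes only show the interface shape; FastChat's real `BaseModelWorker` additionally handles registration, heartbeats, and request queuing.

```python
class BaseWorkerSketch:
    """Toy shape of the backend interface (illustrative)."""

    def generate_stream(self, params):
        raise NotImplementedError


class EchoWorker(BaseWorkerSketch):
    """Stand-in 'backend' that streams the prompt back one character per
    step, conforming to the shared yielded-dict schema."""

    def generate_stream(self, params):
        prompt = params["prompt"]
        text = ""
        for i, ch in enumerate(prompt):
            text += ch
            yield {
                "text": text,
                "logprobs": None,
                "usage": {"prompt_tokens": len(prompt),
                          "completion_tokens": i + 1,
                          "total_tokens": len(prompt) + i + 1},
                "finish_reason": "stop" if i == len(prompt) - 1 else None,
            }
```

Because every backend yields the same keys, the worker's HTTP layer can serialize chunks without knowing which inference engine produced them.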

Usage

Model Worker Inference is used in every FastChat deployment that serves models. The standard three-process deployment is:

  1. Start the controller: python3 -m fastchat.serve.controller
  2. Start one or more model workers: python3 -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.5
  3. Start the API server: python3 -m fastchat.serve.openai_api_server

Multiple workers can serve the same model for horizontal scaling, or different models for multi-model serving. Each worker process is independent and can run on a different machine.
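Once all three processes are up, clients talk to the API server with OpenAI-compatible requests. A hedged sketch of a chat-completion request body (port 8000 is FastChat's documented default for the API server; the model name must match one served by a registered worker):

```python
import json

payload = {
    "model": "vicuna-7b-v1.5",    # worker registered from lmsys/vicuna-7b-v1.5
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "stream": True,               # receive partial tokens as they are generated
}
body = json.dumps(payload)
# Sent with any HTTP client, e.g.:
# requests.post("http://localhost:8000/v1/chat/completions",
#               data=body, stream=True)
```

The API server looks up a worker serving the requested model via the controller and proxies the streamed chunks back to the client.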

Theoretical Basis

  • Autoregressive Generation -- Language models generate text one token at a time, conditioning each new token on all previous tokens. This is the standard decoding approach for transformer-based causal language models.
  • KV-Cache Optimization -- The past_key_values mechanism avoids recomputing attention over previously generated tokens. Each decoding step only processes the single new token, dramatically reducing computation from O(n^2) to O(n) per step.
  • Nucleus Sampling (Top-p) -- Introduced by Holtzman et al. (2020), this sampling strategy dynamically adjusts the vocabulary size at each step based on the cumulative probability threshold, producing more natural text than fixed top-k sampling.
  • Concurrency Control via Semaphores -- The worker uses counting semaphores to bound the number of concurrent inference requests, which is critical for GPU memory management where exceeding available memory causes immediate process failure.
  • Heartbeat-Based Health Monitoring -- Workers implement the client side of the heartbeat protocol, periodically confirming liveness to the controller. This follows the push-based failure detection model where the monitored entity is responsible for proving it is alive.
