Principle:Haotian liu LLaVA Distributed Worker Control

Overview

Architecture pattern for managing distributed model inference workers through a centralized controller with heartbeat monitoring.

Description

The controller is a centralized dispatcher that tracks the model workers in a distributed LLaVA serving deployment and decides which worker handles each request.

The controller offers the following capabilities:

- Worker registration with heartbeat monitoring -- Workers register on startup and send periodic heartbeats. The controller removes any worker that has gone 90 seconds without a heartbeat, so only healthy workers receive traffic.
- Two dispatch methods:
  - Lottery -- Distributes requests in proportion to worker speed (useful for heterogeneous GPU setups where faster GPUs should handle more requests).
  - Shortest queue -- Routes each request to the worker with the fewest pending requests, keeping queueing delay low.
- Model listing aggregation -- Collects the models served by all registered workers into a single catalog for the frontend.
- Proxied inference requests -- The controller forwards inference requests from the frontend to the appropriate worker, hiding the worker topology from the user-facing interface.
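The registration and heartbeat behavior above can be sketched as a small in-memory registry. This is an illustrative sketch, not the LLaVA controller's code: the names `WorkerRegistry`, `WorkerInfo`, `receive_heartbeat`, and `remove_stale` are hypothetical, and the `now` argument exists only so the expiry logic can be exercised without waiting.

```python
import time
from dataclasses import dataclass, field

HEARTBEAT_EXPIRATION_S = 90  # expiration window described above


@dataclass
class WorkerInfo:
    # Hypothetical record kept per worker; field names are assumptions.
    model_names: list
    speed: int = 1
    queue_length: int = 0
    last_heartbeat: float = field(default_factory=time.time)


class WorkerRegistry:
    """Tracks registered workers and drops those whose heartbeat expired."""

    def __init__(self, expiration_s=HEARTBEAT_EXPIRATION_S):
        self.expiration_s = expiration_s
        self.workers = {}  # worker address -> WorkerInfo

    def register(self, address, model_names, speed=1, now=None):
        now = time.time() if now is None else now
        self.workers[address] = WorkerInfo(list(model_names), speed, 0, now)

    def receive_heartbeat(self, address, queue_length, now=None):
        info = self.workers.get(address)
        if info is None:
            return False  # unknown worker: the caller should re-register
        info.queue_length = queue_length
        info.last_heartbeat = time.time() if now is None else now
        return True

    def remove_stale(self, now=None):
        # Remove every worker whose last heartbeat is older than the window.
        now = time.time() if now is None else now
        cutoff = now - self.expiration_s
        stale = [a for a, w in self.workers.items() if w.last_heartbeat < cutoff]
        for address in stale:
            del self.workers[address]
        return stale
```

In a real deployment `remove_stale` would typically run on a background timer, and a worker whose heartbeat is rejected would register again.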

This architecture enables horizontal scaling of inference capacity by adding more model workers behind the controller without changing the frontend configuration.

Usage

Deploy a controller when serving LLaVA to multiple concurrent users or when running multiple model workers (different models or replicas). The controller acts as the single entry point for the Gradio web server or any other frontend.

A typical deployment topology:

[Gradio UI] --> [Controller :21001] --> [Worker A :40000] (llava-v1.5-13b)
                                    --> [Worker B :40001] (llava-v1.5-7b)
                                    --> [Worker C :40002] (llava-v1.5-13b, replica)
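As a rough illustration of how this topology maps to the aggregated model catalog, the sketch below stores the three workers in a dictionary and derives the model list and the candidate workers for a model. The `localhost` hostnames and the helper names `list_models` and `workers_for` are assumptions made for the example.

```python
# Worker address -> models it serves, mirroring the diagram above.
# The hostnames are assumed; only the ports and model names come from the diagram.
TOPOLOGY = {
    "http://localhost:40000": ["llava-v1.5-13b"],
    "http://localhost:40001": ["llava-v1.5-7b"],
    "http://localhost:40002": ["llava-v1.5-13b"],  # replica
}


def list_models(workers):
    """Return the de-duplicated, sorted set of models across all workers."""
    names = set()
    for model_names in workers.values():
        names.update(model_names)
    return sorted(names)


def workers_for(workers, model_name):
    """Return the addresses of every worker that serves model_name."""
    return [addr for addr, names in workers.items() if model_name in names]


if __name__ == "__main__":
    print(list_models(TOPOLOGY))
    print(workers_for(TOPOLOGY, "llava-v1.5-13b"))
```

The frontend sees two models even though three workers are running. A request for `llava-v1.5-13b` has two candidate workers, and the dispatch method chooses between them.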

Theoretical Basis

Shortest-queue dispatch routes each request to the worker with the fewest pending requests, which keeps queueing delay low. It works best when workers are homogeneous (same GPU, same model), because equal queue lengths then imply roughly equal wait times.
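A minimal sketch of shortest-queue selection, assuming the controller holds a mapping from worker address to pending-request count. The function names are hypothetical, and the sketch ignores worker speed.

```python
def pick_shortest_queue(queue_lengths):
    """Return the worker address with the fewest pending requests.

    queue_lengths maps worker address -> pending request count.
    Ties go to the first worker in dictionary order.
    """
    if not queue_lengths:
        return None
    return min(queue_lengths, key=queue_lengths.get)


def dispatch(queue_lengths):
    """Pick a worker and count the new request against its queue."""
    address = pick_shortest_queue(queue_lengths)
    if address is not None:
        queue_lengths[address] += 1
    return address
```

Incrementing the count at dispatch time matters: without it, a burst of requests arriving between heartbeats would all land on the same worker.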
Lottery dispatch distributes load in proportion to worker speed. Each worker's "ticket count" corresponds to its processing speed, so faster workers receive more requests in expectation. This suits heterogeneous GPU setups, where an even split would overload the slower GPUs.
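Lottery selection can be sketched as a weighted random draw in which each worker's speed is its number of tickets. The function name and the `speeds` mapping are assumptions for illustration, and the sketch uses the standard library's `random.choices` rather than any LLaVA-specific code.

```python
import random


def pick_lottery(speeds, rng=random):
    """Draw one worker address with probability proportional to its speed.

    speeds maps worker address -> relative speed (its ticket count).
    rng can be a seeded random.Random instance for reproducible draws.
    """
    addresses = list(speeds)
    weights = [speeds[a] for a in addresses]
    if not addresses or sum(weights) <= 0:
        return None  # no workers, or no worker holds any tickets
    return rng.choices(addresses, weights=weights, k=1)[0]
```

With speeds of 3 and 1, the faster worker should receive about three quarters of the requests over many draws, although any short run can deviate from that ratio.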
The heartbeat interval is 30 seconds with a 90-second expiration, so a worker is dropped after 3 consecutive missed heartbeats. This balances responsiveness to failures against network overhead and tolerates an occasional lost heartbeat.

Metadata

Field	Value
Knowledge Sources	Repo - LLaVA - https://github.com/haotian-liu/LLaVA
Domains	Distributed_Systems, Model_Serving
Last Updated	2026-02-13 14:00 GMT

Related Pages

Implementation:Haotian_liu_LLaVA_Controller_Class
