Implementation: Haotian Liu LLaVA ModelWorker Class
Overview
Concrete tool for serving a LLaVA model as an inference worker with streaming generation. The ModelWorker loads a model into GPU memory and exposes it via FastAPI endpoints.
Source
- File: `llava/serve/model_worker.py`
- Lines: L44-219 (`ModelWorker` class), L252-288 (main + FastAPI routes)
Signature
```python
class ModelWorker:
    def __init__(self, controller_addr, worker_addr, worker_id, no_register,
                 model_path, model_base, model_name,
                 load_8bit, load_4bit, device, use_flash_attn=False):
        """
        Initialize worker: load model, register with controller, start heartbeat.
        """

    def register_to_controller(self):
        """Send registration request to the controller with worker metadata."""

    def send_heart_beat(self):
        """Send periodic heartbeat to the controller (every 30 seconds)."""

    @torch.inference_mode()
    def generate_stream(self, params):
        """
        Generate text response in streaming mode.
        Processes images, constructs prompt, runs generation in a thread
        with TextIteratorStreamer.
        """

    def get_status(self) -> dict:
        """Return worker status including model names, speed, and queue length."""
```
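The register-then-heartbeat lifecycle can be sketched independently of the model code. A minimal sketch: the 30-second interval comes from the source, but the `HeartbeatSender` class and its callback are illustrative stand-ins, not LLaVA's actual API.

```python
import threading

HEARTBEAT_INTERVAL = 30  # seconds, per the source; shortened for testing

class HeartbeatSender:
    """Illustrative stand-in for the worker's heartbeat loop:
    calls send_fn() on a daemon thread at a fixed interval."""

    def __init__(self, send_fn, interval=HEARTBEAT_INTERVAL):
        self.send_fn = send_fn
        self.interval = interval
        self._stop = threading.Event()
        # Daemon thread so a hung heartbeat never blocks process shutdown
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self.thread.start()

    def _loop(self):
        # Event.wait doubles as an interruptible sleep
        while not self._stop.wait(self.interval):
            self.send_fn()

    def stop(self):
        self._stop.set()
        self.thread.join()
```

Running the heartbeat on a daemon thread mirrors the source's design: the controller evicts workers whose heartbeats stop, so a crashed worker is dropped automatically.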
CLI Usage
```shell
python -m llava.serve.model_worker \
    --controller http://localhost:21001 \
    --port 40000 \
    --model-path liuhaotian/llava-v1.5-13b
```

With quantization:

```shell
python -m llava.serve.model_worker \
    --controller http://localhost:21001 \
    --port 40000 \
    --model-path liuhaotian/llava-v1.5-13b \
    --load-4bit
```

With LoRA adapter:

```shell
python -m llava.serve.model_worker \
    --controller http://localhost:21001 \
    --port 40000 \
    --model-path /path/to/lora-adapter \
    --model-base liuhaotian/llava-v1.5-13b
```
Import
```python
from llava.serve.model_worker import ModelWorker
```
Inputs
| Parameter | Type | Description |
|---|---|---|
| `controller_addr` | str | URL of the controller (e.g., `http://localhost:21001`) |
| `worker_addr` | str | This worker's address (e.g., `http://localhost:40000`) |
| `model_path` | str | Path or HuggingFace ID of the model checkpoint |
| `model_base` | str | Base model path (required for LoRA adapters) |
| `load_8bit` | bool | Enable 8-bit quantization |
| `load_4bit` | bool | Enable 4-bit NF4 quantization |
| `device` | str | Device to load the model on (default: `cuda`) |
| `use_flash_attn` | bool | Enable Flash Attention 2 |
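The `load_4bit` flag is wired to bitsandbytes inside `load_pretrained_model()`. A hedged sketch of what such a config plausibly looks like; the field values below illustrate NF4 quantization via `transformers.BitsAndBytesConfig` and are not copied from the LLaVA source.

```python
# Illustrative only: how a --load-4bit flag can map to an NF4
# bitsandbytes config. Exact values in LLaVA's builder may differ.
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 data type named above
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16, store in 4-bit
)
```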
Outputs
A running HTTP worker server exposing the following endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/worker_generate_stream` | POST | Stream generated text as chunked JSON messages |
| `/worker_get_status` | POST | Return worker status (model names, speed, queue length) |
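A client consuming `/worker_generate_stream` has to reassemble messages from the byte stream. The sketch below assumes the FastChat-style framing where each JSON message is terminated by a NUL byte (`b"\0"`); verify the delimiter against your worker before relying on it.

```python
import json

def iter_worker_chunks(byte_stream):
    """Reassemble NUL-delimited JSON messages from a chunked byte stream.

    Assumption: each message is json.dumps(...).encode() + b"\\0",
    and HTTP chunk boundaries may fall anywhere inside a message.
    """
    buf = b""
    for chunk in byte_stream:
        buf += chunk
        while b"\0" in buf:
            msg, buf = buf.split(b"\0", 1)
            if msg:
                yield json.loads(msg.decode("utf-8"))
```

With `requests`, the same generator would wrap `response.iter_content(chunk_size=None)`; buffering across chunk boundaries is what makes the parser robust to arbitrary splits.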
Description
The ModelWorker class is the workhorse of LLaVA's distributed serving architecture. Each instance:
- Loads a LLaVA model using `load_pretrained_model()` at initialization
- Registers with the controller, providing its address and model metadata
- Starts a background heartbeat thread (30-second interval)
- Serves streaming inference requests via `generate_stream()`
The `generate_stream()` method handles the full inference pipeline:
- Decodes base64 images from the request
- Processes images through the CLIP preprocessor
- Constructs the tokenized prompt with image token placeholders
- Launches generation in a separate thread with `TextIteratorStreamer`
- Yields tokens as they are produced
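The thread-plus-streamer pattern above can be sketched with a queue-backed stand-in: the real `TextIteratorStreamer` comes from `transformers`, and `fake_generate` below is an illustrative substitute for `model.generate(..., streamer=streamer)`.

```python
import queue
import threading

class IteratorStreamer:
    """Toy stand-in for transformers.TextIteratorStreamer: the
    generation thread puts tokens; the caller iterates them."""

    _END = object()  # sentinel marking end of generation

    def __init__(self):
        self._q = queue.Queue()

    def put(self, token):
        self._q.put(token)

    def end(self):
        self._q.put(self._END)

    def __iter__(self):
        while True:
            item = self._q.get()  # blocks until the producer yields a token
            if item is self._END:
                return
            yield item

def fake_generate(prompt, streamer):
    # Illustrative substitute for model.generate(..., streamer=streamer)
    for tok in prompt.split():
        streamer.put(tok + " ")
    streamer.end()

streamer = IteratorStreamer()
thread = threading.Thread(target=fake_generate, args=("a b c", streamer))
thread.start()
generated = "".join(streamer)  # consumes tokens as the thread produces them
thread.join()
```

Running generation on a worker thread is what lets the endpoint yield partial text while the model is still decoding: the HTTP handler iterates the streamer instead of waiting for `generate()` to return.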
An asyncio semaphore (default limit: 5) prevents GPU memory exhaustion from concurrent requests.
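The concurrency cap can be demonstrated in isolation. The limit of 5 comes from the text; the coroutine names and the peak-tracking counter below are illustrative, with `asyncio.sleep` standing in for GPU generation.

```python
import asyncio

LIMIT = 5  # default concurrency limit stated above

async def guarded_generate(sem, counter, results):
    async with sem:  # at most LIMIT coroutines pass this point at once
        counter["active"] += 1
        counter["peak"] = max(counter["peak"], counter["active"])
        await asyncio.sleep(0.01)  # stand-in for GPU generation work
        counter["active"] -= 1
        results.append("done")

async def main(n_requests=12):
    sem = asyncio.Semaphore(LIMIT)
    counter = {"active": 0, "peak": 0}
    results = []
    await asyncio.gather(*(guarded_generate(sem, counter, results)
                           for _ in range(n_requests)))
    return counter["peak"], len(results)

peak, done = asyncio.run(main())
```

Requests beyond the limit are not rejected; they simply queue on the semaphore, which is why the worker also reports its queue length in `get_status()`.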
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | Model_Serving, Streaming_Inference |
| Last Updated | 2026-02-13 14:00 GMT |