
Implementation:Haotian Liu LLaVA ModelWorker Class

From Leeroopedia

Overview

Concrete tool for serving a LLaVA model as an inference worker with streaming generation. The ModelWorker loads a model into GPU memory and exposes it via FastAPI endpoints.

Source

  • File: llava/serve/model_worker.py
  • Lines: L44-219 (ModelWorker class), L252-288 (main + FastAPI routes)

Signature

class ModelWorker:
    def __init__(self, controller_addr, worker_addr, worker_id, no_register,
                 model_path, model_base, model_name,
                 load_8bit, load_4bit, device, use_flash_attn=False):
        """
        Initialize worker: load model, register with controller, start heartbeat.
        """

    def register_to_controller(self):
        """Send registration request to the controller with worker metadata."""

    def send_heart_beat(self):
        """Send periodic heartbeat to the controller (every 30 seconds)."""

    @torch.inference_mode()
    def generate_stream(self, params):
        """
        Generate text response in streaming mode.
        Processes images, constructs prompt, runs generation in a thread
        with TextIteratorStreamer.
        """

    def get_status(self) -> dict:
        """Return worker status including model names, speed, and queue length."""

CLI Usage

python -m llava.serve.model_worker \
    --controller http://localhost:21001 \
    --port 40000 \
    --model-path liuhaotian/llava-v1.5-13b

With quantization:

python -m llava.serve.model_worker \
    --controller http://localhost:21001 \
    --port 40000 \
    --model-path liuhaotian/llava-v1.5-13b \
    --load-4bit

With LoRA adapter:

python -m llava.serve.model_worker \
    --controller http://localhost:21001 \
    --port 40000 \
    --model-path /path/to/lora-adapter \
    --model-base liuhaotian/llava-v1.5-13b

Import

from llava.serve.model_worker import ModelWorker

Inputs

  • controller_addr (str): URL of the controller (e.g., http://localhost:21001)
  • worker_addr (str): This worker's address (e.g., http://localhost:40000)
  • model_path (str): Path or HuggingFace ID of the model checkpoint
  • model_base (str): Base model path (required for LoRA adapters)
  • load_8bit (bool): Enable 8-bit quantization
  • load_4bit (bool): Enable 4-bit NF4 quantization
  • device (str): Device to load the model on (default: cuda)
  • use_flash_attn (bool): Enable Flash Attention 2

Outputs

A running HTTP worker server exposing the following endpoints:

  • /worker_generate_stream (POST): Stream generated text as JSON chunks over a streaming HTTP response
  • /worker_get_status (POST): Return worker status (model names, speed, queue length)
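A note on consuming the stream: in the repo's implementation each streamed chunk is a JSON object terminated by a null byte rather than a standard SSE event. A minimal client-side parser under that framing assumption (parse_worker_stream is an illustrative helper, not part of LLaVA):

```python
import json

def parse_worker_stream(raw: bytes):
    """Split a null-delimited byte stream into decoded JSON chunks."""
    chunks = []
    for part in raw.split(b"\0"):
        if part:  # skip the empty segment after the trailing delimiter
            chunks.append(json.loads(part))
    return chunks
```

In the repo's implementation, each chunk carries a "text" field with the accumulated output so far and an "error_code" field that is 0 on success.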

Description

The ModelWorker class is the workhorse of LLaVA's distributed serving architecture. Each instance:

  1. Loads a LLaVA model using load_pretrained_model() at initialization
  2. Registers with the controller, providing its address and model metadata
  3. Starts a background heartbeat thread (30-second interval)
  4. Serves streaming inference requests via generate_stream()
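The heartbeat in step 3 can be sketched with a plain daemon thread; the interval mirrors the worker's 30-second default, while start_heartbeat and send_fn are illustrative names, not the repo's API:

```python
import threading

HEARTBEAT_INTERVAL = 30  # seconds, matching the worker's default

def start_heartbeat(send_fn, interval=HEARTBEAT_INTERVAL):
    """Call send_fn every `interval` seconds in a background daemon thread.

    Returns an Event; set it to stop the loop.
    """
    stop_event = threading.Event()

    def loop():
        # wait() returns False on timeout, True once stop_event is set
        while not stop_event.wait(interval):
            send_fn()  # e.g. POST this worker's status to the controller

    threading.Thread(target=loop, daemon=True).start()
    return stop_event
```

The daemon flag ensures the heartbeat thread does not block process shutdown, and using Event.wait() instead of time.sleep() lets the loop exit promptly when stopped.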

The generate_stream() method handles the full inference pipeline:

  • Decodes base64 images from the request
  • Processes images through the CLIP preprocessor
  • Constructs the tokenized prompt with image token placeholders
  • Launches generation in a separate thread with TextIteratorStreamer
  • Yields tokens as they are produced
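The thread-plus-streamer pattern above can be illustrated without loading a model: the producer runs in a worker thread and pushes tokens into a queue while the caller consumes them as they arrive. In the real worker this role is played by transformers' TextIteratorStreamer; stream_generate below is a stdlib-only sketch of the same idea:

```python
import queue
import threading

_SENTINEL = object()  # marks end of generation

def stream_generate(produce_tokens):
    """Run `produce_tokens` in a thread; yield its tokens as they arrive."""
    q = queue.Queue()

    def worker():
        for tok in produce_tokens():  # stands in for model.generate(...)
            q.put(tok)
        q.put(_SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        tok = q.get()  # blocks until the producer emits the next token
        if tok is _SENTINEL:
            break
        yield tok
```

The caller sees tokens with generation still in flight, which is what lets the HTTP endpoint flush partial output to the client before decoding finishes.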

An asyncio semaphore (default limit: 5) prevents GPU memory exhaustion from concurrent requests.
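A minimal sketch of that guard, using asyncio.Semaphore with the same limit of 5 (run_limited is an illustrative wrapper, not the worker's actual code, which applies the semaphore inside its FastAPI route):

```python
import asyncio

LIMIT = 5  # matches the worker's default concurrency cap

async def run_limited(coro_fns):
    """Run coroutine factories with at most LIMIT executing concurrently."""
    semaphore = asyncio.Semaphore(LIMIT)

    async def guarded(fn):
        async with semaphore:  # waits here when LIMIT requests are in flight
            return await fn()

    return await asyncio.gather(*(guarded(fn) for fn in coro_fns))
```

Requests beyond the limit queue on the semaphore rather than failing, so a burst of clients degrades to higher latency instead of out-of-memory errors on the GPU.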

Metadata

  • Knowledge Sources: Repo - LLaVA (https://github.com/haotian-liu/LLaVA)
  • Domains: Model_Serving, Streaming_Inference
  • Last Updated: 2026-02-13 14:00 GMT
