Implementation: Haotian Liu LLaVA ModelWorker Class
Overview
Concrete tool for serving a LLaVA model as an inference worker with streaming generation. The ModelWorker loads a model into GPU memory and exposes it via FastAPI endpoints.
Source
- File: `llava/serve/model_worker.py`
- Lines: L44-219 (`ModelWorker` class), L252-288 (main + FastAPI routes)
Signature
```python
class ModelWorker:
    def __init__(self, controller_addr, worker_addr, worker_id, no_register,
                 model_path, model_base, model_name,
                 load_8bit, load_4bit, device, use_flash_attn=False):
        """
        Initialize worker: load model, register with controller, start heartbeat.
        """

    def register_to_controller(self):
        """Send registration request to the controller with worker metadata."""

    def send_heart_beat(self):
        """Send periodic heartbeat to the controller (every 30 seconds)."""

    @torch.inference_mode()
    def generate_stream(self, params):
        """
        Generate text response in streaming mode.
        Processes images, constructs prompt, runs generation in a thread
        with TextIteratorStreamer.
        """

    def get_status(self) -> dict:
        """Return worker status including model names, speed, and queue length."""
```
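The register-then-heartbeat lifecycle can be sketched independently of the model code. A minimal sketch: the 30-second interval comes from the source, but the `HeartbeatSender` class and its callback are illustrative stand-ins, not LLaVA's actual API.

```python
import threading

HEARTBEAT_INTERVAL = 30  # seconds, per the source; shortened for testing

class HeartbeatSender:
    """Illustrative stand-in for the worker's heartbeat loop:
    calls send_fn() on a daemon thread at a fixed interval."""

    def __init__(self, send_fn, interval=HEARTBEAT_INTERVAL):
        self.send_fn = send_fn
        self.interval = interval
        self._stop = threading.Event()
        # Daemon thread so a hung heartbeat never blocks process shutdown
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self.thread.start()

    def _loop(self):
        # Event.wait doubles as an interruptible sleep
        while not self._stop.wait(self.interval):
            self.send_fn()

    def stop(self):
        self._stop.set()
        self.thread.join()
```

Running the heartbeat on a daemon thread mirrors the source's design: the controller evicts workers whose heartbeats stop, so a crashed worker is dropped automatically.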
CLI Usage
```shell
python -m llava.serve.model_worker \
    --controller http://localhost:21001 \
    --port 40000 \
    --model-path liuhaotian/llava-v1.5-13b
```

With quantization:

```shell
python -m llava.serve.model_worker \
    --controller http://localhost:21001 \
    --port 40000 \
    --model-path liuhaotian/llava-v1.5-13b \
    --load-4bit
```

With LoRA adapter:

```shell
python -m llava.serve.model_worker \
    --controller http://localhost:21001 \
    --port 40000 \
    --model-path /path/to/lora-adapter \
    --model-base liuhaotian/llava-v1.5-13b
```
Import
```python
from llava.serve.model_worker import ModelWorker
```
Inputs
| Parameter | Type | Description |
|---|---|---|
| `controller_addr` | str | URL of the controller (e.g., `http://localhost:21001`) |
| `worker_addr` | str | This worker's address (e.g., `http://localhost:40000`) |
| `model_path` | str | Path or HuggingFace ID of the model checkpoint |
| `model_base` | str | Base model path (required for LoRA adapters) |
| `load_8bit` | bool | Enable 8-bit quantization |
| `load_4bit` | bool | Enable 4-bit NF4 quantization |
| `device` | str | Device to load the model on (default: `cuda`) |
| `use_flash_attn` | bool | Enable Flash Attention 2 |
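The `load_4bit` flag is wired to bitsandbytes inside `load_pretrained_model()`. A hedged sketch of what such a config plausibly looks like; the field values below illustrate NF4 quantization via `transformers.BitsAndBytesConfig` and are not copied from the LLaVA source.

```python
# Illustrative only: how a --load-4bit flag can map to an NF4
# bitsandbytes config. Exact values in LLaVA's builder may differ.
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 data type named above
    bnb_4bit_compute_dtype=torch.float16,   # compute in fp16, store in 4-bit
)
```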
Outputs
A running HTTP worker server exposing the following endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/worker_generate_stream` | POST | Stream generated text as chunked JSON messages |
| `/worker_get_status` | POST | Return worker status (model names, speed, queue length) |
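A client consuming `/worker_generate_stream` has to reassemble messages from the byte stream. The sketch below assumes the FastChat-style framing where each JSON message is terminated by a NUL byte (`b"\0"`); verify the delimiter against your worker before relying on it.

```python
import json

def iter_worker_chunks(byte_stream):
    """Reassemble NUL-delimited JSON messages from a chunked byte stream.

    Assumption: each message is json.dumps(...).encode() + b"\\0",
    and HTTP chunk boundaries may fall anywhere inside a message.
    """
    buf = b""
    for chunk in byte_stream:
        buf += chunk
        while b"\0" in buf:
            msg, buf = buf.split(b"\0", 1)
            if msg:
                yield json.loads(msg.decode("utf-8"))
```

With `requests`, the same generator would wrap `response.iter_content(chunk_size=None)`; buffering across chunk boundaries is what makes the parser robust to arbitrary splits.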
Description
The ModelWorker class is the workhorse of LLaVA's distributed serving architecture. Each instance:
- Loads a LLaVA model using `load_pretrained_model()` at initialization
- Registers with the controller, providing its address and model metadata
- Starts a background heartbeat thread (30-second interval)
- Serves streaming inference requests via `generate_stream()`
The `generate_stream()` method handles the full inference pipeline:
- Decodes base64 images from the request
- Processes images through the CLIP preprocessor
- Constructs the tokenized prompt with image token placeholders
- Launches generation in a separate thread with `TextIteratorStreamer`
- Yields tokens as they are produced
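The thread-plus-streamer pattern above can be sketched with a queue-backed stand-in: the real `TextIteratorStreamer` comes from `transformers`, and `fake_generate` below is an illustrative substitute for `model.generate(..., streamer=streamer)`.

```python
import queue
import threading

class IteratorStreamer:
    """Toy stand-in for transformers.TextIteratorStreamer: the
    generation thread puts tokens; the caller iterates them."""

    _END = object()  # sentinel marking end of generation

    def __init__(self):
        self._q = queue.Queue()

    def put(self, token):
        self._q.put(token)

    def end(self):
        self._q.put(self._END)

    def __iter__(self):
        while True:
            item = self._q.get()  # blocks until the producer yields a token
            if item is self._END:
                return
            yield item

def fake_generate(prompt, streamer):
    # Illustrative substitute for model.generate(..., streamer=streamer)
    for tok in prompt.split():
        streamer.put(tok + " ")
    streamer.end()

streamer = IteratorStreamer()
thread = threading.Thread(target=fake_generate, args=("a b c", streamer))
thread.start()
generated = "".join(streamer)  # consumes tokens as the thread produces them
thread.join()
```

Running generation on a worker thread is what lets the endpoint yield partial text while the model is still decoding: the HTTP handler iterates the streamer instead of waiting for `generate()` to return.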
An asyncio semaphore (default limit: 5) prevents GPU memory exhaustion from concurrent requests.
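The concurrency cap can be demonstrated in isolation. The limit of 5 comes from the text; the coroutine names and the peak-tracking counter below are illustrative, with `asyncio.sleep` standing in for GPU generation.

```python
import asyncio

LIMIT = 5  # default concurrency limit stated above

async def guarded_generate(sem, counter, results):
    async with sem:  # at most LIMIT coroutines pass this point at once
        counter["active"] += 1
        counter["peak"] = max(counter["peak"], counter["active"])
        await asyncio.sleep(0.01)  # stand-in for GPU generation work
        counter["active"] -= 1
        results.append("done")

async def main(n_requests=12):
    sem = asyncio.Semaphore(LIMIT)
    counter = {"active": 0, "peak": 0}
    results = []
    await asyncio.gather(*(guarded_generate(sem, counter, results)
                           for _ in range(n_requests)))
    return counter["peak"], len(results)

peak, done = asyncio.run(main())
```

Requests beyond the limit are not rejected; they simply queue on the semaphore, which is why the worker also reports its queue length in `get_status()`.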
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | Model_Serving, Streaming_Inference |
| Last Updated | 2026-02-13 14:00 GMT |