Implementation:OpenGVLab InternVL LLaVA Model Worker
| Knowledge Sources | |
|---|---|
| Domains | Serving, Inference, LLaVA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
A model worker that loads a LLaVA model, serves streaming inference requests through FastAPI endpoints, and registers with a central controller for distributed deployment.
Description
The ModelWorker class handles the complete lifecycle of a model serving instance. On initialization, it loads the model via load_pretrained_model() (with optional 8-bit or 4-bit quantization), decides whether the model is multimodal by checking for "llava" or "intern" in the model name, registers with the controller, and starts a heartbeat thread.
The generate_stream() method processes inference requests. It decodes base64-encoded images, runs them through process_images(), rewrites the image tokens in the prompt (wrapping them in start/end tokens when mm_use_im_start_end is set), tokenizes the prompt with tokenizer_image_token(), and runs generation in a background thread, using TextIteratorStreamer to stream the output as it is produced.
Concurrency is managed through an asyncio.Semaphore (default limit of 5 concurrent requests); the semaphore is released by a background task once the response has completed. The generate_stream_gate() wrapper catches ValueError, CudaError, and generic exceptions so that failures are reported to the client instead of breaking the stream. The FastAPI endpoints are /worker_generate_stream and /worker_get_status.
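The image-token rewriting step can be sketched as below. This is a minimal illustration rather than the worker's actual code: the token strings `<image>`, `<im_start>`, and `<im_end>`, the helper name `expand_image_tokens`, and the image-count check are assumptions modeled on upstream LLaVA conventions, so check them against the repository before relying on them.

```python
# Hypothetical constants (assumed); the real values live in the repository's constants module.
IMAGE_TOKEN = "<image>"
IM_START_TOKEN = "<im_start>"
IM_END_TOKEN = "<im_end>"


def expand_image_tokens(prompt: str, num_images: int, use_im_start_end: bool) -> str:
    """Return the prompt with each image placeholder optionally wrapped in start/end markers."""
    # Assumed sanity check: one placeholder per supplied image.
    if prompt.count(IMAGE_TOKEN) != num_images:
        raise ValueError("Number of images does not match number of image tokens in prompt")
    replacement = IMAGE_TOKEN
    if use_im_start_end:
        # Mirrors the mm_use_im_start_end behaviour described above.
        replacement = IM_START_TOKEN + IMAGE_TOKEN + IM_END_TOKEN
    return prompt.replace(IMAGE_TOKEN, replacement)
```

A mismatch between images and placeholders raises ValueError here, which is the kind of error the generate_stream_gate() wrapper is described as catching.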
Usage
Use this worker to deploy a LLaVA model as a serving backend that accepts streaming generation requests from the controller or directly from clients.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/serve/model_worker.py
- Lines: 1-285
Signature
class ModelWorker:
    def __init__(self, controller_addr, worker_addr, worker_id, no_register,
                 model_path, model_base, model_name, load_8bit, load_4bit, device): ...
    def register_to_controller(self): ...
    def send_heart_beat(self): ...
    def get_queue_length(self): ...
    def get_status(self): ...
    @torch.inference_mode()
    def generate_stream(self, params): ...
    def generate_stream_gate(self, params): ...
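The background-thread streaming used by generate_stream() is a producer/consumer pattern. The sketch below reproduces that pattern with the standard library only, so it runs without a model. `QueueStreamer` is a hypothetical stand-in for TextIteratorStreamer, `fake_generate` stands in for the model's generate call, and accumulating the new text onto the prompt is an assumption about how the worker builds each chunk.

```python
import queue
import threading


class QueueStreamer:
    """Hypothetical stand-in for TextIteratorStreamer: a blocking iterator fed by another thread."""

    _END = object()  # sentinel marking the end of generation

    def __init__(self, timeout=5.0):
        self._queue = queue.Queue()
        self._timeout = timeout

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        # Blocks until the producer thread supplies the next piece (or times out).
        item = self._queue.get(timeout=self._timeout)
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(words, streamer):
    """Stand-in for the model's generate call: pushes text pieces, then signals completion."""
    for word in words:
        streamer.put(word + " ")
    streamer.end()


def stream_generated_text(prompt, words):
    """Yield the accumulated text after each new piece arrives from the producer thread."""
    streamer = QueueStreamer()
    thread = threading.Thread(target=fake_generate, args=(words, streamer))
    thread.start()
    generated = prompt
    for piece in streamer:
        generated += piece
        yield generated
    thread.join()
```

Running generation in its own thread lets the request handler start yielding text as soon as the first piece is available, instead of waiting for the full completion.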
Import
# Standalone server script:
# python -m llava.serve.model_worker --controller-address http://localhost:21001 --port 21002 --model-path <path>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model-path | str | Yes | Path to the LLaVA model |
| --controller-address | str | No | Controller URL (default: http://localhost:21001) |
| --worker-address | str | No | This worker's URL (default: http://localhost:21002) |
| --port | int | No | Server port (default: 21002) |
| --limit-model-concurrency | int | No | Max concurrent requests (default: 5) |
| --load-8bit | flag | No | Enable 8-bit quantization |
| --load-4bit | flag | No | Enable 4-bit quantization |
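The effect of --limit-model-concurrency can be pictured with a small asyncio sketch. It is an illustration under stated assumptions, not the worker's code: `ToyWorker` and `fake_generate_stream` are hypothetical names, and the semaphore is released in a `finally` block here for simplicity, whereas the worker is described as releasing it from a background task after the response completes.

```python
import asyncio


async def fake_generate_stream(prompt):
    """Hypothetical stand-in for the model: yields progressively longer text."""
    text = ""
    for word in prompt.split():
        text = (text + " " + word).strip()
        await asyncio.sleep(0)  # give other requests a chance to run
        yield text


class ToyWorker:
    def __init__(self, limit=5):  # 5 mirrors the documented default
        self.semaphore = asyncio.Semaphore(limit)
        self.active = 0  # requests currently holding the semaphore
        self.peak = 0    # highest value of `active` seen so far

    async def handle(self, prompt):
        await self.semaphore.acquire()  # waits when `limit` requests are in flight
        self.active += 1
        self.peak = max(self.peak, self.active)
        chunks = []
        try:
            async for chunk in fake_generate_stream(prompt):
                chunks.append(chunk)
        finally:
            self.active -= 1
            self.semaphore.release()
        return chunks


async def demo(n_requests=12, limit=5):
    worker = ToyWorker(limit)
    results = await asyncio.gather(*(worker.handle("a b c") for _ in range(n_requests)))
    return worker.peak, results
```

Calling `asyncio.run(demo())` submits 12 requests at once; the recorded peak never exceeds the limit, and the remaining requests wait their turn.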
Outputs
| Name | Type | Description |
|---|---|---|
| /worker_generate_stream | StreamingResponse | Streaming JSON chunks with generated text and error codes |
| /worker_get_status | JSON | Worker status including model names, speed, and queue length |
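A client has to split the streamed bytes from /worker_generate_stream back into individual JSON messages. The sketch below shows one way to do that. It assumes each message is terminated by a null byte and carries "text" and "error_code" keys, as in upstream LLaVA workers; neither detail is confirmed by this page, and `iter_stream_chunks` is a hypothetical helper name.

```python
import json


def iter_stream_chunks(byte_chunks, delimiter=b"\0"):
    """Yield one decoded JSON object per delimiter-terminated message.

    `byte_chunks` is any iterable of bytes; a message may be split across chunks.
    The null-byte delimiter is an assumption, not something this page documents.
    """
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Emit every complete message currently in the buffer.
        while delimiter in buffer:
            message, buffer = buffer.split(delimiter, 1)
            if message:
                yield json.loads(message.decode("utf-8"))
```

With an HTTP client that exposes the response body as an iterator of byte chunks, the same function can be fed that iterator directly; a non-zero error code in a message would indicate a failure reported by generate_stream_gate().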
Usage Examples
Basic Usage
# Start a model worker:
# python -m llava.serve.model_worker \
# --controller-address http://localhost:21001 \
# --model-path liuhaotian/llava-v1.5-7b \
# --port 21002
# The worker auto-registers with the controller and starts serving
# Inference requests arrive as JSON with "prompt" and optional "images" (base64)
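A request body of the shape described above can be assembled as follows. This is a minimal sketch: only the "prompt" and "images" fields come from this page, `build_request` is a hypothetical helper, and any sampling parameters the worker may also accept are deliberately left out because they are not documented here.

```python
import base64
import json


def build_request(prompt, image_bytes_list=()):
    """Build the JSON body: a prompt plus optional base64-encoded images."""
    payload = {"prompt": prompt}
    if image_bytes_list:
        # Each image is sent as a base64 string, matching the decoding step in generate_stream().
        payload["images"] = [
            base64.b64encode(image_bytes).decode("ascii")
            for image_bytes in image_bytes_list
        ]
    return json.dumps(payload)
```

The resulting string would be sent as the POST body to /worker_generate_stream on the worker's address (http://localhost:21002 with the defaults listed in the Inputs table).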