Implementation:OpenGVLab InternVL LLaVA Model Worker

Knowledge Sources	OpenGVLab_InternVL
Domains	Serving, Inference, LLaVA
Last Updated	2026-02-07 14:00 GMT

Overview

Model worker that loads a LLaVA model and serves streaming inference requests via a FastAPI endpoint, registering with a central controller for distributed deployment.

Description

The ModelWorker class handles the complete lifecycle of a model serving instance. On initialization, it loads the model via load_pretrained_model() (with optional 8-bit or 4-bit quantization), treats the model as multimodal if its name contains "llava" or "intern", registers with the controller, and starts a heartbeat thread.

The generate_stream() method processes an inference request in four steps: it decodes the base64-encoded images and runs them through process_images(); builds the prompt, replacing image tokens as required (including the mm_use_im_start_end case); tokenizes the prompt with tokenizer_image_token(); and runs generation in a background thread, reading from a TextIteratorStreamer so that output can be streamed as it is produced.

Concurrency is limited by an asyncio.Semaphore (5 concurrent requests by default). The semaphore is released by a background task once the response has completed. The generate_stream_gate() wrapper catches ValueError, CudaError, and any other exception raised during generation. The FastAPI endpoints include /worker_generate_stream and /worker_get_status.
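
The semaphore-based concurrency limit described above can be illustrated with a minimal, standard-library sketch. It is not the worker's code: handle_request, run_demo, and the sleep that stands in for streaming a response are all hypothetical, and the release happens in a finally block here, whereas the worker releases the semaphore from a background task after the response completes.

```python
import asyncio

LIMIT = 5  # mirrors the documented default of --limit-model-concurrency


async def handle_request(semaphore, request_id, active, peak):
    """Hypothetical request handler: wait for a slot, do work, free the slot."""
    await semaphore.acquire()
    try:
        active[0] += 1
        peak[0] = max(peak[0], active[0])
        await asyncio.sleep(0.01)  # stand-in for streaming a model response
        return request_id
    finally:
        active[0] -= 1
        semaphore.release()  # assumption: released inline rather than in a background task


async def run_demo(n_requests=12, limit=LIMIT):
    """Fire more requests than the limit and record the peak concurrency."""
    semaphore = asyncio.Semaphore(limit)
    active, peak = [0], [0]
    results = await asyncio.gather(
        *(handle_request(semaphore, i, active, peak) for i in range(n_requests))
    )
    return results, peak[0]


results, peak = asyncio.run(run_demo())
```

Requests beyond the limit wait inside acquire() instead of being rejected, which is why the worker also reports a queue length in its status.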

Usage

Use this worker to deploy a LLaVA model as a serving backend that accepts streaming generation requests from the controller or directly from clients.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/serve/model_worker.py
Lines: 1-285

Signature

class ModelWorker:
    def __init__(self, controller_addr, worker_addr, worker_id, no_register,
                 model_path, model_base, model_name, load_8bit, load_4bit, device): ...
    def register_to_controller(self): ...
    def send_heart_beat(self): ...
    def get_queue_length(self): ...
    def get_status(self): ...
    @torch.inference_mode()
    def generate_stream(self, params): ...
    def generate_stream_gate(self, params): ...
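
As a rough illustration of how generate_stream() can stream output while generation runs in a background thread, the sketch below uses only the standard library. QueueStreamer is a hypothetical stand-in for TextIteratorStreamer, fake_generate is a placeholder for the model's generate call, and yielding the accumulated text on every step is an assumption made for the example.

```python
import queue
import threading


class QueueStreamer:
    """Hypothetical stand-in for a text streamer backed by a thread-safe queue."""

    _END = object()  # sentinel that marks the end of generation

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer emits something
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(pieces, streamer):
    """Placeholder for the model's generate call: pushes text pieces to the streamer."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()


def generate_stream(pieces):
    """Run generation in a background thread and yield the text produced so far."""
    streamer = QueueStreamer()
    thread = threading.Thread(target=fake_generate, args=(pieces, streamer))
    thread.start()
    generated = ""
    for piece in streamer:
        generated += piece
        yield generated
    thread.join()


outputs = list(generate_stream(["Hello", ",", " world"]))
```

The consumer only blocks on the queue, so the first partial result can be sent to the client before generation has finished.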

Import

# Standalone server script:
# python -m llava.serve.model_worker --controller-address http://localhost:21001 --port 21002 --model-path <path>

I/O Contract

Inputs

Name	Type	Required	Description
--model-path	str	Yes	Path to the LLaVA model
--controller-address	str	No	Controller URL (default: http://localhost:21001)
--worker-address	str	No	This worker's URL (default: http://localhost:21002)
--port	int	No	Server port (default: 21002)
--limit-model-concurrency	int	No	Max concurrent requests (default: 5)
--load-8bit	flag	No	Enable 8-bit quantization
--load-4bit	flag	No	Enable 4-bit quantization
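
The table above maps naturally onto an argparse definition. The following is a sketch that mirrors the table only, not the script's actual parser: build_parser is a hypothetical helper, and the real script may define further flags or different defaults.

```python
import argparse


def build_parser():
    """Hypothetical parser that reproduces the flags and defaults listed in the table."""
    parser = argparse.ArgumentParser(description="Sketch of the model worker CLI")
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--controller-address", type=str, default="http://localhost:21001")
    parser.add_argument("--worker-address", type=str, default="http://localhost:21002")
    parser.add_argument("--port", type=int, default=21002)
    parser.add_argument("--limit-model-concurrency", type=int, default=5)
    parser.add_argument("--load-8bit", action="store_true")
    parser.add_argument("--load-4bit", action="store_true")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(
    ["--model-path", "liuhaotian/llava-v1.5-7b", "--load-4bit"]
)
```

argparse converts the dashes in flag names to underscores, so the values are read as args.model_path, args.limit_model_concurrency, args.load_4bit, and so on.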

Outputs

Name	Type	Description
/worker_generate_stream	StreamingResponse	Streaming JSON chunks with generated text and error codes
/worker_get_status	JSON	Worker status including model names, speed, and queue length
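
To make the "streaming JSON chunks with generated text and error codes" contract concrete, here is a hedged sketch of one way such chunks could be produced and parsed. The "text" and "error_code" field names, the null-byte delimiter, the error code value 1, and the helper names encode_chunk, safe_stream, and parse_stream are all assumptions for illustration. The wrapper loosely mirrors the role of generate_stream_gate() by turning exceptions into a final error chunk.

```python
import json

DELIMITER = b"\0"  # assumption: chunks are separated by a null byte


def encode_chunk(text, error_code=0):
    """Serialize one chunk; the field names are assumed, not taken from the source."""
    return json.dumps({"text": text, "error_code": error_code}).encode() + DELIMITER


def safe_stream(generator_fn):
    """Wrap a text generator so that failures become an error chunk, not a broken stream."""
    try:
        for text in generator_fn():
            yield encode_chunk(text)
    except ValueError as exc:
        yield encode_chunk(f"invalid request: {exc}", error_code=1)  # hypothetical code
    except Exception as exc:
        yield encode_chunk(f"server error: {exc}", error_code=1)  # hypothetical code


def parse_stream(raw):
    """Client side: split the byte stream on the delimiter and decode each chunk."""
    return [json.loads(part) for part in raw.split(DELIMITER) if part]


def flaky_generator():
    """Demo generator that emits one piece of text and then fails."""
    yield "Hi"
    raise ValueError("boom")


chunks = parse_stream(b"".join(safe_stream(flaky_generator)))
```

A client following this scheme would check error_code on every chunk and stop reading as soon as it is non-zero.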

Usage Examples

Basic Usage

# Start a model worker:
# python -m llava.serve.model_worker \
#     --controller-address http://localhost:21001 \
#     --model-path liuhaotian/llava-v1.5-7b \
#     --port 21002

# The worker auto-registers with the controller and starts serving
# Inference requests arrive as JSON with "prompt" and optional "images" (base64)
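
A request body of the shape described above ("prompt" plus optional base64 "images") could be assembled as in this sketch. build_payload and decode_images are hypothetical helpers, the "<image>" placeholder in the prompt is an assumption about the image token, and the bytes used are a placeholder rather than a valid image. A real request may need further fields that this page does not list.

```python
import base64
import json


def build_payload(prompt, image_bytes_list=()):
    """Build a JSON-serializable request with an optional list of base64 images."""
    payload = {"prompt": prompt}
    if image_bytes_list:
        payload["images"] = [
            base64.b64encode(data).decode("ascii") for data in image_bytes_list
        ]
    return payload


def decode_images(payload):
    """Mirror of the server-side step: turn base64 strings back into raw bytes."""
    return [base64.b64decode(item) for item in payload.get("images", [])]


# Placeholder bytes standing in for an image file read from disk.
fake_image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8

payload = build_payload("<image>\nDescribe the picture.", [fake_image])
body = json.dumps(payload)  # this string would be sent as the POST body
```

Encoding the images as base64 text keeps the whole request valid JSON, at the cost of roughly a third more bytes than the raw image data.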
