Implementation:OpenGVLab InternVL LLaVA Model Worker

From Leeroopedia
Revision as of 16:15, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/OpenGVLab_InternVL_LLaVA_Model_Worker.md)

Knowledge Sources
Domains: Serving, Inference, LLaVA
Last Updated: 2026-02-07 14:00 GMT

Overview

Model worker that loads a LLaVA model and serves streaming inference requests via a FastAPI endpoint, registering with a central controller for distributed deployment.

Description

The ModelWorker class manages the complete lifecycle of a model serving instance. On initialization it loads the model via load_pretrained_model() (optionally with 8-bit or 4-bit quantization), decides whether the model is multimodal by checking for "llava" or "intern" in the model name, and, unless no_register is set, registers with the controller and starts a heartbeat thread.

The generate_stream() method processes inference requests. It decodes base64-encoded images, passes them through process_images(), builds the prompt with the appropriate image-token replacements (honoring mm_use_im_start_end), tokenizes it with tokenizer_image_token(), and runs generation in a background thread, using TextIteratorStreamer to stream output as it is produced.

Concurrency is bounded by an asyncio.Semaphore (5 concurrent requests by default); the semaphore is released by a background task once the response has completed. The generate_stream_gate() wrapper adds error handling for ValueError, CudaError, and generic exceptions. The FastAPI endpoints are /worker_generate_stream and /worker_get_status.
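
The concurrency limit described above can be illustrated with the standard library alone. This is a minimal sketch, not the worker's actual code: handle_request and main are hypothetical names, the sleep stands in for model generation, and a finally block plays the role of the background task that releases the semaphore after the response completes.

```python
import asyncio

LIMIT_MODEL_CONCURRENCY = 5  # mirrors the documented default


async def handle_request(semaphore, request_id, active, peak):
    # Wait for a free slot before touching the model.
    await semaphore.acquire()
    try:
        active[0] += 1
        peak[0] = max(peak[0], active[0])
        await asyncio.sleep(0.01)  # stand-in for streaming generation
        return f"response-{request_id}"
    finally:
        active[0] -= 1
        # Assumption: the real worker releases the slot from a background
        # task after the streaming response finishes; `finally` mimics that.
        semaphore.release()


async def main(num_requests=12):
    semaphore = asyncio.Semaphore(LIMIT_MODEL_CONCURRENCY)
    active, peak = [0], [0]  # one-element lists used as mutable counters
    results = await asyncio.gather(
        *(handle_request(semaphore, i, active, peak) for i in range(num_requests))
    )
    return results, peak[0]


results, peak = asyncio.run(main())
```

Twelve requests are submitted at once, yet the peak counter never exceeds the limit; the remaining requests wait on the semaphore until a slot is released.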

Usage

Use this worker to deploy a LLaVA model as a serving backend that accepts streaming generation requests from the controller or directly from clients.

Code Reference

Source Location

Signature

class ModelWorker:
    def __init__(self, controller_addr, worker_addr, worker_id, no_register,
                 model_path, model_base, model_name, load_8bit, load_4bit, device): ...
    def register_to_controller(self): ...
    def send_heart_beat(self): ...
    def get_queue_length(self): ...
    def get_status(self): ...
    @torch.inference_mode()
    def generate_stream(self, params): ...
    def generate_stream_gate(self, params): ...
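
The relationship between generate_stream() and generate_stream_gate() can be sketched without a model. The code below is an illustration under stated assumptions: FakeStreamer is a hypothetical stand-in for TextIteratorStreamer, fake_generate replaces the model's generation call, and the chunk keys "text" and "error_code" and the error value 1 are assumed rather than taken from this page. The CudaError branch is omitted because it would require torch.

```python
import json
import queue
import threading


class FakeStreamer:
    """Hypothetical stand-in for TextIteratorStreamer (queue-backed iterator)."""

    _END = object()  # sentinel marking the end of generation

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        while True:
            item = self._queue.get()  # blocks until the producer emits text
            if item is self._END:
                return
            yield item


def fake_generate(prompt, streamer):
    # Stand-in for the model's generate call running in the background thread.
    for token in prompt.split():
        streamer.put(token.upper() + " ")
    streamer.end()


def generate_stream(params):
    if "prompt" not in params:
        raise ValueError("missing prompt")
    streamer = FakeStreamer()
    thread = threading.Thread(target=fake_generate, args=(params["prompt"], streamer))
    thread.start()
    generated = ""
    for new_text in streamer:
        generated += new_text
        # Assumed chunk shape: cumulative text plus an error code.
        yield json.dumps({"text": generated, "error_code": 0}).encode()
    thread.join()


def generate_stream_gate(params):
    # Turn exceptions into a final error chunk instead of breaking the stream.
    try:
        yield from generate_stream(params)
    except ValueError as exc:
        yield json.dumps({"text": f"invalid request: {exc}", "error_code": 1}).encode()
    except Exception as exc:
        yield json.dumps({"text": f"server error: {exc}", "error_code": 1}).encode()


ok_chunks = [json.loads(c) for c in generate_stream_gate({"prompt": "hello world"})]
bad_chunks = [json.loads(c) for c in generate_stream_gate({})]
```

Because generation runs in its own thread, the generator can yield each partial result as soon as the streamer receives it, and the gate guarantees that a failing request still ends with a well-formed chunk.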

Import

# Standalone server script; run it as a module rather than importing it:
# python -m llava.serve.model_worker --controller-address http://localhost:21001 --port 21002 --model-path <path>

I/O Contract

Inputs

Name | Type | Required | Description
--model-path | str | Yes | Path to the LLaVA model
--controller-address | str | No | Controller URL (default: http://localhost:21001)
--worker-address | str | No | This worker's URL (default: http://localhost:21002)
--port | int | No | Server port (default: 21002)
--limit-model-concurrency | int | No | Max concurrent requests (default: 5)
--load-8bit | flag | No | Enable 8-bit quantization
--load-4bit | flag | No | Enable 4-bit quantization
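
The flags listed above map naturally onto an argparse parser. The sketch below reproduces only the names, types, and defaults from the table; build_parser is a hypothetical helper, and the real script may define additional options or different help text.

```python
import argparse


def build_parser():
    # Flags, types, and defaults copied from the inputs table.
    parser = argparse.ArgumentParser(description="Sketch of the model worker CLI")
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--controller-address", type=str,
                        default="http://localhost:21001")
    parser.add_argument("--worker-address", type=str,
                        default="http://localhost:21002")
    parser.add_argument("--port", type=int, default=21002)
    parser.add_argument("--limit-model-concurrency", type=int, default=5)
    parser.add_argument("--load-8bit", action="store_true")
    parser.add_argument("--load-4bit", action="store_true")
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(
    ["--model-path", "liuhaotian/llava-v1.5-7b", "--load-4bit"]
)
```

argparse converts dashes to underscores, so the values are read as args.model_path, args.limit_model_concurrency, args.load_4bit, and so on.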

Outputs

Name | Type | Description
/worker_generate_stream | StreamingResponse | Streaming JSON chunks with generated text and error codes
/worker_get_status | JSON | Worker status including model names, speed, and queue length
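
A client has to reassemble the streamed JSON chunks from raw bytes. The sketch below shows one way to do that, under assumptions this page does not confirm: chunks are separated by a null byte, each chunk carries the cumulative text under "text", and a nonzero "error_code" signals failure. iter_chunks and final_text are hypothetical helper names.

```python
import json

DELIMITER = b"\0"  # assumption: chunk separator, not stated on this page


def iter_chunks(byte_stream):
    """Yield decoded JSON objects from an iterable of raw byte fragments."""
    buffer = b""
    for fragment in byte_stream:
        buffer += fragment
        # A fragment may hold several chunks or only part of one.
        while DELIMITER in buffer:
            raw, buffer = buffer.split(DELIMITER, 1)
            if raw:
                yield json.loads(raw.decode("utf-8"))


def final_text(byte_stream):
    """Return the last cumulative text, raising on a nonzero error code."""
    text = ""
    for chunk in iter_chunks(byte_stream):
        if chunk.get("error_code", 0) != 0:
            raise RuntimeError(chunk.get("text", "unknown worker error"))
        text = chunk["text"]
    return text


# Two network fragments that split the second chunk in the middle.
fragments = [
    b'{"text": "A cat", "error_code": 0}\0{"text": "A cat on',
    b' a mat", "error_code": 0}\0',
]
result = final_text(fragments)
```

Buffering until a delimiter arrives keeps the parser correct even when the transport splits a chunk across reads.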

Usage Examples

Basic Usage

# Start a model worker:
# python -m llava.serve.model_worker \
#     --controller-address http://localhost:21001 \
#     --model-path liuhaotian/llava-v1.5-7b \
#     --port 21002

# The worker auto-registers with the controller and starts serving
# Inference requests arrive as JSON with "prompt" and optional "images" (base64)
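
The request shape mentioned above, a JSON body with "prompt" and optional base64 "images", can be sketched as follows. build_request and decode_images are hypothetical helpers; the image bytes are a placeholder rather than a valid picture, and any other request fields or the placement of image tokens in the prompt are left out because this page does not specify them.

```python
import base64
import json


def build_request(prompt, image_bytes_list=()):
    """Client side: serialize a prompt and optional raw image bytes to JSON."""
    payload = {"prompt": prompt}
    if image_bytes_list:
        payload["images"] = [
            base64.b64encode(data).decode("ascii") for data in image_bytes_list
        ]
    return json.dumps(payload)


def decode_images(request_json):
    """Worker side: recover the raw image bytes from the JSON body."""
    params = json.loads(request_json)
    return [base64.b64decode(item) for item in params.get("images", [])]


# Placeholder bytes standing in for an image file's contents.
fake_image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
body = build_request("Describe the picture.", [fake_image])
decoded = decode_images(body)
text_only = json.loads(build_request("Hello"))
```

In a real deployment the decoded bytes would be opened as images before being handed to process_images(), and the body would be posted to /worker_generate_stream.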

Related Pages
