Principle: Haotian Liu LLaVA Model Worker Inference

From Leeroopedia

Overview

Server pattern for hosting a loaded model and providing streaming inference via HTTP endpoints.

Description

A model worker loads a LLaVA model into GPU memory and serves inference requests via FastAPI. This pattern encapsulates the full lifecycle of a single model serving instance within the distributed LLaVA architecture.

Key characteristics:

  • Model loading -- The worker loads a LLaVA model (including vision tower and projector) into GPU memory at startup using load_pretrained_model().
  • Streaming text generation -- Uses TextIteratorStreamer for real-time token delivery. Generation runs in a separate thread, writing tokens to a queue that the HTTP handler reads and yields as server-sent events.
  • Auto-registration -- On startup, the worker registers itself with the controller, providing its address, model name, and speed metadata.
  • Heartbeat maintenance -- A background thread sends heartbeats to the controller every 30 seconds to maintain registration.
  • Concurrency management -- An asyncio semaphore limits concurrent requests (default: 5 concurrent requests) to prevent GPU memory exhaustion.
  • Multimodal input handling -- The worker detects and processes base64-encoded images from incoming requests, converting them to tensors for the model.
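The concurrency and multimodal-input points above can be sketched together: a semaphore caps how many generation calls run at once, and incoming base64 images are decoded before generation. This is a minimal, dependency-free sketch, not the worker's actual code; `generate_gate` and the `tracker` dict are illustrative stand-ins for the real request handler and GPU work.

```python
import asyncio
import base64

def decode_images(params):
    """Decode base64-encoded images from a request payload, as the worker
    does before converting them to model tensors (here we stop at bytes)."""
    return [base64.b64decode(s) for s in params.get("images", [])]

async def generate_gate(semaphore, params, tracker):
    """Gate a (simulated) generation call behind the concurrency semaphore."""
    async with semaphore:
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        images = decode_images(params)     # multimodal input handling
        await asyncio.sleep(0.01)          # stands in for GPU generation
        tracker["active"] -= 1
        return len(images)

async def main(n_requests=12, limit=5):
    semaphore = asyncio.Semaphore(limit)   # worker default: 5 concurrent
    tracker = {"active": 0, "peak": 0}
    payload = {"images": [base64.b64encode(b"fake-image-bytes").decode()]}
    results = await asyncio.gather(
        *[generate_gate(semaphore, payload, tracker) for _ in range(n_requests)])
    return results, tracker["peak"]

results, peak = asyncio.run(main())
```

Even with 12 requests in flight, the semaphore keeps at most 5 generation calls active at a time, which is the backpressure behavior described above.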

Usage

Deploy one or more model workers behind a controller. Each worker hosts one model. Multiple workers can serve the same model (for replicas) or different models.
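Each deployed worker keeps its controller registration alive with the 30-second heartbeat described above. A minimal sketch of that background loop follows; `send_heartbeat` is a hypothetical stand-in for the HTTP POST the real worker makes to the controller.

```python
import threading
import time

HEARTBEAT_INTERVAL = 30  # seconds; matches the default described above

def start_heartbeat(send_heartbeat, interval=HEARTBEAT_INTERVAL):
    """Start a daemon thread that calls send_heartbeat every `interval`
    seconds until the returned Event is set."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            send_heartbeat()      # stand-in for the POST to the controller
            stop.wait(interval)   # wakes early if stop is set

    threading.Thread(target=loop, daemon=True).start()
    return stop

# Demo with a fast interval: count beats for a short window.
beats = []
stop = start_heartbeat(lambda: beats.append(1), interval=0.01)
time.sleep(0.06)
stop.set()
```

Using `Event.wait` for the sleep lets the loop exit promptly on shutdown instead of blocking for a full interval.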

  • Use --load-8bit or --load-4bit for quantized inference on smaller GPUs.
  • Use --model-base for LoRA adapter models.
  • Use --use-flash-attn to enable Flash Attention 2 for faster inference.
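Putting the flags together, a worker launch might look like the following. The module path, ports, controller address, and model name are illustrative assumptions, not a verified command line:

```shell
# Illustrative launch; host, ports, and model path are assumptions.
python -m llava.serve.model_worker \
    --host 0.0.0.0 \
    --controller http://localhost:10000 \
    --port 40000 \
    --worker http://localhost:40000 \
    --model-path liuhaotian/llava-v1.5-7b \
    --load-4bit   # quantized inference for smaller GPUs
```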

Theoretical Basis

Streaming inference uses a separate generation thread that writes tokens to a TextIteratorStreamer queue. The HTTP handler reads from this queue and yields server-sent events. This architecture provides:

  • Low first-token latency -- The user sees the first token as soon as it is generated, rather than waiting for the full response.
  • Non-blocking serving -- The asyncio event loop remains free to accept new requests while generation proceeds in a background thread.
  • Backpressure handling -- The semaphore prevents overwhelming the GPU with too many concurrent generation tasks.
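The producer-consumer pattern above can be sketched without the model: a generation thread writes tokens to a queue while the handler drains it and yields server-sent events. `TokenStreamer` below is a minimal stand-in for transformers' `TextIteratorStreamer`, and `fake_generate` stands in for the model's generation loop; neither is the worker's actual code.

```python
import json
from queue import Queue
from threading import Thread

class TokenStreamer:
    """Minimal stand-in for TextIteratorStreamer: the generation thread
    put()s tokens; the HTTP handler iterates until the end sentinel."""
    _END = object()

    def __init__(self):
        self.queue = Queue()

    def put(self, token):
        self.queue.put(token)

    def end(self):
        self.queue.put(self._END)

    def __iter__(self):
        while True:
            item = self.queue.get()
            if item is self._END:
                return
            yield item

def fake_generate(streamer, tokens):
    """Stands in for model.generate(streamer=...) in a worker thread."""
    for t in tokens:
        streamer.put(t)
    streamer.end()

def stream_sse(tokens):
    """Run generation in a background thread; yield server-sent events."""
    streamer = TokenStreamer()
    thread = Thread(target=fake_generate, args=(streamer, tokens))
    thread.start()
    text = ""
    for token in streamer:
        text += token
        yield f"data: {json.dumps({'text': text})}\n\n"
    thread.join()

events = list(stream_sse(["Hel", "lo", "!"]))
```

Because the handler yields after each token arrives on the queue, the first event goes out as soon as the first token is generated, while the generating thread keeps running in the background.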

Metadata

Knowledge Sources: Repo - LLaVA - https://github.com/haotian-liu/LLaVA
Domains: Model_Serving, Streaming_Inference
Last Updated: 2026-02-13 14:00 GMT
