Principle: Haotian Liu LLaVA Model Worker Inference
Overview
Server pattern for hosting a loaded model and providing streaming inference via HTTP endpoints.
Description
A model worker loads a LLaVA model into GPU memory and serves inference requests via FastAPI. This pattern encapsulates the full lifecycle of a single model serving instance within the distributed LLaVA architecture.
Key characteristics:
- Model loading -- The worker loads a LLaVA model (including vision tower and projector) into GPU memory at startup using `load_pretrained_model()`.
- Streaming text generation -- Uses `TextIteratorStreamer` for real-time token delivery. Generation runs in a separate thread, writing tokens to a queue that the HTTP handler reads and yields as server-sent events.
- Auto-registration -- On startup, the worker registers itself with the controller, providing its address, model name, and speed metadata.
- Heartbeat maintenance -- A background thread sends heartbeats to the controller every 30 seconds to maintain registration.
- Concurrency management -- An asyncio semaphore limits concurrent requests (default: 5 concurrent requests) to prevent GPU memory exhaustion.
- Multimodal input handling -- The worker detects and processes base64-encoded images from incoming requests, converting them to tensors for the model.
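The streaming path described above can be sketched with standard-library primitives. Here `fake_generate` stands in for the model's `generate()` call and a plain `queue.Queue` stands in for transformers' `TextIteratorStreamer`; the names and the `b"\0"` chunk delimiter are illustrative of the pattern, not a copy of the LLaVA source:

```python
import json
import queue
import threading

def fake_generate(prompt, token_queue):
    # Stand-in for model.generate(streamer=...): in the real worker,
    # generation runs in a thread and the streamer pushes decoded tokens.
    for token in prompt.split():
        token_queue.put(token + " ")
    token_queue.put(None)  # sentinel: generation finished

def stream_response(prompt):
    # HTTP-handler side: read tokens off the queue as they arrive and
    # yield incremental server-sent-event-style chunks.
    token_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(prompt, token_queue))
    worker.start()
    text = ""
    while True:
        token = token_queue.get()
        if token is None:
            break
        text += token
        yield json.dumps({"text": text}).encode() + b"\0"
    worker.join()

chunks = list(stream_response("a b c"))
```

In the actual worker the generator above would be wrapped in a FastAPI `StreamingResponse`, so the event loop stays free while the generation thread fills the queue.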
Usage
Deploy one or more model workers behind a controller. Each worker hosts one model. Multiple workers can serve the same model (for replicas) or different models.
- Use `--load-8bit` or `--load-4bit` for quantized inference on smaller GPUs.
- Use `--model-base` for LoRA adapter models.
- Use `--use-flash-attn` to enable Flash Attention 2 for faster inference.
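A minimal sketch of how these flags might be parsed and forwarded as keyword arguments to `load_pretrained_model()`; the parser below is a simplified illustration (the real worker defines more arguments), and the default model path is an assumption:

```python
import argparse

# Simplified sketch of the worker's CLI flags (not the full argument list).
parser = argparse.ArgumentParser()
parser.add_argument("--model-path", default="liuhaotian/llava-v1.5-7b")
parser.add_argument("--model-base", default=None)        # set for LoRA adapters
parser.add_argument("--load-8bit", action="store_true")  # 8-bit quantization
parser.add_argument("--load-4bit", action="store_true")  # 4-bit quantization
parser.add_argument("--use-flash-attn", action="store_true")

args = parser.parse_args(["--load-4bit"])

# These booleans would be passed through when loading, roughly as:
#   load_pretrained_model(args.model_path, args.model_base, model_name,
#                         load_8bit=args.load_8bit, load_4bit=args.load_4bit, ...)
kwargs = {"load_8bit": args.load_8bit, "load_4bit": args.load_4bit}
```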
Theoretical Basis
Streaming inference uses a separate generation thread that writes tokens to a TextIteratorStreamer queue. The HTTP handler reads from this queue and yields server-sent events. This architecture provides:
- Low-latency first-token delivery -- The user sees the first token as soon as it is generated, rather than waiting for the full response.
- Non-blocking serving -- The asyncio event loop remains free to accept new requests while generation proceeds in a background thread.
- Backpressure handling -- The semaphore prevents overwhelming the GPU with too many concurrent generation tasks.
Metadata
| Field | Value |
|---|---|
| Knowledge Sources | Repo - LLaVA - https://github.com/haotian-liu/LLaVA |
| Domains | Model_Serving, Streaming_Inference |
| Last Updated | 2026-02-13 14:00 GMT |