Principle: Haotian Liu LLaVA Model Worker Inference

From Leeroopedia

Overview

Server pattern for hosting a loaded model and providing streaming inference via HTTP endpoints.

Description

A model worker loads a LLaVA model into GPU memory and serves inference requests via FastAPI. This pattern encapsulates the full lifecycle of a single model serving instance within the distributed LLaVA architecture.

Key characteristics:

  • Model loading -- The worker loads a LLaVA model (including vision tower and projector) into GPU memory at startup using load_pretrained_model().
  • Streaming text generation -- Uses TextIteratorStreamer for real-time token delivery. Generation runs in a separate thread, writing tokens to a queue that the HTTP handler reads and yields as server-sent events.
  • Auto-registration -- On startup, the worker registers itself with the controller, providing its address, model name, and speed metadata.
  • Heartbeat maintenance -- A background thread sends heartbeats to the controller every 30 seconds to maintain registration.
  • Concurrency management -- An asyncio semaphore limits concurrent requests (default: 5 concurrent requests) to prevent GPU memory exhaustion.
  • Multimodal input handling -- The worker detects and processes base64-encoded images from incoming requests, converting them to tensors for the model.
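The concurrency and multimodal-input points above can be sketched together: a semaphore caps how many generation calls run at once, and incoming base64 images are decoded before generation. This is a minimal, dependency-free sketch, not the worker's actual code; `generate_gate` and the `tracker` dict are illustrative stand-ins for the real request handler and GPU work.

```python
import asyncio
import base64

def decode_images(params):
    """Decode base64-encoded images from a request payload, as the worker
    does before converting them to model tensors (here we stop at bytes)."""
    return [base64.b64decode(s) for s in params.get("images", [])]

async def generate_gate(semaphore, params, tracker):
    """Gate a (simulated) generation call behind the concurrency semaphore."""
    async with semaphore:
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        images = decode_images(params)     # multimodal input handling
        await asyncio.sleep(0.01)          # stands in for GPU generation
        tracker["active"] -= 1
        return len(images)

async def main(n_requests=12, limit=5):
    semaphore = asyncio.Semaphore(limit)   # worker default: 5 concurrent
    tracker = {"active": 0, "peak": 0}
    payload = {"images": [base64.b64encode(b"fake-image-bytes").decode()]}
    results = await asyncio.gather(
        *[generate_gate(semaphore, payload, tracker) for _ in range(n_requests)])
    return results, tracker["peak"]

results, peak = asyncio.run(main())
```

Even with 12 requests in flight, the semaphore keeps at most 5 generation calls active at a time, which is the backpressure behavior described above.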

Usage

Deploy one or more model workers behind a controller. Each worker hosts one model. Multiple workers can serve the same model (for replicas) or different models.
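Each deployed worker keeps its controller registration alive with the 30-second heartbeat described above. A minimal sketch of that background loop follows; `send_heartbeat` is a hypothetical stand-in for the HTTP POST the real worker makes to the controller.

```python
import threading
import time

HEARTBEAT_INTERVAL = 30  # seconds; matches the default described above

def start_heartbeat(send_heartbeat, interval=HEARTBEAT_INTERVAL):
    """Start a daemon thread that calls send_heartbeat every `interval`
    seconds until the returned Event is set."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            send_heartbeat()      # stand-in for the POST to the controller
            stop.wait(interval)   # wakes early if stop is set

    threading.Thread(target=loop, daemon=True).start()
    return stop

# Demo with a fast interval: count beats for a short window.
beats = []
stop = start_heartbeat(lambda: beats.append(1), interval=0.01)
time.sleep(0.06)
stop.set()
```

Using `Event.wait` for the sleep lets the loop exit promptly on shutdown instead of blocking for a full interval.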

  • Use --load-8bit or --load-4bit for quantized inference on smaller GPUs.
  • Use --model-base for LoRA adapter models.
  • Use --use-flash-attn to enable Flash Attention 2 for faster inference.
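Putting the flags together, a worker launch might look like the following. The module path, ports, controller address, and model name are illustrative assumptions, not a verified command line:

```shell
# Illustrative launch; host, ports, and model path are assumptions.
python -m llava.serve.model_worker \
    --host 0.0.0.0 \
    --controller http://localhost:10000 \
    --port 40000 \
    --worker http://localhost:40000 \
    --model-path liuhaotian/llava-v1.5-7b \
    --load-4bit   # quantized inference for smaller GPUs
```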

Theoretical Basis

Streaming inference uses a separate generation thread that writes tokens to a TextIteratorStreamer queue. The HTTP handler reads from this queue and yields server-sent events. This architecture provides:

  • Low first-token latency -- The user sees the first token as soon as it is generated, rather than waiting for the full response.
  • Non-blocking serving -- The asyncio event loop remains free to accept new requests while generation proceeds in a background thread.
  • Backpressure handling -- The semaphore prevents overwhelming the GPU with too many concurrent generation tasks.
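The producer-consumer pattern above can be sketched without the model: a generation thread writes tokens to a queue while the handler drains it and yields server-sent events. `TokenStreamer` below is a minimal stand-in for transformers' `TextIteratorStreamer`, and `fake_generate` stands in for the model's generation loop; neither is the worker's actual code.

```python
import json
from queue import Queue
from threading import Thread

class TokenStreamer:
    """Minimal stand-in for TextIteratorStreamer: the generation thread
    put()s tokens; the HTTP handler iterates until the end sentinel."""
    _END = object()

    def __init__(self):
        self.queue = Queue()

    def put(self, token):
        self.queue.put(token)

    def end(self):
        self.queue.put(self._END)

    def __iter__(self):
        while True:
            item = self.queue.get()
            if item is self._END:
                return
            yield item

def fake_generate(streamer, tokens):
    """Stands in for model.generate(streamer=...) in a worker thread."""
    for t in tokens:
        streamer.put(t)
    streamer.end()

def stream_sse(tokens):
    """Run generation in a background thread; yield server-sent events."""
    streamer = TokenStreamer()
    thread = Thread(target=fake_generate, args=(streamer, tokens))
    thread.start()
    text = ""
    for token in streamer:
        text += token
        yield f"data: {json.dumps({'text': text})}\n\n"
    thread.join()

events = list(stream_sse(["Hel", "lo", "!"]))
```

Because the handler yields after each token arrives on the queue, the first event goes out as soon as the first token is generated, while the generating thread keeps running in the background.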

Metadata

Knowledge Sources: Repo - LLaVA - https://github.com/haotian-liu/LLaVA
Domains: Model_Serving, Streaming_Inference
Last Updated: 2026-02-13 14:00 GMT
