Principle:Haotian liu LLaVA Cloud Hosted Inference
| Knowledge Sources | |
|---|---|
| Domains | Inference, Deployment, Vision_Language |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Architecture pattern that wraps a multimodal model as a cloud-hosted prediction endpoint with streaming output, enabling API-based inference without local GPU setup.
Description
Cloud hosted inference is a deployment pattern where a multimodal model is packaged as a containerized prediction service. The pattern involves three phases: (1) weight provisioning: downloading model weights from a CDN on container startup and caching them locally to keep cold-start times short; (2) model initialization: loading the model into GPU memory once during setup; and (3) streaming prediction: processing image+text inputs through the model pipeline and yielding generated tokens incrementally via a streaming iterator. Consumers therefore receive partial results in real time rather than waiting for the full generation to complete.
In the LLaVA context, this pattern wraps the standard inference pipeline (image preprocessing, conversation templating, tokenization, multimodal generation) behind a platform-specific predictor interface.
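The weight provisioning phase described above can be sketched as a download-if-missing helper. This is a minimal illustration, not the project's actual code: `ensure_weights`, the `fetch` callback, and the `.partial` suffix are hypothetical names, and the real download mechanism (CDN URL, tool, archive format) is not specified here.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(cache_dir: Path, filename: str,
                   fetch: Callable[[Path], None]) -> Path:
    """Return the cached weight file, calling fetch only when it is missing.

    fetch is a hypothetical callback that writes the weights to the path
    it receives (for example, by downloading from a CDN).
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if not target.exists():
        # Write to a temporary name first so an interrupted transfer is
        # never mistaken for a complete cached file.
        partial = target.with_name(target.name + ".partial")
        fetch(partial)
        partial.replace(target)
    return target


# Usage with a stand-in fetch function (no network access involved).
fetch_calls = []

def fake_fetch(dest: Path) -> None:
    fetch_calls.append(dest)
    dest.write_bytes(b"fake-weights")

with tempfile.TemporaryDirectory() as tmp:
    ensure_weights(Path(tmp), "model.bin", fake_fetch)  # cache miss: fetches
    ensure_weights(Path(tmp), "model.bin", fake_fetch)  # cache hit: no fetch
    print(len(fetch_calls))
```

Passing the download step in as a callback keeps the caching logic independent of the hosting platform and makes it testable without a network.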
Usage
Use this principle when deploying LLaVA (or similar multimodal models) as a hosted API service. This is the preferred approach for making the model accessible to users who do not have local GPU resources. It is complementary to the local inference options (CLI chat, Gradio web demo) covered by other Principles.
Theoretical Basis
The cloud inference pattern follows a setup-once, predict-many architecture:
Pseudo-code Logic:
    # Abstract cloud inference pattern (NOT real implementation)
    class CloudPredictor:
        def setup(self):
            # Phase 1: Download weights from CDN (one-time)
            download_model_weights(cdn_url, local_cache)
            # Phase 2: Load model into GPU memory (one-time)
            self.model = load_model(local_cache)

        def predict(self, image, prompt, params) -> StreamingIterator:
            # Phase 3: Per-request inference (streaming)
            image_features = encode_image(image)
            input_tokens = build_prompt(prompt, image_features)
            for token in self.model.generate_stream(input_tokens, **params):
                yield token
The streaming component uses a background thread for generation coupled with a thread-safe iterator to yield tokens as they are produced.
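The thread-plus-iterator arrangement can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in, not the LLaVA implementation: `stream_from_thread` and `fake_generate` are hypothetical names, and a real deployment would run the model's generation call in the worker instead of a toy generator. Hugging Face `transformers` provides a `TextIteratorStreamer` class that serves a similar purpose.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator

_DONE = object()  # sentinel marking the end of generation


def stream_from_thread(generate: Callable[[], Iterable[str]]) -> Iterator[str]:
    """Run generate() in a background thread and yield items as they arrive."""
    buffer: queue.Queue = queue.Queue()  # thread-safe hand-off between threads

    def worker() -> None:
        try:
            for token in generate():
                buffer.put(token)
        except Exception as exc:
            # Forward failures so the consumer sees them instead of hanging.
            buffer.put(exc)
        finally:
            buffer.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()  # blocks until the worker produces something
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


def fake_generate() -> Iterator[str]:
    # Stand-in for a model's token-by-token generation.
    for piece in ["A", " cat", " on", " a", " mat"]:
        yield piece


for chunk in stream_from_thread(fake_generate):
    print(chunk, end="", flush=True)
print()
```

The queue decouples the producer from the consumer: generation proceeds at its own pace while the caller forwards each piece to the client as soon as it is available.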