Principle:Haotian liu LLaVA Cloud Hosted Inference

Knowledge Sources	Haotian_liu_LLaVA Cog Documentation
Domains	Inference, Deployment, Vision_Language
Last Updated	2026-02-14 00:00 GMT

Overview

An architecture pattern that wraps a multimodal model as a cloud-hosted prediction endpoint with streaming output, enabling API-based inference without a local GPU setup.

Description

Cloud-hosted inference is a deployment pattern in which a multimodal model is packaged as a containerized prediction service. The pattern has three phases: (1) weight provisioning, which downloads model weights from a CDN at container startup and caches them locally to keep cold-start times short; (2) model initialization, which loads the model into GPU memory once during setup; and (3) streaming prediction, which runs image+text inputs through the model pipeline and yields generated tokens incrementally via a streaming iterator. Consumers therefore receive partial results in real time instead of waiting for the full generation to complete.
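The weight-provisioning phase can be sketched with the standard library alone. This is a minimal illustration, not the actual LLaVA setup code: the function name `provision_weights`, the `fetch` callback, and the `.part` temporary-file convention are all assumptions made for the example. It shows the cache check that lets a warm container skip the download.

```python
import os
from pathlib import Path
from typing import Callable, List


def provision_weights(
    cache_dir: Path,
    filenames: List[str],
    fetch: Callable[[str, Path], None],
) -> List[str]:
    """Make sure every file exists in cache_dir; return the names that were fetched.

    `fetch(name, dest)` is a hypothetical callback that downloads `name`
    (for example from a CDN) and writes it to `dest`.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in filenames:
        target = cache_dir / name
        if target.exists():
            continue  # cache hit: nothing to download
        # Download to a temporary name first so an interrupted download
        # never leaves a truncated file that looks like a valid cache entry.
        tmp = target.with_name(target.name + ".part")
        fetch(name, tmp)
        os.replace(tmp, target)  # atomic rename on the same filesystem
        fetched.append(name)
    return fetched
```

Calling the function a second time with the same cache directory returns an empty list, which is the behavior that keeps repeated container starts cheap.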

In the LLaVA context, this pattern wraps the standard inference pipeline (image preprocessing, conversation templating, tokenization, multimodal generation) behind a platform-specific predictor interface.
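As a rough illustration of how the four pipeline stages sit behind a single predictor method, the sketch below composes stand-in functions for each stage. Every name here (`Request`, `preprocess_image`, `build_conversation`, `tokenize`, `generate`, `IMAGE_PLACEHOLDER`) is hypothetical, and the bodies are toy stand-ins, not LLaVA's real preprocessing, template, or tokenizer.

```python
from dataclasses import dataclass
from typing import Iterator, List

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for where image features go


@dataclass
class Request:
    image_bytes: bytes
    prompt: str


def preprocess_image(image_bytes: bytes) -> List[float]:
    """Stand-in for resizing/normalizing: scales raw bytes to [0, 1]."""
    return [b / 255 for b in image_bytes]


def build_conversation(prompt: str) -> str:
    """Stand-in for conversation templating (the template is an assumption)."""
    return f"USER: {IMAGE_PLACEHOLDER}\n{prompt}\nASSISTANT:"


def tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: whitespace split instead of a real vocabulary."""
    return text.split()


def generate(tokens: List[str], pixels: List[float]) -> Iterator[str]:
    """Stand-in for multimodal generation; yields output piece by piece."""
    yield f"saw {len(pixels)} pixel values"
    yield f"and {len(tokens)} prompt tokens"


class Predictor:
    """Platform-facing wrapper: callers only ever see predict()."""

    def predict(self, request: Request) -> Iterator[str]:
        pixels = preprocess_image(request.image_bytes)
        text = build_conversation(request.prompt)
        tokens = tokenize(text)
        yield from generate(tokens, pixels)
```

The point of the wrapper is that the hosting platform depends only on the `predict` signature, so the stages behind it can change without affecting API consumers.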

Usage

Use this principle when deploying LLaVA (or similar multimodal models) as a hosted API service. This is the preferred approach for making the model accessible to users who do not have local GPU resources. It is complementary to the local inference options (CLI chat, Gradio web demo) covered by other Principles.

Theoretical Basis

The cloud inference pattern follows a setup-once, predict-many architecture:

Pseudo-code Logic:

# Abstract cloud inference pattern (NOT the real implementation).
# download_model_weights, load_model, encode_image, build_prompt,
# cdn_url, and local_cache are placeholders.

from typing import Iterator

class CloudPredictor:
    def setup(self) -> None:
        # Phase 1: Download weights from the CDN into a local cache (one-time)
        download_model_weights(cdn_url, local_cache)
        # Phase 2: Load the model into GPU memory (one-time) and keep it on
        # the instance so that predict() can reuse it
        self.model = load_model(local_cache)

    def predict(self, image, prompt, params) -> Iterator[str]:
        # Phase 3: Per-request inference (streaming)
        image_features = encode_image(image)
        input_tokens = build_prompt(prompt, image_features)
        for token in self.model.generate_stream(input_tokens, **params):
            yield token

The streaming component uses a background thread for generation coupled with a thread-safe iterator to yield tokens as they are produced.
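The thread-plus-iterator arrangement described above can be sketched with `queue.Queue` and `threading.Thread` from the standard library. This is an assumption-laden illustration rather than the project's code: `TokenStream`, `stream_generate`, and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for a model's generation call. Hugging Face transformers provides a comparable helper, `TextIteratorStreamer`, for real models.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, Optional

_DONE = object()  # sentinel that marks the end of generation


class TokenStream:
    """Thread-safe iterator: a producer thread puts tokens, a consumer iterates."""

    def __init__(self, timeout: Optional[float] = None):
        self._queue: "queue.Queue[object]" = queue.Queue()
        self._timeout = timeout  # max seconds to wait for the next token

    def put(self, token: str) -> None:
        self._queue.put(token)

    def end(self) -> None:
        self._queue.put(_DONE)

    def __iter__(self) -> "TokenStream":
        return self

    def __next__(self) -> str:
        # Blocks until a token arrives; raises queue.Empty on timeout.
        item = self._queue.get(timeout=self._timeout)
        if item is _DONE:
            raise StopIteration
        return item  # type: ignore[return-value]


def stream_generate(
    generate: Callable[[str], Iterable[str]], prompt: str
) -> Iterator[str]:
    """Run `generate` in a background thread and yield tokens as they arrive."""
    stream = TokenStream(timeout=20.0)

    def worker() -> None:
        try:
            for token in generate(prompt):
                stream.put(token)
        finally:
            stream.end()  # always unblock the consumer, even on error

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    yield from stream
    thread.join()


def fake_generate(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a model: emits one 'token' per word."""
    for word in prompt.split():
        yield word.upper()
```

A consumer simply iterates, for example `for token in stream_generate(fake_generate, "a cat on a mat"): ...`. Note that in this sketch an exception raised inside the worker ends the stream early but is not re-raised to the consumer; a production version would need to forward it.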

Related Pages

Implementation:Haotian_liu_LLaVA_Predictor_Class
