Principle:InternLM Lmdeploy Multimodal Inference

Knowledge Sources	VLM Pipeline LMDeploy
Domains	Vision_Language_Models, Multimodal_AI
Last Updated	2026-02-07 15:00 GMT

Overview

A multimodal inference pattern that combines image and text inputs for vision-language model generation with support for multi-turn conversations and multiple images.

Description

Multimodal Inference extends the standard text generation pipeline with vision capabilities:

Single-turn: Pass image-text tuples to Pipeline.__call__() for one-shot VLM inference
Multi-turn chat: Use Pipeline.chat() with session objects for multi-turn conversations with images
Multiple images: Pass lists of images in a single prompt for multi-image understanding
OpenAI format: Support for OpenAI-style messages with image_url content parts

The VLM pipeline auto-detects vision-language models during initialization and uses VLAsyncEngine instead of the standard AsyncEngine, enabling vision encoder preprocessing.

Usage

Use this when performing inference on vision-language models. Format prompts as tuples (text, image) or (text, [image1, image2]) for the callable interface, or use OpenAI-format messages with image_url content for the chat interface.

Theoretical Basis

Multimodal inference combines visual and textual representations:

# Abstract multimodal inference
def vlm_infer(text, images):
    # 1. Encode images through vision encoder
    visual_tokens = [vision_encoder(img) for img in images]

    # 2. Tokenize text and insert visual tokens
    text_tokens = tokenizer(text)
    combined = interleave(text_tokens, visual_tokens, positions)

    # 3. Run language model
    output = language_model.generate(combined)
    return output

Related Pages

Implemented By

Implementation:InternLM_Lmdeploy_Pipeline_Chat_VLM

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment