Principle:InternLM Lmdeploy Multimodal Inference
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language_Models, Multimodal_AI |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A multimodal inference pattern that combines image and text inputs for vision-language model generation with support for multi-turn conversations and multiple images.
Description
Multimodal Inference extends the standard text generation pipeline with vision capabilities:
- Single-turn: Pass image-text tuples to Pipeline.__call__() for one-shot VLM inference
- Multi-turn chat: Use Pipeline.chat() with session objects for multi-turn conversations with images
- Multiple images: Pass lists of images in a single prompt for multi-image understanding
- OpenAI format: Support for OpenAI-style messages with image_url content parts
The VLM pipeline auto-detects vision-language models during initialization and uses VLAsyncEngine instead of the standard AsyncEngine, enabling vision encoder preprocessing.
Usage
Use this when performing inference on vision-language models. Format prompts as tuples (text, image) or (text, [image1, image2]) for the callable interface, or use OpenAI-format messages with image_url content for the chat interface.
Theoretical Basis
Multimodal inference combines visual and textual representations:
# Abstract multimodal inference
def vlm_infer(text, images):
# 1. Encode images through vision encoder
visual_tokens = [vision_encoder(img) for img in images]
# 2. Tokenize text and insert visual tokens
text_tokens = tokenizer(text)
combined = interleave(text_tokens, visual_tokens, positions)
# 3. Run language model
output = language_model.generate(combined)
return output