Workflow:InternLM Lmdeploy VLM Inference Pipeline

Knowledge Sources	LMDeploy LMDeploy Docs VLM Pipeline Guide
Domains	LLM_Ops, Inference, Multi_Modal
Last Updated	2026-02-07 15:00 GMT

Overview

End-to-end process for performing inference on Vision-Language Models (VLMs) using the LMDeploy pipeline API to process combined image and text inputs.

Description

This workflow covers loading a pre-trained Vision-Language Model and generating text responses to prompts that include both images and text. LMDeploy supports 20+ VLM families including InternVL, Qwen-VL, LLaVA, DeepSeek-VL, CogVLM, Phi-3-Vision, and Gemma3. The pipeline handles image preprocessing (resizing, normalization, tiling), vision encoder execution, visual token injection into the language model context, and autoregressive text generation. It supports single images, multiple images, batch prompts, multi-turn conversations, and OpenAI-format message structures.

Usage

Execute this workflow when you need to analyze images with natural language queries, such as image description, visual question answering, document understanding, or multi-modal reasoning tasks. The VLM pipeline extends the LLM pipeline with image processing capabilities while maintaining the same interface patterns. Requires a GPU with sufficient VRAM to hold both the vision encoder and language model (typically 16-48GB depending on model size).

Execution Steps

Step 1: Environment Setup

Install LMDeploy and the VLM-specific dependencies. Each upstream VLM model may require additional packages (e.g., transformers, timm, or model-specific preprocessing libraries). LMDeploy does not bundle these dependencies, so install them as prompted by ImportError messages when loading specific models.

Key considerations:

Install lmdeploy via pip as the base package
VLM dependencies vary per model family; install as needed
For PyTorch backend VLMs, install triton>=2.1.0

Step 2: Model Selection and Configuration

Choose the target VLM by specifying a HuggingFace model ID (e.g., OpenGVLab/InternVL2_5-8B) or local path. Configure the backend engine with appropriate session length (VLMs consume more tokens due to image tokens) and memory settings. Optionally configure VisionConfig to control image processing batch size.

Key considerations:

VLMs require larger session_len than text-only models due to image token overhead
Increasing VisionConfig.max_batch_size risks OOM because the LLM pre-allocates KV cache
Select backend based on the supported-models matrix for VLMs
Some VLMs are only supported on the PyTorch backend

Step 3: Pipeline Initialization

Create the VLM inference pipeline using the same pipeline factory function as LLM inference. The pipeline detects the VLM architecture, loads both the vision encoder and language model, initializes the multimodal processor, and sets up the chat template with image token placement rules.

What happens:

Model architecture is detected as a VLM variant
Vision encoder weights are loaded alongside the language model
Multimodal processor is initialized for image preprocessing
Chat template is configured with image token insertion rules
KV cache is pre-allocated accounting for vision token budget

Step 4: Image Loading and Prompt Preparation

Load images from URLs, local file paths, or PIL Image objects using the load_image utility. Construct prompts as tuples of (text, image) or (text, [images]) for multi-image scenarios. Alternatively, use OpenAI-format message dictionaries with image_url content blocks. For batch inference, collect multiple image-text prompts into a list.

Key considerations:

Use lmdeploy.vl.load_image() to load images from URLs or paths
Multi-image prompts pass a list of images as the second tuple element
OpenAI format uses type='image_url' content blocks
Custom image token placement is available via the IMAGE_TOKEN constant for models that support it

Step 5: Inference Execution

Invoke the pipeline with the prepared prompts and optional GenerationConfig. The pipeline preprocesses images through the vision encoder, generates visual tokens, injects them into the language model context at the appropriate positions (determined by the chat template), then runs autoregressive text generation. For multi-turn conversations, use pipe.chat() to maintain session state.

What happens:

Images are resized, normalized, and encoded by the vision model
Visual features are projected into the language model's embedding space
Image tokens are inserted at template-defined positions in the prompt
The language model generates text conditioned on both visual and textual context
Multi-turn sessions preserve conversation history and KV cache state

Step 6: Result Processing and Cleanup

Extract generated text from Response objects. For streaming output, iterate over pipe.stream_infer() results. Release GPU resources (both vision encoder and language model memory) by closing the pipeline. Use torch.cuda.empty_cache() and gc.collect() for thorough memory cleanup when needed.

Key considerations:

Response format is identical to LLM pipeline (text, token counts, finish_reason)
Logits output is supported via GenerationConfig.output_logits
Use context manager (with statement) for automatic cleanup
Clear torch cache explicitly if reusing GPU for another model

Execution Diagram

GitHub URL

Workflow Repository