Workflow:InternLM Lmdeploy VLM Inference Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Inference, Multi_Modal |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
End-to-end process for performing inference on Vision-Language Models (VLMs) using the LMDeploy pipeline API to process combined image and text inputs.
Description
This workflow covers loading a pre-trained Vision-Language Model and generating text responses to prompts that include both images and text. LMDeploy supports 20+ VLM families including InternVL, Qwen-VL, LLaVA, DeepSeek-VL, CogVLM, Phi-3-Vision, and Gemma3. The pipeline handles image preprocessing (resizing, normalization, tiling), vision encoder execution, visual token injection into the language model context, and autoregressive text generation. It supports single images, multiple images, batch prompts, multi-turn conversations, and OpenAI-format message structures.
Usage
Execute this workflow when you need to analyze images with natural language queries, such as image description, visual question answering, document understanding, or multi-modal reasoning tasks. The VLM pipeline extends the LLM pipeline with image processing capabilities while maintaining the same interface patterns. Requires a GPU with sufficient VRAM to hold both the vision encoder and language model (typically 16-48GB depending on model size).
Execution Steps
Step 1: Environment Setup
Install LMDeploy and the VLM-specific dependencies. Each upstream VLM model may require additional packages (e.g., transformers, timm, or model-specific preprocessing libraries). LMDeploy does not bundle these dependencies, so install them as prompted by ImportError messages when loading specific models.
Key considerations:
- Install lmdeploy via pip as the base package
- VLM dependencies vary per model family; install as needed
- For PyTorch backend VLMs, install triton>=2.1.0
Step 2: Model Selection and Configuration
Choose the target VLM by specifying a HuggingFace model ID (e.g., OpenGVLab/InternVL2_5-8B) or local path. Configure the backend engine with appropriate session length (VLMs consume more tokens due to image tokens) and memory settings. Optionally configure VisionConfig to control image processing batch size.
Key considerations:
- VLMs require larger session_len than text-only models due to image token overhead
- Increasing VisionConfig.max_batch_size risks OOM because the LLM pre-allocates KV cache
- Select backend based on the supported-models matrix for VLMs
- Some VLMs are only supported on the PyTorch backend
Step 3: Pipeline Initialization
Create the VLM inference pipeline using the same pipeline factory function as LLM inference. The pipeline detects the VLM architecture, loads both the vision encoder and language model, initializes the multimodal processor, and sets up the chat template with image token placement rules.
What happens:
- Model architecture is detected as a VLM variant
- Vision encoder weights are loaded alongside the language model
- Multimodal processor is initialized for image preprocessing
- Chat template is configured with image token insertion rules
- KV cache is pre-allocated accounting for vision token budget
Step 4: Image Loading and Prompt Preparation
Load images from URLs, local file paths, or PIL Image objects using the load_image utility. Construct prompts as tuples of (text, image) or (text, [images]) for multi-image scenarios. Alternatively, use OpenAI-format message dictionaries with image_url content blocks. For batch inference, collect multiple image-text prompts into a list.
Key considerations:
- Use lmdeploy.vl.load_image() to load images from URLs or paths
- Multi-image prompts pass a list of images as the second tuple element
- OpenAI format uses type='image_url' content blocks
- Custom image token placement is available via the IMAGE_TOKEN constant for models that support it
Step 5: Inference Execution
Invoke the pipeline with the prepared prompts and optional GenerationConfig. The pipeline preprocesses images through the vision encoder, generates visual tokens, injects them into the language model context at the appropriate positions (determined by the chat template), then runs autoregressive text generation. For multi-turn conversations, use pipe.chat() to maintain session state.
What happens:
- Images are resized, normalized, and encoded by the vision model
- Visual features are projected into the language model's embedding space
- Image tokens are inserted at template-defined positions in the prompt
- The language model generates text conditioned on both visual and textual context
- Multi-turn sessions preserve conversation history and KV cache state
Step 6: Result Processing and Cleanup
Extract generated text from Response objects. For streaming output, iterate over pipe.stream_infer() results. Release GPU resources (both vision encoder and language model memory) by closing the pipeline. Use torch.cuda.empty_cache() and gc.collect() for thorough memory cleanup when needed.
Key considerations:
- Response format is identical to LLM pipeline (text, token counts, finish_reason)
- Logits output is supported via GenerationConfig.output_logits
- Use context manager (with statement) for automatic cleanup
- Clear torch cache explicitly if reusing GPU for another model