Principle:InternLM Lmdeploy VLM Configuration
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language_Models, Configuration |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A configuration pattern that parameterizes vision-language model (VLM) inference, including the image batch size of the vision encoder and thread safety for multimodal processing.
Description
VLM Configuration addresses the unique requirements of vision-language models (VLMs) that process both image and text inputs. Key considerations include:
- Image batch size: Controls how many images can be processed simultaneously in the vision encoder
- Thread safety: Required when the pipeline is used in multi-threaded environments (e.g., API servers)
- Session length: Must be larger than for text-only models to accommodate image token overhead (each image adds hundreds of tokens to the context)
The VisionConfig is used alongside TurbomindEngineConfig or PytorchEngineConfig to configure both the vision and language components.
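The session-length consideration above can be turned into a rough budget. The following is a minimal sketch, not part of LMDeploy: the helper name `required_session_len` is hypothetical, and the figure of 576 tokens per image is an assumed value for illustration, since the actual count depends on the model's vision encoder and image resolution.

```python
def required_session_len(num_images, tokens_per_image, prompt_tokens, max_new_tokens):
    """Lower bound on session_len: image tokens + text prompt + generated output."""
    counts = (num_images, tokens_per_image, prompt_tokens, max_new_tokens)
    if min(counts) < 0:
        raise ValueError("all counts must be non-negative")
    return num_images * tokens_per_image + prompt_tokens + max_new_tokens


# Illustrative numbers only: 576 tokens per image is an assumption.
budget = required_session_len(
    num_images=4, tokens_per_image=576, prompt_tokens=200, max_new_tokens=512
)
print(budget)  # 3016
```

A budget computed this way is only a floor; multi-turn conversations accumulate history, so leaving headroom above it is usually sensible.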
Usage
Use this when deploying vision-language models. Set session_len in the engine config large enough to hold the image tokens in addition to the text prompt and the generated output (typically 8192 or more). Set thread_safe=True in VisionConfig when serving VLMs in API server mode.
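A minimal sketch of how these settings might be wired together is shown here. The helper names `vlm_config_kwargs` and `build_vlm_pipeline` are hypothetical. The field name `max_batch_size` and the `backend_config` / `vision_config` keyword arguments of `pipeline` are assumptions based on LMDeploy's public Python API and should be checked against the installed version. The LMDeploy import is deferred so that the pure-Python helper can be exercised on its own.

```python
def vlm_config_kwargs(serving, max_images_per_batch=1, session_len=8192):
    """Return (engine_kwargs, vision_kwargs) for a VLM deployment."""
    if max_images_per_batch < 1:
        raise ValueError("max_images_per_batch must be at least 1")
    engine_kwargs = {"session_len": session_len}
    vision_kwargs = {
        # Assumed field name for the image batch size of the vision encoder.
        "max_batch_size": max_images_per_batch,
        # Enable when the pipeline is called from multiple threads.
        "thread_safe": serving,
    }
    return engine_kwargs, vision_kwargs


def build_vlm_pipeline(model_path, serving=False, **kwargs):
    # Imported lazily so the helper above works without LMDeploy installed.
    from lmdeploy import TurbomindEngineConfig, VisionConfig, pipeline

    engine_kwargs, vision_kwargs = vlm_config_kwargs(serving, **kwargs)
    return pipeline(
        model_path,
        backend_config=TurbomindEngineConfig(**engine_kwargs),
        vision_config=VisionConfig(**vision_kwargs),
    )
```

Swapping `TurbomindEngineConfig` for `PytorchEngineConfig` would follow the same shape, since only the engine side changes while the vision settings stay the same.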
Theoretical Basis
VLM processing follows a two-stage pipeline:
# Abstract VLM processing
image_tokens = vision_encoder(image) # Stage 1: extract visual features
text_tokens = tokenize(text_prompt)
combined = text_tokens[:insert_pos] + image_tokens + text_tokens[insert_pos:] # splice image tokens into the prompt
output = language_model(combined) # Stage 2: generate text
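The splice between the two stages can be made concrete with a toy example. This is a sketch under stated assumptions: `vision_encoder` and `tokenize` are stand-ins that emit placeholder integers rather than real features or vocabulary IDs, `IMAGE_TOKEN_ID` is a hypothetical placeholder value, and the language-model stage is omitted.

```python
IMAGE_TOKEN_ID = -1  # hypothetical placeholder ID for an image position


def vision_encoder(image, tokens_per_image=4):
    # Stand-in for Stage 1: a real encoder returns feature vectors, not IDs.
    return [IMAGE_TOKEN_ID] * tokens_per_image


def tokenize(text):
    # Toy tokenizer: one ID per whitespace-separated word, numbered from 1.
    return list(range(1, len(text.split()) + 1))


def build_input(text_prompt, image, insert_pos):
    """Insert the image tokens into the text tokens at insert_pos."""
    text_tokens = tokenize(text_prompt)
    if not 0 <= insert_pos <= len(text_tokens):
        raise ValueError("insert_pos is outside the prompt")
    image_tokens = vision_encoder(image)
    # List concatenation keeps the sequence flat, as the language model expects.
    return text_tokens[:insert_pos] + image_tokens + text_tokens[insert_pos:]


combined = build_input("describe this image please", image=None, insert_pos=2)
print(combined)  # [1, 2, -1, -1, -1, -1, 3, 4]
```

The combined length is the text length plus the image token count, which is why the image overhead has to be included when choosing a session length.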