Principle:OpenGVLab InternVL Query Based Vision Language Bridge
| Principle Name | Query_Based_Vision_Language_Bridge |
|---|---|
| Domains | Multimodal, Vision-Language Model, Cross-Attention |
| Last Updated | 2026-02-07 14:00 GMT |
Summary
Query-Based Vision-Language Bridge is the architectural pattern of using learnable query tokens, processed through a language model with interleaved cross-attention layers, to extract and compress visual features from a vision encoder. Instead of projecting vision encoder outputs directly through an MLP, this approach uses a set of trainable query embeddings that attend to visual features via cross-attention. The result is a fixed-size set of vision-informed representations that can be concatenated with text token embeddings for language model processing.
Motivation
Vision encoders like InternViT-6B produce long, high-dimensional feature sequences (e.g., hundreds of patch tokens, each with a large hidden dimension). Feeding every visual token directly to a language model is computationally expensive and can crowd text tokens out of the context window. The query-based bridge addresses this by using a small, fixed number of learnable queries (e.g., 96) that selectively attend to relevant visual information through cross-attention, producing a compact visual representation suited to the language model's input space.
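To make the compression concrete, the following minimal sketch compares the number of tokens a ViT-style encoder emits with a fixed query budget. The patch size (14), the image sizes (224 and 448), the presence of a class token, and the helper names are illustrative assumptions, not values taken from this page; only the query count of 96 comes from the text above.

```python
def num_patch_tokens(image_size: int, patch_size: int, cls_token: bool = True) -> int:
    """Number of tokens a ViT-style encoder emits for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + (1 if cls_token else 0)


def compression_ratio(num_visual_tokens: int, num_queries: int) -> float:
    """How many visual tokens each query has to summarize on average."""
    return num_visual_tokens / num_queries


NUM_QUERIES = 96  # example query count from the text above

# Assumed, illustrative encoder settings (hypothetical for this sketch).
tokens_224 = num_patch_tokens(224, 14)  # 16 * 16 + 1 = 257
tokens_448 = num_patch_tokens(448, 14)  # 32 * 32 + 1 = 1025

for size, tokens in ((224, tokens_224), (448, tokens_448)):
    ratio = compression_ratio(tokens, NUM_QUERIES)
    print(f"{size}px: {tokens} visual tokens -> {NUM_QUERIES} queries ({ratio:.2f}x)")
```

The point of the sketch is that the visual token count grows quadratically with image resolution, while the number of tokens handed to the language model stays fixed at the query count.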
Structure
The architecture consists of:
- Learnable query tokens: A fixed set of trainable embeddings (e.g., 96 tokens, each of dimension text_hidden_size) that serve as the initial queries.
- Vision encoder: A frozen or LoRA-adapted vision transformer (e.g., InternViT) that produces visual feature sequences from input images.
- Cross-attention layers: Interspersed within a modified language model (e.g., QLLaMA), these layers enable query tokens to attend to vision encoder outputs. Cross-attention is applied at regular intervals (controlled by cross_attention_frequency).
- Self-attention layers: Standard causal self-attention in the language model processes query tokens alongside text tokens.
- Attention pooling: Optional attention pooling blocks that further compress features for contrastive alignment.
- LoRA adapters: Optional low-rank adaptation on both the vision backbone and the query language model for parameter-efficient fine-tuning.
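The core of the structure above can be sketched without any deep learning framework. The example below is a simplified illustration, not the InternVL implementation: it uses single-head attention with no learned key/value projections, toy dimensions, and random vectors standing in for trained query embeddings and vision encoder outputs. The rule that cross-attention sits in layers whose index is a multiple of cross_attention_frequency is also an assumption; the actual placement rule may differ.

```python
import math
import random


def cross_attention_layers(num_layers: int, cross_attention_frequency: int) -> list[int]:
    """Layer indices that receive cross-attention (assumed rule: index % frequency == 0)."""
    if cross_attention_frequency < 1:
        raise ValueError("cross_attention_frequency must be >= 1")
    return [i for i in range(num_layers) if i % cross_attention_frequency == 0]


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: list[list[float]], visual: list[list[float]]) -> list[list[float]]:
    """Single-head scaled dot-product cross-attention.

    Queries come from the query tokens; keys and values both come from the
    visual features. Learned projections are omitted for brevity, so both
    inputs must share the same dimension here.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * vi for qi, vi in zip(q, v)) for v in visual]
        weights = softmax(scores)
        # Each output row is a weighted average of the visual feature rows.
        outputs.append([sum(w * v[d] for w, v in zip(weights, visual)) for d in range(dim)])
    return outputs


rng = random.Random(0)
NUM_QUERIES, NUM_PATCHES, DIM = 4, 10, 8  # toy sizes, not real model sizes

# Random stand-ins for trainable query embeddings and vision encoder outputs.
queries = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
visual = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

compressed = cross_attend(queries, visual)
print(len(compressed), len(compressed[0]))  # rows follow the query count, not the patch count
print(cross_attention_layers(8, 2))
```

Whatever the number of visual tokens, the output has one row per query, which is what gives the bridge its fixed-size output. In a real model, learned projections would map the vision hidden size to the query hidden size, and the attended queries would then pass through the self-attention and feed-forward sublayers of each block.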
Applicability
This principle applies when:
- Building vision-language models that need to compress visual features efficiently
- The vision encoder produces more tokens than desirable for the language model context
- Fine-grained cross-modal interaction is needed beyond simple projection
- Both contrastive (CLIP-style) and generative (captioning/VQA) capabilities are desired
- The InternVL architecture family is being used (InternVL-14B, InternVL-C, InternVL-G)
Limitations
- More complex than simple MLP projection bridges: the vision encoder, query tokens, and cross-attention layers must be coordinated within a single architecture
- Cross-attention layers add computational overhead compared to projection-only approaches
- The number of query tokens is a critical hyperparameter that sets the width of the information bottleneck: too few queries discard visual detail, while too many erode the compression benefit
- Requires specialized training procedures (e.g., separate LoRA configs for vision vs. language components)