Principle:OpenGVLab InternVL Query Based Vision Language Bridge

Principle Name	Query_Based_Vision_Language_Bridge
Domains	Multimodal, Vision-Language Model, Cross-Attention
Last Updated	2026-02-07 14:00 GMT

Summary

Query-Based Vision-Language Bridge is the architectural pattern of extracting and compressing visual features from a vision encoder with learnable query tokens, which are processed by a language model that has cross-attention layers interleaved among its standard layers. Instead of projecting vision encoder outputs directly through an MLP, this approach uses a set of trainable query embeddings that attend to the visual features via cross-attention. The result is a fixed-size set of vision-informed representations that can be concatenated with text token embeddings for language model processing.

Motivation

Vision encoders like InternViT-6B produce long, high-dimensional feature sequences (e.g., hundreds of patch tokens, each with a large hidden dimension). Feeding all of these visual tokens directly to a language model is computationally expensive and can crowd out text tokens in the context window. The query-based bridge addresses this with a small, fixed number of learnable queries (e.g., 96) that selectively attend to relevant visual information through cross-attention, producing a compact visual representation in the language model's input space.
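
To make the cost argument concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. The patch-token count (256) and text length (32) are hypothetical values chosen for illustration; only the 96 queries come from the text above. It counts query-key pairs in full self-attention over the joint visual-plus-text sequence, ignoring causal masking and all other costs.

```python
def attention_pairs(num_visual_tokens, num_text_tokens):
    """Query-key pairs in full self-attention over the joint sequence."""
    seq_len = num_visual_tokens + num_text_tokens
    return seq_len * seq_len


num_patch_tokens = 256  # hypothetical: raw vision encoder output length
num_query_tokens = 96   # fixed number of learnable queries (from the text)
num_text_tokens = 32    # hypothetical prompt length

direct = attention_pairs(num_patch_tokens, num_text_tokens)   # 288 * 288
bridged = attention_pairs(num_query_tokens, num_text_tokens)  # 128 * 128

print(direct, bridged, direct / bridged)  # 82944 16384 5.0625
```

Under these assumed numbers the language model attends over roughly five times fewer token pairs when it receives the 96 query outputs instead of the raw patch tokens. The cross-attention inside the bridge has its own cost, which this sketch does not count.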

Structure

The architecture consists of:

Learnable query tokens: A fixed set of trainable embeddings (e.g., 96 tokens, each of dimension text_hidden_size) that serve as the initial queries.
Vision encoder: A frozen or LoRA-adapted vision transformer (e.g., InternViT) that produces visual feature sequences from input images.
Cross-attention layers: Interspersed within a modified language model (e.g., QLLaMA), these layers enable query tokens to attend to vision encoder outputs. Cross-attention is applied at regular intervals (controlled by cross_attention_frequency).
Self-attention layers: Standard causal self-attention in the language model processes query tokens alongside text tokens.
Attention pooling: Optional attention pooling blocks that further compress features for contrastive alignment.
LoRA adapters: Optional low-rank adaptation on both the vision backbone and the query language model for parameter-efficient fine-tuning.
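
The interaction between the query tokens, the visual features, and cross_attention_frequency can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the InternVL implementation: the placement rule (layers whose index is a multiple of the frequency receive cross-attention), the single attention head, the toy dimensions, and the reuse of one set of projection weights across layers are all simplifications, and self-attention and MLP sub-layers are omitted.

```python
import numpy as np


def cross_attention_layer_flags(num_layers, cross_attention_frequency):
    """Mark which layers also get a cross-attention sub-layer.

    Assumption: layers whose index is a multiple of the frequency are chosen.
    """
    return [i % cross_attention_frequency == 0 for i in range(num_layers)]


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, visual_feats, w_q, w_k, w_v):
    """Single-head cross-attention: queries read from visual features.

    queries:      (num_queries, d_text)
    visual_feats: (num_patches, d_vision)
    returns:      (num_queries, d_text)
    """
    q = queries @ w_q        # (num_queries, d_attn)
    k = visual_feats @ w_k   # (num_patches, d_attn)
    v = visual_feats @ w_v   # (num_patches, d_text)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's distribution over patches
    return queries + weights @ v        # residual update of the queries


rng = np.random.default_rng(0)
num_queries, d_text = 96, 64      # 96 as in the text; 64 is a toy width
num_patches, d_vision = 256, 128  # toy values
d_attn = 32

# In a real model these would be trainable parameters.
query_tokens = rng.normal(size=(num_queries, d_text)) * 0.02
w_q = rng.normal(size=(d_text, d_attn)) * 0.02
w_k = rng.normal(size=(d_vision, d_attn)) * 0.02
w_v = rng.normal(size=(d_vision, d_text)) * 0.02

# Stand-in for the vision encoder output.
visual_feats = rng.normal(size=(num_patches, d_vision))

flags = cross_attention_layer_flags(num_layers=8, cross_attention_frequency=2)
hidden = query_tokens
for has_cross_attn in flags:
    # Self-attention and MLP sub-layers are omitted in this sketch.
    if has_cross_attn:
        hidden = cross_attend(hidden, visual_feats, w_q, w_k, w_v)

print(hidden.shape)  # (96, 64)
```

The output keeps the shape of the query tokens regardless of how many patch tokens the vision encoder produces, which is what makes the representation fixed-size.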

Applicability

This principle applies when:

A vision-language model needs to compress visual features efficiently
The vision encoder produces more tokens than is desirable for the language model context
Fine-grained cross-modal interaction is needed beyond simple projection
Both contrastive (CLIP-style) and generative (captioning/VQA) capabilities are desired
The InternVL architecture family is being used (InternVL-14B, InternVL-C, InternVL-G)

Limitations

More complex than simple MLP projection bridges, requiring careful architectural coordination
Cross-attention layers add computational overhead compared to projection-only approaches
The number of query tokens is a critical hyperparameter affecting the information bottleneck
Requires specialized training procedures (e.g., separate LoRA configs for vision vs. language components)
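
As an illustration of what separate LoRA configurations for the two components might look like, here is a small self-contained sketch. LoraSpec is a hypothetical stand-in for a real LoRA configuration object, and the ranks, dropout values, module names, and layer sizes are illustrative assumptions rather than values taken from InternVL. The only fixed fact used is that a rank-r adapter on a d_in x d_out linear layer adds r * (d_in + d_out) trainable parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSpec:
    """Hypothetical stand-in for a LoRA configuration (not a library class)."""
    r: int                 # adapter rank
    alpha: int             # scaling factor
    dropout: float
    target_modules: tuple  # names of the linear layers to adapt

    def added_params(self, d_in, d_out):
        """Trainable parameters LoRA adds to one d_in x d_out linear layer."""
        return self.r * (d_in + d_out)


# Illustrative values only; module names are assumptions.
vision_lora = LoraSpec(r=16, alpha=32, dropout=0.05,
                       target_modules=("attn.qkv", "attn.proj"))
query_lm_lora = LoraSpec(r=8, alpha=16, dropout=0.05,
                         target_modules=("self_attn.q_proj", "self_attn.v_proj"))

print(vision_lora.added_params(1024, 1024))  # 16 * 2048 = 32768
print(query_lm_lora.added_params(512, 512))  # 8 * 1024 = 8192
```

Keeping the two specifications separate lets the vision backbone and the query language model use different ranks and target different layer names, since their module naming and sizes generally differ.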
