Principle:OpenGVLab InternVL Query Based Vision Language Bridge


Principle Name: Query_Based_Vision_Language_Bridge
Domains: Multimodal, Vision-Language Model, Cross-Attention
Last Updated: 2026-02-07 14:00 GMT

Summary

Query-Based Vision-Language Bridge is an architectural pattern in which learnable query tokens are processed by a language model with interleaved cross-attention layers in order to extract and compress visual features from a vision encoder. Instead of projecting vision encoder outputs directly through an MLP, the bridge uses a set of trainable query embeddings that attend to the visual features via cross-attention. The result is a fixed-size set of vision-informed representations that can be concatenated with text token embeddings for language model processing.
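The core mechanism can be illustrated with a minimal, dependency-free sketch of single-head cross-attention. It is not the InternVL implementation: the function names, the tiny dimensions, and the random weights are all hypothetical, and real models add multi-head attention, residual connections, and normalization. The point it shows is that the output size depends only on the number of queries, not on the number of visual tokens.

```python
import math
import random

def rand_matrix(rng, rows, cols, scale=0.1):
    """Random matrix standing in for trained parameters (illustrative only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def project(rows, weight):
    """Multiply each row vector (length d_in) by weight (d_in x d_out)."""
    d_out = len(weight[0])
    return [[sum(r[i] * weight[i][j] for i in range(len(r))) for j in range(d_out)]
            for r in rows]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, visual, w_q, w_k, w_v):
    """Each query attends over all visual tokens; returns one vector per query."""
    q = project(queries, w_q)   # queries come from the learnable embeddings
    k = project(visual, w_k)    # keys and values come from the vision encoder
    v = project(visual, w_v)
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

rng = random.Random(0)
# Hypothetical toy sizes; the page mentions e.g. 96 queries in practice.
num_queries, text_dim, vision_dim, num_patches = 4, 8, 6, 20

queries = rand_matrix(rng, num_queries, text_dim)          # learnable query tokens
visual = rand_matrix(rng, num_patches, vision_dim, 1.0)    # vision encoder output
w_q = rand_matrix(rng, text_dim, text_dim)
w_k = rand_matrix(rng, vision_dim, text_dim)
w_v = rand_matrix(rng, vision_dim, text_dim)

bridged = cross_attend(queries, visual, w_q, w_k, w_v)

# The fixed-size result can be concatenated with text token embeddings.
text_embeds = rand_matrix(rng, 5, text_dim)
lm_input = bridged + text_embeds
print(len(bridged), len(bridged[0]), len(lm_input))
```

Because keys and values are projected into the query dimension, the bridged vectors already live in the text embedding space, which is what makes the concatenation in the last step possible.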

Motivation

Vision encoders like InternViT-6B produce long, high-dimensional feature sequences (e.g., hundreds of patch tokens, each with a large hidden dimension). Feeding all visual tokens directly to a language model is computationally expensive and may overwhelm the text context. The query-based bridge addresses this by using a small, fixed number of learnable queries (e.g., 96) that selectively attend to relevant visual information through cross-attention, producing a compact visual representation suitable for the language model's input space.
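The token savings can be made concrete with a short calculation. The image sizes and the patch size of 14 below are illustrative assumptions rather than values stated on this page; only the query count of 96 comes from the example above.

```python
def patch_token_count(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    side = image_size // patch_size
    return side * side

NUM_QUERIES = 96   # example query count from the text
PATCH_SIZE = 14    # assumed patch size, for illustration only

for image_size in (224, 448):  # assumed input resolutions
    patches = patch_token_count(image_size, PATCH_SIZE)
    ratio = patches / NUM_QUERIES
    print(f"{image_size}px: {patches} patch tokens -> {NUM_QUERIES} queries "
          f"({ratio:.2f}x fewer tokens)")
```

Under these assumptions the language model sees the same 96 visual tokens whether the encoder emits 256 or 1024 patches, so the compression ratio grows with input resolution while the language model's visual token budget stays constant.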

Structure

The architecture consists of:

  • Learnable query tokens: A fixed set of trainable embeddings (e.g., 96 tokens of text_hidden_size dimensions) that serve as the initial queries.
  • Vision encoder: A frozen or LoRA-adapted vision transformer (e.g., InternViT) that produces visual feature sequences from input images.
  • Cross-attention layers: Interspersed within a modified language model (e.g., QLLaMA), these layers enable query tokens to attend to vision encoder outputs. Cross-attention is applied at regular intervals (controlled by cross_attention_frequency).
  • Self-attention layers: Standard causal self-attention in the language model processes query tokens alongside text tokens.
  • Attention pooling: Optional attention pooling blocks that further compress features for contrastive alignment.
  • LoRA adapters: Optional low-rank adaptation on both the vision backbone and the query language model for parameter-efficient fine-tuning.
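How cross_attention_frequency might determine which layers receive cross-attention can be sketched as follows. The modulo rule and the convention that counting starts at layer 0 are assumptions made for illustration; the helper names are hypothetical, and the actual QLLaMA code should be consulted for the exact placement.

```python
def cross_attention_layers(num_layers, cross_attention_frequency):
    """Indices of decoder layers that get a cross-attention block.

    Assumed rule: layer i has cross-attention when
    i % cross_attention_frequency == 0.
    """
    if cross_attention_frequency < 1:
        raise ValueError("cross_attention_frequency must be >= 1")
    return [i for i in range(num_layers) if i % cross_attention_frequency == 0]

def layer_plan(num_layers, cross_attention_frequency):
    """Sub-blocks per layer: every layer has self-attention and an MLP."""
    with_cross = set(cross_attention_layers(num_layers, cross_attention_frequency))
    plan = []
    for i in range(num_layers):
        blocks = ["self_attn"]
        if i in with_cross:
            blocks.append("cross_attn")  # queries attend to vision features here
        blocks.append("mlp")
        plan.append(blocks)
    return plan

# Hypothetical small model: 6 layers, cross-attention in every second layer.
for index, blocks in enumerate(layer_plan(6, 2)):
    print(index, blocks)
```

A lower frequency value gives the queries more opportunities to read from the vision encoder at the cost of extra cross-attention parameters and compute, which is the overhead noted under Limitations.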

Applicability

This principle applies when:

  • Building vision-language models that need to compress visual features efficiently
  • The vision encoder produces more tokens than desirable for the language model context
  • Fine-grained cross-modal interaction is needed beyond simple projection
  • Both contrastive (CLIP-style) and generative (captioning/VQA) capabilities are desired
  • The InternVL architecture family is being used (InternVL-14B, InternVL-C, InternVL-G)

Limitations

  • More complex than simple MLP projection bridges, requiring careful architectural coordination
  • Cross-attention layers add computational overhead compared to projection-only approaches
  • The number of query tokens is a critical hyperparameter affecting the information bottleneck
  • Requires specialized training procedures (e.g., separate LoRA configs for vision vs. language components)
