Principle:FlagOpen FlagEmbedding Multimodal Retrieval
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Computer Vision, Multimodal Learning, Information Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Vision-language retrieval combines image and text modalities through contrastive learning to enable cross-modal search and composed image retrieval.
Description
This principle extends text-only retrieval to multimodal scenarios where queries and documents can contain both visual and textual information. The approach uses dual encoders (one for vision, one for text) whose outputs share an aligned embedding space, so that items can be retrieved across modalities. Key applications include composed image retrieval (finding images from a reference image plus a text modification), with benchmarks such as FashionIQ for fashion search and CIRCO for composed retrieval over common objects in context. The architecture uses CLIP-style contrastive pretraining followed by task-specific fine-tuning. The system handles queries that combine a visual reference with a textual constraint, such as "find a dress like this but in red."
Usage
Use this principle when:
- Building image search systems that accept text queries
- Implementing composed image retrieval for e-commerce or fashion
- Creating cross-modal retrieval systems for multimedia databases
- Developing vision-language models for content discovery
Theoretical Basis
The multimodal retrieval framework consists of these components:
- Dual Encoders: Image encoder f_v(I) → v ∈ R^d and text encoder f_t(T) → t ∈ R^d
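- Contrastive Learning: For a batch of N matched pairs, the image-to-text loss for pair i is L_i = -log(exp(sim(v_i, t_i)/τ) / Σ_j exp(sim(v_i, t_j)/τ)), where sim is typically cosine similarity, τ is a temperature, and j runs over all N texts in the batch. CLIP-style training averages this with the symmetric text-to-image term.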
- Composed Retrieval: For query (I_ref, T_mod), compute: h = g(f_v(I_ref), f_t(T_mod)) where g is a composition function
- Cross-Modal Alignment: Maximize similarity between matched image-text pairs while minimizing similarity with negatives
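The contrastive objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not the FlagEmbedding implementation: the function names `cosine` and `info_nce`, the default temperature of 0.07, and the toy 2-dimensional embeddings are all assumptions made for the example, and only the image-to-text direction is computed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(image_embs, text_embs, tau=0.07):
    """Mean image-to-text contrastive loss; pair i is (image_embs[i], text_embs[i]).

    tau is a hypothetical default temperature chosen for illustration.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_embs[i], text_embs[j]) / tau for j in range(n)]
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += -(logits[i] - log_denom)
    return total / n

# Toy stand-ins for encoder outputs f_v(I) and f_t(T)
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]   # text i matches image i
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]  # text i matches the other image

loss_aligned = info_nce(images, aligned_texts)
loss_shuffled = info_nce(images, shuffled_texts)
```

With these toy vectors the loss is lower when each image is paired with its matching text, which is the behavior the alignment objective is meant to reward.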
Common composition strategies:
- Addition: h = v + t
- Gated fusion: h = α*v + (1-α)*t where α = σ(W[v;t]), σ is the sigmoid function, [v;t] is the concatenation of v and t, and W is a learned projection
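- Attention-based: h = Attention(v, t), where one modality attends over the features of the other (for example, text features attending over image patch features)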
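The three strategies can be sketched as follows. This is an illustrative reading under stated assumptions rather than a reference implementation: the gate is taken to be a single scalar (W is one weight row `w`), and the attention variant assumes the image is available as a list of patch embeddings that the text vector attends over. The function names and toy inputs are hypothetical.

```python
import math

def add_fusion(v, t):
    """Addition: h = v + t (element-wise)."""
    return [a + b for a, b in zip(v, t)]

def gated_fusion(v, t, w):
    """Gated fusion with a scalar gate: alpha = sigmoid(w . [v;t]).

    w is a hypothetical learned weight row of length len(v) + len(t).
    """
    concat = v + t  # list concatenation plays the role of [v;t]
    z = sum(wi * xi for wi, xi in zip(w, concat))
    alpha = 1.0 / (1.0 + math.exp(-z))
    h = [alpha * a + (1.0 - alpha) * b for a, b in zip(v, t)]
    return h, alpha

def attention_fusion(t, patches):
    """Text vector t attends over image patch embeddings (assumed setup).

    Scaled dot-product scores, softmax weights, weighted sum of patches.
    """
    d = len(t)
    scores = [sum(a * b for a, b in zip(t, p)) / math.sqrt(d) for p in patches]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(wt * p[k] for wt, p in zip(weights, patches)) for k in range(d)]

v = [1.0, 0.0]                      # stand-in for f_v(I_ref)
t = [0.0, 1.0]                      # stand-in for f_t(T_mod)
h_add = add_fusion(v, t)
h_gate, alpha = gated_fusion(v, t, w=[0.0, 0.0, 0.0, 0.0])
h_attn = attention_fusion(t, patches=[[1.0, 0.0], [0.0, 1.0]])
```

Addition has no parameters and treats both modalities equally, while the gated and attention variants let the model learn how much weight the reference image should receive relative to the modification text.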
The goal is to create a unified embedding space where semantically related images and texts have high similarity regardless of modality.
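Once queries and candidates live in the same space, retrieval reduces to ranking candidates by similarity to the query embedding. The sketch below assumes cosine similarity as the scoring function; the gallery entries, their 3-dimensional embeddings, and the names `normalize` and `rank_gallery` are made up for illustration.

```python
import math

def normalize(x):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

def rank_gallery(query, gallery, top_k=3):
    """Return the names of the top_k gallery items most similar to the query."""
    q = normalize(query)
    scored = []
    for name, emb in gallery.items():
        g = normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, g)), name))
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Hypothetical precomputed image embeddings for a small gallery
gallery = {
    "red_dress": [0.9, 0.8, 0.1],
    "blue_dress": [0.9, 0.1, 0.8],
    "red_shoe": [0.1, 0.9, 0.1],
}

# Stand-in for a composed query h = g(f_v(I_ref), f_t(T_mod)),
# e.g. a dress image combined with the text "but in red"
query = [1.0, 0.7, 0.0]
ranking = rank_gallery(query, gallery, top_k=2)
```

In practice the gallery embeddings would be computed once by the image encoder and stored in an index, so only the query needs to be encoded at search time.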