Implementation:FlagOpen FlagEmbedding BGE VL Flag MMRet
| Knowledge Sources | |
|---|---|
| Domains | Vision-Language Embedding, Multimodal Retrieval, CLIP |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A unified encoder class for multimodal retrieval supporting text-only, image-only, and image-text queries and documents.
Description
The Flag_mmret class provides a flexible interface for encoding multimodal data with CLIP-based vision-language models. It supports three encoding modes: text-only encoding for text retrieval, image-only encoding for image retrieval, and multimodal (image+text) encoding for composed queries. The class handles model loading, GPU acceleration with multi-GPU support, FP16 inference for efficiency, and embedding normalization so that dot products can be used as cosine similarities. Multimodal embeddings are computed by combining the text and image features from the CLIP model, which makes the class suitable for cross-modal retrieval tasks.
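As a minimal sketch of the normalization and feature combination described above, the snippet below uses random NumPy arrays in place of real CLIP features. The element-wise sum used to combine text and image features is an assumption made for illustration; this page only states that the features are combined. The helper names `l2_normalize` and `fuse_features` are hypothetical and are not part of `Flag_mmret`.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def fuse_features(text_feats: np.ndarray, image_feats: np.ndarray) -> np.ndarray:
    """Hypothetical fusion: element-wise sum followed by normalization.

    ASSUMPTION: the sum is illustrative only, not confirmed by the source.
    """
    return l2_normalize(text_feats + image_feats)


# Random stand-ins for CLIP text and image features (2 items, 8 dimensions).
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(2, 8)).astype(np.float32)
image_feats = rng.normal(size=(2, 8)).astype(np.float32)

fused = fuse_features(text_feats, image_feats)
docs = l2_normalize(rng.normal(size=(3, 8)).astype(np.float32))

# With unit-norm rows, the dot product equals the cosine similarity.
cosine = fused @ docs.T
print(cosine.shape)
```

Because every row has unit norm, each entry of `cosine` lies in [-1, 1] and can be ranked directly without a separate normalization step.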
Usage
Use this class when building multimodal retrieval systems with BGE-VL, encoding images, text, or image-text pairs into a shared embedding space, and performing cross-modal search (text-to-image, image-to-text, or composed queries). The class is designed for both training and inference in vision-language retrieval applications.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_VL/eval/flag_mmret.py
- Lines: 1-206
Signature
class Flag_mmret(nn.Module):
    def __init__(
        self,
        model_name: str = None,
        normlized: bool = True,
        pooling_method: str = 'cls',
        use_fp16: bool = True,
        image_dir: str = None,
    ) -> None:
        pass

    def encode_queries(
        self,
        queries: Union[List[str], str],
        batch_size: int = 256,
        max_length: int = 77,
        query_type: str = None,
    ) -> np.ndarray:
        """Encode queries (text, image, or multimodal)"""

    def encode_corpus(
        self,
        corpus: dict,
        batch_size: int = 256,
        max_length: int = 77,
        corpus_type: str = None,
    ) -> np.ndarray:
        """Encode corpus (text, image, or multimodal)"""

    def encode_text(
        self,
        sentences: Union[List[str], str],
        batch_size: int = 256,
        max_length: int = 77,
    ) -> np.ndarray:
        """Encode text-only"""

    def encode_image(
        self,
        image_ids: Union[List[str], str],
        batch_size: int = 256,
        max_length: int = 77,
    ) -> np.ndarray:
        """Encode image-only"""

    def encode_mm_it(
        self,
        captions: Union[List[str], str],
        image_ids: Union[List[str], str],
        batch_size: int = 256,
        max_length: int = 77,
    ) -> np.ndarray:
        """Encode multimodal image-text pairs"""
Import
from flag_mmret import Flag_mmret
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | Yes | BGE-VL model name or path |
| normlized | bool | No | Normalize embeddings (default: True); the parameter name is spelled this way in the signature |
| pooling_method | str | No | Pooling method (default: 'cls') |
| use_fp16 | bool | No | Use FP16 inference (default: True) |
| image_dir | str | Yes | Directory containing images |
| queries | Union[List[str], str, List] | Yes | Query data; the format depends on query_type (for "mm_it", a two-element list [captions, image_ids]) |
| query_type | str | Yes | "text", "image", or "mm_it" |
| corpus | dict | Yes | Corpus dict with a "text" key, an "image" key, or both, depending on corpus_type |
| corpus_type | str | Yes | "text", "image", or "mm_it" |
| batch_size | int | No | Batch size (default: 256) |
| max_length | int | No | Maximum text sequence length (default: 77) |
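One way to read the `queries` and `query_type` rows together is as a small dispatch step. The sketch below is a hypothetical helper, `split_query_inputs`, that is not part of `Flag_mmret`; it only mirrors the input shapes used in the usage examples on this page, and its validation rules are assumptions.

```python
from typing import List, Sequence, Tuple, Union

VALID_TYPES = ("text", "image", "mm_it")


def _as_list(value: Union[str, Sequence[str]]) -> List[str]:
    """Wrap a single string in a list; copy any other sequence into a list."""
    return [value] if isinstance(value, str) else list(value)


def split_query_inputs(queries, query_type: str) -> Tuple[List[str], List[str]]:
    """Return (captions, image_ids) for the given query_type.

    Hypothetical helper for illustration; the real class may validate
    its inputs differently.
    """
    if query_type not in VALID_TYPES:
        raise ValueError(f"query_type must be one of {VALID_TYPES}, got {query_type!r}")
    if query_type == "mm_it":
        # Multimodal queries are passed as [captions, image_ids].
        captions, image_ids = queries
        captions, image_ids = _as_list(captions), _as_list(image_ids)
        if len(captions) != len(image_ids):
            raise ValueError("captions and image_ids must have the same length")
        return captions, image_ids
    items = _as_list(queries)
    return (items, []) if query_type == "text" else ([], items)


print(split_query_inputs("A cat on a couch", "text"))
print(split_query_inputs([["A red car"], ["car.jpg"]], "mm_it"))
```

Checking the pairing up front makes a length mismatch between captions and image IDs fail early instead of surfacing later as a shape error.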
Outputs
| Name | Type | Description |
|---|---|---|
| embeddings | np.ndarray | Embeddings of shape (N, D), where N is the number of inputs and D is the embedding dimension; normalized when normlized is True |
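The (N, D) output shape does not depend on `batch_size`, which only controls how many inputs are processed at once. The following is a rough sketch of that batching pattern, assuming a stand-in encoder; `encode_in_batches` and `fake_encode_batch` are hypothetical names and do not reflect how `Flag_mmret` is implemented internally.

```python
from typing import Callable, List

import numpy as np


def encode_in_batches(
    items: List[str],
    encode_batch: Callable[[List[str]], np.ndarray],
    batch_size: int = 256,
) -> np.ndarray:
    """Encode items chunk by chunk and stack the results into one (N, D) array."""
    chunks = [
        encode_batch(items[start:start + batch_size])
        for start in range(0, len(items), batch_size)
    ]
    if not chunks:
        # No inputs: return an empty 2-D array rather than failing in concatenate.
        return np.empty((0, 0), dtype=np.float32)
    return np.concatenate(chunks, axis=0)


def fake_encode_batch(batch: List[str]) -> np.ndarray:
    """Stand-in encoder (assumption): simple character counts, unit-normalized."""
    vecs = np.array(
        [[len(s), s.count("a"), s.count("e"), 1.0] for s in batch],
        dtype=np.float32,
    )
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)


embeddings = encode_in_batches(["a cat", "a dog", "a tree"], fake_encode_batch, batch_size=2)
print(embeddings.shape)
```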
Usage Examples
# Example 1: Text-only encoding
from flag_mmret import Flag_mmret

model = Flag_mmret(
    model_name="BAAI/BGE-VL-large",
    normlized=True,
    use_fp16=True,
    image_dir="/path/to/images",
)

# Encode text queries
text_queries = ["A cat on a couch", "A dog in a park"]
text_embeddings = model.encode_queries(
    queries=text_queries,
    batch_size=32,
    max_length=77,
    query_type="text",
)
print(text_embeddings.shape)  # (2, 768)

# Example 2: Image-only encoding
image_ids = ["img1.jpg", "img2.jpg", "img3.jpg"]
image_embeddings = model.encode_corpus(
    corpus={"image": image_ids},
    batch_size=64,
    corpus_type="image",
)
print(image_embeddings.shape)  # (3, 768)

# Example 3: Multimodal (image + text) encoding
captions = ["A red car", "A blue bicycle"]
image_ids = ["car.jpg", "bike.jpg"]
# Multimodal queries are passed as [captions, image_ids]
mm_embeddings = model.encode_queries(
    queries=[captions, image_ids],
    batch_size=32,
    query_type="mm_it",
)
print(mm_embeddings.shape)  # (2, 768)

# Example 4: Cross-modal retrieval
# Encode image corpus
corpus_images = ["doc1.jpg", "doc2.jpg", "doc3.jpg"]
corpus_emb = model.encode_corpus(
    corpus={"image": corpus_images},
    corpus_type="image",
)

# Encode text query
query_emb = model.encode_queries(
    queries="A sunset over mountains",
    query_type="text",
)

# Compute similarities; with normalized embeddings the dot product is the cosine similarity
similarities = query_emb @ corpus_emb.T
print(f"Most similar image: {corpus_images[similarities.argmax()]}")
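
Example 4 returns only the single best match. When a ranked list is needed, the same similarity matrix can be sorted per query. The sketch below assumes normalized embeddings and uses small hand-written vectors instead of model output; `top_k` is a hypothetical helper, not a method of `Flag_mmret`.

```python
import numpy as np


def top_k(query_emb: np.ndarray, corpus_emb: np.ndarray, k: int = 2):
    """Return (indices, scores) of the k most similar corpus rows per query."""
    query_emb = np.atleast_2d(query_emb)   # accept a single 1-D query vector
    scores = query_emb @ corpus_emb.T      # shape (num_queries, num_docs)
    k = min(k, corpus_emb.shape[0])
    # Sort by descending score and keep the first k columns.
    indices = np.argsort(-scores, axis=1)[:, :k]
    return indices, np.take_along_axis(scores, indices, axis=1)


# Hand-written unit vectors standing in for real embeddings.
corpus_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
query_emb = np.array([0.8, 0.6])

indices, scores = top_k(query_emb, corpus_emb, k=2)
print(indices)  # [[2 0]]
```

A full sort is simple and adequate for small corpora; for large ones, `np.argpartition` or an approximate nearest-neighbor index would avoid sorting every score.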