Implementation:FlagOpen FlagEmbedding BGE VL Eval FashionIQ
| Knowledge Sources | |
|---|---|
| Domains | Computer Vision, Multi-Modal Retrieval, Fashion AI, Model Evaluation |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An evaluation script for measuring BGE-VL model performance on the FashionIQ benchmark, which tests composed image-text retrieval for fashion items.
Description
This script implements a comprehensive evaluation pipeline for the FashionIQ dataset, a challenging benchmark for composed image retrieval where the task is to find a target fashion item given a reference image and a text description of desired modifications. The evaluation covers three fashion categories (shirt, dress, toptee) and computes standard retrieval metrics (MRR and Recall at various cutoffs).
The script orchestrates the complete evaluation workflow including encoding the fashion image corpus using BGE-VL's visual encoder, building FAISS indexes for efficient similarity search, encoding multi-modal queries (reference image + text modification), retrieving top-k candidates, and computing metrics. It supports both CPU and multi-GPU inference, optional embedding persistence for faster re-evaluation, and configurable FAISS index types for speed/accuracy tradeoffs.
Key features include efficient batch processing with configurable batch sizes, memory-mapped embedding storage for large corpora, automatic handling of invalid indices (-1) in FAISS results, and comprehensive metric reporting including MRR@K and Recall@K for K in {1, 5, 10, 20, 50, 100}.
Usage
Use this script to evaluate BGE-VL or similar multi-modal retrieval models on the FashionIQ benchmark to measure their ability to understand and retrieve fashion items based on image-text queries.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_VL/eval/eval_fashioniq.py
- Lines: 1-342
Signature
def main()

def index(
    model: Flag_mmret,
    corpus: datasets.Dataset,
    batch_size: int = 256,
    max_length: int = 512,
    index_factory: str = "Flat",
    save_path: str = None,
    save_embedding: bool = False,
    load_embedding: bool = False
) -> faiss.Index

def search(
    model: Flag_mmret,
    queries: datasets,
    faiss_index: faiss.Index,
    k: int = 100,
    batch_size: int = 256,
    max_length: int = 512
) -> Tuple[np.ndarray, np.ndarray]

def evaluate(
    preds: List[List[str]],
    labels: List[List[str]],
    cutoffs: List[int] = [1, 5, 10, 20, 50, 100]
) -> dict
Import
# Typically run as a script
# python eval_fashioniq.py --model_name BAAI/BGE-VL-large
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | No | Model checkpoint path (default: "BAAI/BGE-VL-large") |
| image_dir | str | No | Directory containing FashionIQ images |
| batch_size | int | No | Inference batch size (default: 256) |
| max_query_length | int | No | Maximum query length (default: 64) |
| max_passage_length | int | No | Maximum passage length (default: 77) |
| k | int | No | Number of neighbors to retrieve (default: 100) |
| index_factory | str | No | FAISS index type (default: "Flat") |
| fp16 | bool | No | Use FP16 inference (default: False) |
| save_embedding | bool | No | Save embeddings to disk (default: False) |
| load_embedding | bool | No | Load cached embeddings (default: False) |
| save_path | str | No | Path for embedding cache (default: "embeddings.memmap") |
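The inputs above can be sketched as a command-line parser. This is a minimal illustration built on `argparse` with the names and defaults from the table; `build_parser` is a hypothetical helper, and the actual script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the Inputs table (not the script's own code)."""
    p = argparse.ArgumentParser(description="FashionIQ evaluation (sketch)")
    p.add_argument("--model_name", default="BAAI/BGE-VL-large")
    p.add_argument("--image_dir", default=None)
    p.add_argument("--batch_size", type=int, default=256)
    p.add_argument("--max_query_length", type=int, default=64)
    p.add_argument("--max_passage_length", type=int, default=77)
    p.add_argument("--k", type=int, default=100)
    p.add_argument("--index_factory", default="Flat")
    # Boolean options are modeled as switches, matching the usage examples.
    p.add_argument("--fp16", action="store_true")
    p.add_argument("--save_embedding", action="store_true")
    p.add_argument("--load_embedding", action="store_true")
    p.add_argument("--save_path", default="embeddings.memmap")
    return p


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--k", "50", "--fp16"])
print(args.k, args.fp16, args.batch_size)
```

Options that are not passed keep the defaults listed in the table.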
Outputs
| Name | Type | Description |
|---|---|---|
| metrics_shirt | dict | Metrics for shirt category (MRR@K, Recall@K) |
| metrics_dress | dict | Metrics for dress category |
| metrics_toptee | dict | Metrics for toptee category |
| overall_scores | tuple | Average Recall@10 and Recall@50 across categories |
FashionIQ Dataset
Task Description
Composed Image Retrieval: Given a reference fashion image and a text description of desired changes, retrieve the target fashion item that matches the modified description.
Example:
- Reference Image: A blue striped shirt
- Text Modification: "make it solid color and change to red"
- Target: A solid red shirt
Dataset Structure
Categories:
- shirt: Men's and women's shirts
- dress: Women's dresses
- toptee: Women's tops and t-shirts
Data Files:
- fashioniq_{category}_corpus.jsonl: Image corpus with "content" field (image filename)
- fashioniq_{category}_query_val.jsonl: Validation queries with:
  * q_img: Reference image filename
  * q_text: Text modification description
  * positive_key: Target image filename(s)
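A minimal sketch of reading these files with the standard library, assuming one JSON object per line. The sample records and filenames are made up, and `read_jsonl` and `as_list` are hypothetical helpers rather than functions from the script.

```python
import json

# Made-up sample records; the filenames are placeholders, not real FashionIQ data.
corpus_jsonl = '{"content": "img_001.png"}\n{"content": "img_002.png"}\n'
query_jsonl = (
    '{"q_img": "img_001.png", "q_text": "is red and has no stripes", '
    '"positive_key": "img_002.png"}\n'
)


def read_jsonl(text: str) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def as_list(value) -> list:
    """Normalize positive_key, which may hold one filename or several."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


corpus = read_jsonl(corpus_jsonl)
queries = read_jsonl(query_jsonl)
labels = [as_list(q["positive_key"]) for q in queries]
print(labels)
```

Normalizing `positive_key` to a list keeps the later metric code uniform whether a query has one target or several.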
Metrics
Mean Reciprocal Rank (MRR@K):
- Measures rank of first correct result
- Per-query reciprocal rank = 1/rank if rank ≤ K, else 0
- Averaged over all queries
Recall@K:
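- Per query, the fraction of ground-truth targets found in the top-K, averaged over all queries (with a single target per query, this equals the fraction of queries whose target appears in the top-K)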
- Measures retrieval coverage
Standard Cutoffs: K ∈ {1, 5, 10, 20, 50, 100}
Evaluation Pipeline
Step 1: Model Initialization
model = Flag_mmret(
    model_name=args.model_name,
    normlized=True,
    image_dir=args.image_dir,
    use_fp16=False
)
Step 2: Index Image Corpus
For each category (shirt, dress, toptee):
1. Load image corpus dataset
2. Encode images using model.encode_corpus(corpus_type='image')
3. Build FAISS index (Flat for exact search)
4. Optionally save embeddings to disk
Step 3: Encode Queries
1. Load query dataset with q_img, q_text, positive_key
2. Encode multi-modal queries: model.encode_queries([q_text, q_img], query_type='mm_it')
3. Multi-modal encoding combines image and text representations
Step 4: Search
1. Query FAISS index with encoded queries
2. Retrieve top-k candidates per query
3. Return scores and indices
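To make the search step concrete without a FAISS installation, the sketch below performs the same kind of exact top-k lookup with NumPy. It is a stand-in, not the script's code: it assumes inner-product similarity over normalized embeddings, and `flat_search` is a hypothetical helper. It pads missing neighbors with -1, the same placeholder index discussed under Invalid Index Handling; the padded score value is an arbitrary choice here.

```python
import numpy as np


def flat_search(corpus_emb: np.ndarray, query_emb: np.ndarray, k: int):
    """Exact inner-product top-k search; pads with -1 when k exceeds the corpus size."""
    n = corpus_emb.shape[0]
    sims = query_emb @ corpus_emb.T                          # (num_queries, n)
    order = np.argsort(-sims, axis=1, kind="stable")[:, :k]  # best match first
    scores = np.take_along_axis(sims, order, axis=1)
    if k > n:
        pad = k - n
        order = np.pad(order, ((0, 0), (0, pad)), constant_values=-1)
        scores = np.pad(scores, ((0, 0), (0, pad)), constant_values=-np.inf)
    return scores, order


# Three unit-length corpus vectors and one query (toy data).
corpus_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
query_emb = np.array([[1.0, 0.0]], dtype=np.float32)

scores, indices = flat_search(corpus_emb, query_emb, k=5)
print(indices[0].tolist())  # [0, 2, 1, -1, -1]
```

The query matches vector 0 exactly, then vector 2 (similarity 0.6), then vector 1 (similarity 0), and the two remaining slots are padded because only three vectors exist.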
Step 5: Evaluation
1. Convert indices to image filenames
2. Compare with ground truth positive_key
3. Compute MRR and Recall at all cutoffs
4. Report per-category and overall metrics
FAISS Index
Index Types
- Flat: Exact search, exhaustive comparison (default)
- IVF: Inverted file index, faster but approximate
- HNSW: Hierarchical navigable small world graph
- PQ: Product quantization for compression
GPU Acceleration
if model.device == torch.device("cuda"):
    co = faiss.GpuMultipleClonerOptions()
    co.useFloat16 = True
    faiss_index = faiss.index_cpu_to_all_gpus(faiss_index, co)
Distributes index across all available GPUs for faster search.
Memory Management
- Embeddings stored in float32 for FAISS compatibility
- Optional memory-mapped storage for large corpora
- Batch processing to avoid OOM errors
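A rough size estimate helps decide whether memory-mapped storage is worthwhile. The sketch below assumes float32 values (4 bytes each); the 768-dimensional embedding size is a hypothetical figure chosen for illustration, not a documented property of the model.

```python
def embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of a dense embedding matrix; float32 uses 4 bytes per value."""
    return num_vectors * dim * bytes_per_value


# 100,000 vectors with a hypothetical dimension of 768.
size = embedding_bytes(100_000, 768)
print(f"{size} bytes = {size / 1024**2:.1f} MiB")  # 307200000 bytes = 293.0 MiB
```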
Embedding Caching
Saving Embeddings
memmap = np.memmap(
    save_path,
    shape=corpus_embeddings.shape,
    mode="w+",
    dtype=corpus_embeddings.dtype
)
# Save in batches of 10000
length = corpus_embeddings.shape[0]
for i in range(0, length, 10000):
    memmap[i:i+10000] = corpus_embeddings[i:i+10000]
Loading Embeddings
corpus_embeddings = np.memmap(
    save_path,
    mode="r",
    dtype=dtype
).reshape(-1, dim)
Avoids re-encoding when testing different search parameters.
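The save and load snippets can be combined into a self-contained round trip. This sketch uses random data and a temporary file in place of real embeddings. Note that a raw memmap file stores neither shape nor dtype, so both must be known when loading.

```python
import os
import tempfile

import numpy as np

# Random stand-in for encoded corpus embeddings (25,000 vectors of dimension 8).
rng = np.random.default_rng(0)
corpus_embeddings = rng.standard_normal((25_000, 8)).astype(np.float32)
dim = corpus_embeddings.shape[1]
save_path = os.path.join(tempfile.mkdtemp(), "embeddings.memmap")

# Save: create the file at full size, then copy in chunks of 10,000 rows.
memmap = np.memmap(
    save_path,
    shape=corpus_embeddings.shape,
    mode="w+",
    dtype=corpus_embeddings.dtype,
)
for start in range(0, corpus_embeddings.shape[0], 10_000):
    memmap[start:start + 10_000] = corpus_embeddings[start:start + 10_000]
memmap.flush()  # make sure the data reaches the file
del memmap

# Load: the file is a flat buffer, so supply the dtype and restore the shape.
loaded = np.memmap(save_path, mode="r", dtype=np.float32).reshape(-1, dim)
print(loaded.shape)  # (25000, 8)
```

Loading with a different dtype or dimension than was used for saving would silently produce a wrong matrix, so both values should be kept alongside the cache path.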
Usage Examples
Basic Evaluation
python eval_fashioniq.py \
--model_name BAAI/BGE-VL-large \
--image_dir /path/to/fashioniq/images \
--batch_size 256 \
--k 100
With Embedding Cache
# First run: save embeddings
python eval_fashioniq.py \
--model_name BAAI/BGE-VL-large \
--image_dir /path/to/fashioniq/images \
--save_embedding \
--save_path ./fashioniq_embeddings.memmap
# Subsequent runs: load embeddings
python eval_fashioniq.py \
--model_name BAAI/BGE-VL-large \
--image_dir /path/to/fashioniq/images \
--load_embedding \
--save_path ./fashioniq_embeddings.memmap \
--k 50 # Test different k values quickly
FP16 Inference
python eval_fashioniq.py \
--model_name BAAI/BGE-VL-large \
--image_dir /path/to/fashioniq/images \
--fp16 \
--batch_size 512 # Larger batch with FP16
Approximate Search
python eval_fashioniq.py \
--model_name BAAI/BGE-VL-large \
--image_dir /path/to/fashioniq/images \
--index_factory "IVF1024,Flat" \
--k 100
Output Format
Console Output
FashionIQ tasks (shirt):
{'MRR@1': 0.1234, 'MRR@5': 0.2345, 'MRR@10': 0.2789, 'MRR@20': 0.3012, ...
'Recall@1': 0.1234, 'Recall@5': 0.3456, 'Recall@10': 0.4567, ...}
FashionIQ tasks (dress):
{'MRR@1': 0.1345, 'MRR@5': 0.2456, ...}
FashionIQ tasks (toptee):
{'MRR@1': 0.1456, 'MRR@5': 0.2567, ...}
shirt: 45.67 / 67.89
dress: 46.78 / 68.90
toptee: 47.89 / 69.01
overall: 46.78 / 68.60
Summary line format: category: Recall@10 / Recall@50, in percent. The values shown above are illustrative.
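The summary lines can be reproduced from the per-category metric dictionaries. The sketch below assumes the overall score is the unweighted mean of the three category scores and that values are scaled to percent for display; `summarize` is a hypothetical helper, and the inputs are the illustrative numbers from the sample output.

```python
# Illustrative per-category metrics taken from the sample console output.
metrics = {
    "shirt": {"Recall@10": 0.4567, "Recall@50": 0.6789},
    "dress": {"Recall@10": 0.4678, "Recall@50": 0.6890},
    "toptee": {"Recall@10": 0.4789, "Recall@50": 0.6901},
}


def summarize(per_category: dict):
    """Return printable summary lines and the (Recall@10, Recall@50) averages in percent."""
    lines = []
    for name, m in per_category.items():
        lines.append(f"{name}: {m['Recall@10'] * 100:.2f} / {m['Recall@50'] * 100:.2f}")
    count = len(per_category)
    r10 = sum(m["Recall@10"] for m in per_category.values()) / count * 100
    r50 = sum(m["Recall@50"] for m in per_category.values()) / count * 100
    lines.append(f"overall: {r10:.2f} / {r50:.2f}")
    return lines, (r10, r50)


lines, overall_scores = summarize(metrics)
print("\n".join(lines))
```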
Metrics Dictionary
{
'MRR@1': float, # Mean reciprocal rank at top-1
'MRR@5': float, # Mean reciprocal rank at top-5
'MRR@10': float,
'MRR@20': float,
'MRR@50': float,
'MRR@100': float,
'Recall@1': float, # Recall at top-1
'Recall@5': float, # Recall at top-5
'Recall@10': float,
'Recall@20': float,
'Recall@50': float,
'Recall@100': float
}
Implementation Details
Metric Computation
MRR Calculation:
for pred, label in zip(preds, labels):
    for i, x in enumerate(pred, 1):  # Ranks start at 1
        if x in label:
            # Credit every cutoff that covers this rank
            for k, cutoff in enumerate(cutoffs):
                if i <= cutoff:
                    mrrs[k] += 1 / i
            break  # Only the first correct result counts
mrrs /= len(preds)
Recall Calculation:
for pred, label in zip(preds, labels):
    for k, cutoff in enumerate(cutoffs):
        recall = len(set(pred[:cutoff]) & set(label)) / len(label)
        recalls[k] += recall
recalls /= len(preds)
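Both calculations can be combined into one small function and checked on toy data. The following is a pure-Python sketch under the definitions in the Metrics section, not the script's exact code; the predictions and labels are invented.

```python
from typing import Dict, List, Sequence


def evaluate(
    preds: List[List[str]],
    labels: List[List[str]],
    cutoffs: Sequence[int] = (1, 5, 10),
) -> Dict[str, float]:
    """Compute MRR@K and Recall@K, averaged over all queries."""
    mrrs = [0.0] * len(cutoffs)
    recalls = [0.0] * len(cutoffs)
    for pred, label in zip(preds, labels):
        positives = set(label)
        # MRR: only the first correct result counts, at every cutoff covering its rank.
        for rank, item in enumerate(pred, 1):
            if item in positives:
                for k, cutoff in enumerate(cutoffs):
                    if rank <= cutoff:
                        mrrs[k] += 1.0 / rank
                break
        # Recall: share of this query's targets found in the top-K.
        for k, cutoff in enumerate(cutoffs):
            recalls[k] += len(set(pred[:cutoff]) & positives) / len(positives)
    n = len(preds)
    metrics = {}
    for k, cutoff in enumerate(cutoffs):
        metrics[f"MRR@{cutoff}"] = mrrs[k] / n
        metrics[f"Recall@{cutoff}"] = recalls[k] / n
    return metrics


# Query 1 finds its target at rank 2; query 2 never finds its target.
preds = [["a", "b", "c"], ["x", "y", "z"]]
labels = [["b"], ["q"]]
result = evaluate(preds, labels, cutoffs=(1, 3))
print(result)
# {'MRR@1': 0.0, 'Recall@1': 0.0, 'MRR@3': 0.25, 'Recall@3': 0.5}
```

Query 1 contributes a reciprocal rank of 1/2 at K=3 and nothing at K=1, so averaging over two queries gives MRR@3 = 0.25 and Recall@3 = 0.5.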
Invalid Index Handling
# Filter out invalid FAISS results
indice = indice[indice != -1].tolist()
retrieval_results.append(image_corpus[indice]["content"])
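FAISS pads the result with -1 when it finds fewer than k neighbors, for example when k > corpus_size. These entries must be dropped before the lookup, because an index of -1 would otherwise refer to the last item in Python.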
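A small NumPy sketch of this filtering step, using a plain list of filenames in place of the corpus dataset. The filenames and index matrix are invented for illustration.

```python
import numpy as np

# Hypothetical corpus of three images and search results for k=5.
filenames = ["a.png", "b.png", "c.png"]
indices = np.array([[2, 0, 1, -1, -1], [1, 2, 0, -1, -1]])

retrieval_results = []
for row in indices:
    valid = row[row != -1].tolist()  # drop padding entries
    retrieval_results.append([filenames[i] for i in valid])

print(retrieval_results)
# [['c.png', 'a.png', 'b.png'], ['b.png', 'c.png', 'a.png']]
```

Without the mask, each -1 would map to `filenames[-1]` and add a spurious copy of "c.png" to every result list.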
Multi-Category Processing
Each category is evaluated independently:
1. Load category-specific corpus and queries
2. Build separate FAISS index
3. Perform retrieval
4. Compute metrics
5. Aggregate results
Performance Considerations
Batch Size
- Larger batch size → faster encoding
- Limited by GPU memory
- Typical values: 128-512
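Batching itself is simple to sketch: the items are split into fixed-size slices, and only the last slice may be smaller. `iter_batches` below is a hypothetical helper for illustration; the script's own encoding loop may be organized differently.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 1,000 items with the default batch size of 256.
sizes = [len(batch) for batch in iter_batches(list(range(1000)), 256)]
print(sizes)  # [256, 256, 256, 232]
```

Halving `batch_size` roughly halves the peak activation memory per step at the cost of more forward passes, which is why reducing it is the first remedy for out-of-memory errors.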
Index Type
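- Flat: Exact exhaustive search; no accuracy loss, but search time grows linearly with corpus size
- IVF: Approximate; searches only a subset of clusters, so it is faster with a small recall loss that depends on how many clusters are probed, and it requires training
- HNSW: Approximate graph-based search; fast queries with good recall on large corpora, at the cost of extra memory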
GPU Utilization
- Multi-GPU supported via faiss.index_cpu_to_all_gpus
- FP16 reduces memory and increases speed
- Batch processing saturates GPU
Caching Strategy
- Save embeddings for large corpus (100k+ images)
- Reuse across multiple evaluation runs
- Trade disk space for computation time
Expected Results
Typical BGE-VL performance on FashionIQ validation:
- Recall@10: 40-50%
- Recall@50: 65-75%
- MRR@10: 25-35%
State-of-the-art models achieve:
- Recall@10: 45-55%
- Recall@50: 70-80%
Troubleshooting
CUDA Out of Memory:
- Reduce batch_size
- Enable fp16
- Use smaller model
Slow Indexing:
- Use GPU acceleration
- Save embeddings for reuse
- Use approximate index
Poor Performance:
- Check image_dir path
- Verify data files format
- Ensure model loaded correctly