Implementation:FlagOpen FlagEmbedding BGE VL Eval FashionIQ

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Computer Vision, Multi-Modal Retrieval, Fashion AI, Model Evaluation
Last Updated	2026-02-09 00:00 GMT

Overview

An evaluation script for measuring BGE-VL model performance on the FashionIQ benchmark, which tests composed image-text retrieval for fashion items.

Description

This script implements a comprehensive evaluation pipeline for the FashionIQ dataset, a challenging benchmark for composed image retrieval where the task is to find a target fashion item given a reference image and a text description of desired modifications. The evaluation covers three fashion categories (shirt, dress, toptee) and computes standard retrieval metrics (MRR and Recall at various cutoffs).

The script orchestrates the complete evaluation workflow including encoding the fashion image corpus using BGE-VL's visual encoder, building FAISS indexes for efficient similarity search, encoding multi-modal queries (reference image + text modification), retrieving top-k candidates, and computing metrics. It supports both CPU and multi-GPU inference, optional embedding persistence for faster re-evaluation, and configurable FAISS index types for speed/accuracy tradeoffs.

Key features include efficient batch processing with configurable batch sizes, memory-mapped embedding storage for large corpora, automatic handling of invalid indices (-1) in FAISS results, and comprehensive metric reporting including MRR@K and Recall@K for K in {1, 5, 10, 20, 50, 100}.

Usage

Use this script to evaluate BGE-VL or similar multi-modal retrieval models on the FashionIQ benchmark to measure their ability to understand and retrieve fashion items based on image-text queries.

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/BGE_VL/eval/eval_fashioniq.py
Lines: 1-342

Signature

def main()

def index(
    model: Flag_mmret,
    corpus: datasets.Dataset,
    batch_size: int = 256,
    max_length: int = 512,
    index_factory: str = "Flat",
    save_path: str = None,
    save_embedding: bool = False,
    load_embedding: bool = False
) -> faiss.Index

def search(
    model: Flag_mmret,
    queries: datasets,
    faiss_index: faiss.Index,
    k: int = 100,
    batch_size: int = 256,
    max_length: int = 512
) -> Tuple[np.ndarray, np.ndarray]

def evaluate(
    preds: List[List[str]],
    labels: List[List[str]],
    cutoffs: List[int] = [1, 5, 10, 20, 50, 100]
) -> dict

Import

# Typically run as a script
# python eval_fashioniq.py --model_name BAAI/BGE-VL-large

I/O Contract

Inputs

Name	Type	Required	Description
model_name	str	No	Model checkpoint path (default: "BAAI/BGE-VL-large")
image_dir	str	No	Directory containing FashionIQ images
batch_size	int	No	Inference batch size (default: 256)
max_query_length	int	No	Maximum query length (default: 64)
max_passage_length	int	No	Maximum passage length (default: 77)
k	int	No	Number of neighbors to retrieve (default: 100)
index_factory	str	No	FAISS index type (default: "Flat")
fp16	bool	No	Use FP16 inference (default: False)
save_embedding	bool	No	Save embeddings to disk (default: False)
load_embedding	bool	No	Load cached embeddings (default: False)
save_path	str	No	Path for embedding cache (default: "embeddings.memmap")
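
The options above could be declared along the following lines. This is only a sketch: the option names and defaults are taken from the table, but the use of argparse, the store_true handling of the boolean flags, and the None default for image_dir are assumptions, and the script itself may parse its arguments differently (for example with a dataclass-based parser).

```python
import argparse

def build_parser():
    """Declare the documented options; the parsing mechanism is assumed."""
    parser = argparse.ArgumentParser(description="FashionIQ evaluation (sketch)")
    parser.add_argument("--model_name", type=str, default="BAAI/BGE-VL-large")
    # The table documents no default for image_dir, so None is a placeholder.
    parser.add_argument("--image_dir", type=str, default=None)
    parser.add_argument("--batch_size", type=int, default=256)
    parser.add_argument("--max_query_length", type=int, default=64)
    parser.add_argument("--max_passage_length", type=int, default=77)
    parser.add_argument("--k", type=int, default=100)
    parser.add_argument("--index_factory", type=str, default="Flat")
    # Boolean options default to False and are switched on by the bare flag.
    parser.add_argument("--fp16", action="store_true")
    parser.add_argument("--save_embedding", action="store_true")
    parser.add_argument("--load_embedding", action="store_true")
    parser.add_argument("--save_path", type=str, default="embeddings.memmap")
    return parser

# Parse an explicit argument list instead of sys.argv for the demonstration.
args = build_parser().parse_args(["--k", "50", "--fp16"])
```

Passing an explicit list to parse_args keeps the example independent of the command line it is run from.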

Outputs

Name	Type	Description
metrics_shirt	dict	Metrics for shirt category (MRR@K, Recall@K)
metrics_dress	dict	Metrics for dress category
metrics_toptee	dict	Metrics for toptee category
overall_scores	tuple	Average Recall@10 and Recall@50 across categories

FashionIQ Dataset

Task Description

Composed Image Retrieval: Given a reference fashion image and a text description of desired changes, retrieve the target fashion item that matches the modified description.

Example:

Reference Image: A blue striped shirt
Text Modification: "make it solid color and change to red"
Target: A solid red shirt

Dataset Structure

Categories:

shirt: Men's and women's shirts
dress: Women's dresses
toptee: Women's tops and t-shirts

Data Files:

fashioniq_{category}_corpus.jsonl: Image corpus with "content" field (image filename)
fashioniq_{category}_query_val.jsonl: Validation queries with:

 * q_img: Reference image filename
 * q_text: Text modification description
 * positive_key: Target image filename(s)
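
A minimal sketch of reading these records, assuming one JSON object per line with the fields listed above. The filenames and text below are hypothetical, positive_key is assumed to be a list of filenames, and the script itself may load the files through the datasets library rather than the json module.

```python
import json

# Hypothetical sample lines; real files contain one JSON object per line.
corpus_lines = [
    '{"content": "B000AAAAAA.png"}',
    '{"content": "B000BBBBBB.png"}',
]
query_lines = [
    '{"q_img": "B000AAAAAA.png", "q_text": "is red and has no stripes", '
    '"positive_key": ["B000BBBBBB.png"]}',
]

def parse_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]

corpus = parse_jsonl(corpus_lines)
queries = parse_jsonl(query_lines)

# Map each corpus filename to its row position, which is the id a
# similarity index would return for that image.
filename_to_row = {rec["content"]: i for i, rec in enumerate(corpus)}
target_rows = [filename_to_row[name] for name in queries[0]["positive_key"]]
```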

Metrics

Mean Reciprocal Rank (MRR@K):

Based on the rank of the first correct result for each query
Reciprocal rank = 1/rank if rank ≤ K, else 0
MRR@K is the reciprocal rank averaged over all queries

Recall@K:

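Fraction of a query's target images that appear in the top-K, averaged over all queries
Equals the fraction of queries answered within the top-K when each query has a single target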

Standard Cutoffs: K ∈ {1, 5, 10, 20, 50, 100}
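
Written out, the two metrics take the following form. The notation is introduced here for illustration and is not from the script: Q is the query set, r_q the 1-based rank of the first correct result for query q (the term is 0 if no correct result is retrieved), P_q the set of target images for q, and R_q^K the top-K retrieved items.

```latex
\mathrm{MRR@}K = \frac{1}{|Q|} \sum_{q \in Q} \frac{\mathbb{1}[r_q \le K]}{r_q},
\qquad
\mathrm{Recall@}K = \frac{1}{|Q|} \sum_{q \in Q} \frac{|R_q^K \cap P_q|}{|P_q|}
```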

Evaluation Pipeline

Step 1: Model Initialization

model = Flag_mmret(
    model_name=args.model_name,
    normlized=True,
    image_dir=args.image_dir,
    use_fp16=False
)

Step 2: Index Image Corpus

For each category (shirt, dress, toptee):

1. Load image corpus dataset
2. Encode images using model.encode_corpus(corpus_type='image')
3. Build FAISS index (Flat for exact search)
4. Optionally save embeddings to disk

Step 3: Encode Queries

1. Load query dataset with q_img, q_text, positive_key
2. Encode multi-modal queries: model.encode_queries([q_text, q_img], query_type='mm_it')
3. Multi-modal encoding combines image and text representations

Step 4: Search

1. Query FAISS index with encoded queries
2. Retrieve top-k candidates per query
3. Return scores and indices
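
To make the search step concrete, the sketch below performs the same kind of exact inner-product search that a Flat index provides, written in plain Python as a stand-in. It does not call FAISS, the function name flat_search is hypothetical, and the -inf score used for padding is an arbitrary choice for this example; only the -1 id padding follows the FAISS convention.

```python
import heapq

def flat_search(queries, corpus, k):
    """Exact inner-product search: score every corpus vector per query."""
    all_scores, all_indices = [], []
    for q in queries:
        scored = [
            (sum(a * b for a, b in zip(q, vec)), idx)
            for idx, vec in enumerate(corpus)
        ]
        # Keep the k highest-scoring entries, best first.
        top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        scores = [s for s, _ in top]
        indices = [i for _, i in top]
        # Pad ids with -1 when the corpus holds fewer than k vectors.
        pad = k - len(top)
        all_scores.append(scores + [float("-inf")] * pad)
        all_indices.append(indices + [-1] * pad)
    return all_scores, all_indices

# Toy unit-length vectors, so inner product equals cosine similarity.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
queries = [[0.0, 1.0]]
scores, indices = flat_search(queries, corpus, k=4)
```

Because k exceeds the corpus size here, the last id in the result row is -1, which is the case the evaluation step has to filter out.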

Step 5: Evaluation

1. Convert indices to image filenames
2. Compare with ground truth positive_key
3. Compute MRR and Recall at all cutoffs
4. Report per-category and overall metrics

FAISS Index

Index Types

Flat: Exact search, exhaustive comparison (default)
IVF: Inverted file index, faster but approximate
HNSW: Hierarchical navigable small world graph
PQ: Product quantization for compression

GPU Acceleration

if model.device == torch.device("cuda"):
    co = faiss.GpuMultipleClonerOptions()
    co.useFloat16 = True
    faiss_index = faiss.index_cpu_to_all_gpus(faiss_index, co)

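Clones the index onto all available GPUs for faster search. With the default cloner options each GPU holds a full replica of the index; setting co.shard = True splits it across GPUs instead.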

Memory Management

Embeddings stored in float32 for FAISS compatibility
Optional memory-mapped storage for large corpora
Batch processing to avoid OOM errors

Embedding Caching

Saving Embeddings

memmap = np.memmap(
    save_path,
    shape=corpus_embeddings.shape,
    mode="w+",
    dtype=corpus_embeddings.dtype
)
# Save in batches of 10000 rows
length = corpus_embeddings.shape[0]
for i in range(0, length, 10000):
    memmap[i:i+10000] = corpus_embeddings[i:i+10000]

Loading Embeddings

corpus_embeddings = np.memmap(
    save_path,
    mode="r",
    dtype=dtype
).reshape(-1, dim)

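Avoids re-encoding when testing different search parameters. The memmap file stores raw values only, with no shape or type header, so dtype and dim must match those used when the embeddings were saved.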

Usage Examples

Basic Evaluation

python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --batch_size 256 \
    --k 100

With Embedding Cache

# First run: save embeddings
python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --save_embedding \
    --save_path ./fashioniq_embeddings.memmap

# Subsequent runs: load embeddings
python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --load_embedding \
    --save_path ./fashioniq_embeddings.memmap \
    --k 50  # Test different k values quickly

FP16 Inference

python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --fp16 \
    --batch_size 512  # Larger batch with FP16

Approximate Search

python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --index_factory "IVF1024,Flat" \
    --k 100

Output Format

Console Output

FashionIQ tasks (shirt):
{'MRR@1': 0.1234, 'MRR@5': 0.2345, 'MRR@10': 0.2789, 'MRR@20': 0.3012, ...
 'Recall@1': 0.1234, 'Recall@5': 0.3456, 'Recall@10': 0.4567, ...}

FashionIQ tasks (dress):
{'MRR@1': 0.1345, 'MRR@5': 0.2456, ...}

FashionIQ tasks (toptee):
{'MRR@1': 0.1456, 'MRR@5': 0.2567, ...}

shirt: 45.67 / 67.89
dress: 46.78 / 68.90
toptee: 47.89 / 69.01
overall: 46.78 / 68.60

Format: Category: Recall@10 / Recall@50
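
The summary lines can be reproduced from the per-category dictionaries as sketched below. The function name summarize is hypothetical, and the overall score is assumed to be the unweighted mean over the three categories; the inputs are the placeholder values from the sample output above.

```python
def summarize(per_category):
    """Format Recall@10 / Recall@50 per category plus their mean, in percent."""
    lines = []
    r10_values, r50_values = [], []
    for name, metrics in per_category.items():
        r10 = metrics["Recall@10"] * 100
        r50 = metrics["Recall@50"] * 100
        r10_values.append(r10)
        r50_values.append(r50)
        lines.append(f"{name}: {r10:.2f} / {r50:.2f}")
    # Unweighted mean across categories (an assumption of this sketch).
    overall_r10 = sum(r10_values) / len(r10_values)
    overall_r50 = sum(r50_values) / len(r50_values)
    lines.append(f"overall: {overall_r10:.2f} / {overall_r50:.2f}")
    return lines

report = summarize({
    "shirt": {"Recall@10": 0.4567, "Recall@50": 0.6789},
    "dress": {"Recall@10": 0.4678, "Recall@50": 0.6890},
    "toptee": {"Recall@10": 0.4789, "Recall@50": 0.6901},
})
```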

Metrics Dictionary

{
    'MRR@1': float,   # Mean reciprocal rank at top-1
    'MRR@5': float,   # Mean reciprocal rank at top-5
    'MRR@10': float,
    'MRR@20': float,
    'MRR@50': float,
    'MRR@100': float,
    'Recall@1': float,   # Recall at top-1
    'Recall@5': float,   # Recall at top-5
    'Recall@10': float,
    'Recall@20': float,
    'Recall@50': float,
    'Recall@100': float
}

Implementation Details

Metric Computation

MRR Calculation:

mrrs = np.zeros(len(cutoffs))  # one accumulator per cutoff
for pred, label in zip(preds, labels):
    for i, x in enumerate(pred, 1):  # Start from rank 1
        if x in label:
            for k, cutoff in enumerate(cutoffs):
                if i <= cutoff:
                    mrrs[k] += 1 / i
            break
mrrs /= len(preds)

Recall Calculation:

recalls = np.zeros(len(cutoffs))  # one accumulator per cutoff
for pred, label in zip(preds, labels):
    for k, cutoff in enumerate(cutoffs):
        recall = len(set(pred[:cutoff]) & set(label)) / len(label)
        recalls[k] += recall
recalls /= len(preds)
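
The two loops above can be combined into a self-contained function and checked on toy data. The version below is a sketch that uses plain Python lists instead of NumPy arrays and hypothetical item names; it follows the same logic but is not the script's own evaluate function.

```python
def evaluate(preds, labels, cutoffs=(1, 5, 10)):
    """Compute MRR@K and Recall@K for ranked predictions."""
    mrrs = [0.0] * len(cutoffs)
    recalls = [0.0] * len(cutoffs)
    for pred, label in zip(preds, labels):
        # Reciprocal rank of the first correct item, credited to every
        # cutoff that is at least as large as its rank.
        for rank, item in enumerate(pred, 1):
            if item in label:
                for k, cutoff in enumerate(cutoffs):
                    if rank <= cutoff:
                        mrrs[k] += 1 / rank
                break
        # Fraction of this query's targets found in the top-`cutoff`.
        for k, cutoff in enumerate(cutoffs):
            recalls[k] += len(set(pred[:cutoff]) & set(label)) / len(label)
    metrics = {}
    for k, cutoff in enumerate(cutoffs):
        metrics[f"MRR@{cutoff}"] = mrrs[k] / len(preds)
        metrics[f"Recall@{cutoff}"] = recalls[k] / len(preds)
    return metrics

# Query 1 has one target at rank 2; query 2 has two targets, one at rank 3.
preds = [["a", "b", "c"], ["x", "y", "z"]]
labels = [["b"], ["z", "q"]]
metrics = evaluate(preds, labels, cutoffs=(1, 2, 3))
```

The second query shows how the two metrics differ: its first hit at rank 3 contributes 1/3 to MRR@3, while only one of its two targets is retrieved, so it contributes 0.5 to Recall@3.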

Invalid Index Handling

# Filter out invalid FAISS results
indice = indice[indice != -1].tolist()
retrieval_results.append(image_corpus[indice]["content"])

FAISS pads a result row with -1 when it finds fewer than k neighbors, for example when k > corpus_size.
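
The filtering step can be illustrated without FAISS or NumPy. In the sketch below the function name and the filenames are hypothetical; it drops the -1 padding before looking up filenames, which matters because a negative index in Python would otherwise silently select an item from the end of the list.

```python
def indices_to_filenames(index_rows, corpus_filenames):
    """Map each row of retrieved ids to filenames, dropping -1 padding."""
    results = []
    for row in index_rows:
        valid = [i for i in row if i != -1]
        results.append([corpus_filenames[i] for i in valid])
    return results

corpus_filenames = ["a.png", "b.png", "c.png"]  # hypothetical names
# Rows padded with -1, as happens when k exceeds the number of hits.
index_rows = [[2, 0, 1, -1, -1], [1, -1, -1, -1, -1]]
retrieved = indices_to_filenames(index_rows, corpus_filenames)
```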

Multi-Category Processing

Each category is evaluated independently:

1. Load category-specific corpus and queries
2. Build separate FAISS index
3. Perform retrieval
4. Compute metrics
5. Aggregate results

Performance Considerations

Batch Size

Larger batch size → faster encoding
Limited by GPU memory
Typical values: 128-512

Index Type

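Flat: Exact, search cost grows linearly with corpus size, best accuracy
IVF: Fast, approximate, accuracy depends on how many clusters are probed
HNSW: Approximate graph search, typically fast with good accuracy on large corpora, at the cost of extra memory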

GPU Utilization

Multi-GPU supported via faiss.index_cpu_to_all_gpus
FP16 reduces memory and increases speed
Batch processing saturates GPU

Caching Strategy

Save embeddings for large corpus (100k+ images)
Reuse across multiple evaluation runs
Trade disk space for computation time

Expected Results

Typical BGE-VL performance on FashionIQ validation:

Recall@10: 40-50%
Recall@50: 65-75%
MRR@10: 25-35%

State-of-the-art models achieve:

Recall@10: 45-55%
Recall@50: 70-80%

Troubleshooting

CUDA Out of Memory:

Reduce batch_size
Enable fp16
Use smaller model

Slow Indexing:

Use GPU acceleration
Save embeddings for reuse
Use approximate index

Poor Performance:

Check image_dir path
Verify data files format
Ensure model loaded correctly

Related Pages

Principle:FlagOpen_FlagEmbedding_Multimodal_Retrieval
