Implementation:FlagOpen FlagEmbedding BGE VL Eval FashionIQ

From Leeroopedia


Knowledge Sources
Domains: Computer Vision, Multi-Modal Retrieval, Fashion AI, Model Evaluation
Last Updated: 2026-02-09 00:00 GMT

Overview

An evaluation script for measuring BGE-VL model performance on the FashionIQ benchmark, which tests composed image-text retrieval for fashion items.

Description

This script implements a comprehensive evaluation pipeline for the FashionIQ dataset, a challenging benchmark for composed image retrieval where the task is to find a target fashion item given a reference image and a text description of desired modifications. The evaluation covers three fashion categories (shirt, dress, toptee) and computes standard retrieval metrics (MRR and Recall at various cutoffs).

The script runs the full evaluation workflow: encoding the fashion image corpus with BGE-VL's visual encoder, building a FAISS index for similarity search, encoding multi-modal queries (reference image plus text modification), retrieving the top-k candidates, and computing metrics. It supports CPU and multi-GPU inference, optional embedding persistence for faster re-evaluation, and configurable FAISS index types for speed/accuracy trade-offs.

Key features include efficient batch processing with configurable batch sizes, memory-mapped embedding storage for large corpora, automatic handling of invalid indices (-1) in FAISS results, and comprehensive metric reporting including MRR@K and Recall@K for K in {1, 5, 10, 20, 50, 100}.

Usage

Use this script to evaluate BGE-VL or similar multi-modal retrieval models on the FashionIQ benchmark to measure their ability to understand and retrieve fashion items based on image-text queries.

Code Reference

Source Location

Signature

def main()

def index(
    model: Flag_mmret,
    corpus: datasets.Dataset,
    batch_size: int = 256,
    max_length: int = 512,
    index_factory: str = "Flat",
    save_path: str = None,
    save_embedding: bool = False,
    load_embedding: bool = False
) -> faiss.Index

def search(
    model: Flag_mmret,
    queries: datasets,
    faiss_index: faiss.Index,
    k: int = 100,
    batch_size: int = 256,
    max_length: int = 512
) -> Tuple[np.ndarray, np.ndarray]

def evaluate(
    preds: List[List[str]],
    labels: List[List[str]],
    cutoffs: List[int] = [1, 5, 10, 20, 50, 100]
) -> dict

Import

# Typically run as a script
# python eval_fashioniq.py --model_name BAAI/BGE-VL-large

I/O Contract

Inputs

Name | Type | Required | Description
model_name | str | No | Model checkpoint path (default: "BAAI/BGE-VL-large")
image_dir | str | No | Directory containing FashionIQ images
batch_size | int | No | Inference batch size (default: 256)
max_query_length | int | No | Maximum query length (default: 64)
max_passage_length | int | No | Maximum passage length (default: 77)
k | int | No | Number of neighbors to retrieve (default: 100)
index_factory | str | No | FAISS index type (default: "Flat")
fp16 | bool | No | Use FP16 inference (default: False)
save_embedding | bool | No | Save embeddings to disk (default: False)
load_embedding | bool | No | Load cached embeddings (default: False)
save_path | str | No | Path for embedding cache (default: "embeddings.memmap")
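
The inputs above map naturally onto command-line flags. As a minimal sketch, assuming a plain argparse parser (the actual script may declare its arguments differently, and build_parser is a hypothetical helper name), the names and defaults from the table could be declared like this:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the input table; not the script's own code."""
    p = argparse.ArgumentParser(description="FashionIQ evaluation (sketch)")
    p.add_argument("--model_name", type=str, default="BAAI/BGE-VL-large")
    # The table lists no default for image_dir, so None is an assumption.
    p.add_argument("--image_dir", type=str, default=None)
    p.add_argument("--batch_size", type=int, default=256)
    p.add_argument("--max_query_length", type=int, default=64)
    p.add_argument("--max_passage_length", type=int, default=77)
    p.add_argument("--k", type=int, default=100)
    p.add_argument("--index_factory", type=str, default="Flat")
    # Boolean options are modeled as on/off flags that default to False.
    p.add_argument("--fp16", action="store_true")
    p.add_argument("--save_embedding", action="store_true")
    p.add_argument("--load_embedding", action="store_true")
    p.add_argument("--save_path", type=str, default="embeddings.memmap")
    return p


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--k", "50", "--fp16"])
print(args.k, args.fp16, args.index_factory)
```

Any option not passed on the command line keeps the default listed in the table.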

Outputs

Name | Type | Description
metrics_shirt | dict | Metrics for shirt category (MRR@K, Recall@K)
metrics_dress | dict | Metrics for dress category
metrics_toptee | dict | Metrics for toptee category
overall_scores | tuple | Average Recall@10 and Recall@50 across categories

FashionIQ Dataset

Task Description

Composed Image Retrieval: Given a reference fashion image and a text description of desired changes, retrieve the target fashion item that matches the modified description.

Example:

  • Reference Image: A blue striped shirt
  • Text Modification: "make it solid color and change to red"
  • Target: A solid red shirt

Dataset Structure

Categories:

  • shirt: Men's and women's shirts
  • dress: Women's dresses
  • toptee: Women's tops and t-shirts

Data Files:

  • fashioniq_{category}_corpus.jsonl: Image corpus with "content" field (image filename)
  • fashioniq_{category}_query_val.jsonl: Validation queries with:
      • q_img: Reference image filename
      • q_text: Text modification description
      • positive_key: Target image filename(s)
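
To illustrate the query record layout, here is a minimal sketch that parses one JSONL line. The field names come from the list above; the sample filenames and text are made up, parse_query is a hypothetical helper, and the handling of a bare-string positive_key is an assumption about how a single target might be stored.

```python
import json

# Hypothetical sample line; real filenames and text come from the dataset files.
sample_line = '{"q_img": "ref_0001.png", "q_text": "is solid red", "positive_key": "tgt_0042.png"}'


def parse_query(line: str) -> dict:
    """Parse one query record and normalize positive_key to a list of filenames."""
    record = json.loads(line)
    positives = record["positive_key"]
    if isinstance(positives, str):  # assumption: a single target may be a bare string
        positives = [positives]
    return {
        "q_img": record["q_img"],
        "q_text": record["q_text"],
        "positive_key": list(positives),
    }


query = parse_query(sample_line)
print(query["positive_key"])
```

Normalizing positive_key to a list up front keeps the later metric code uniform, since it can always treat the ground truth as a collection.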

Metrics

Mean Reciprocal Rank (MRR@K):

  • Measures rank of first correct result
  • Per-query score is 1/rank if rank ≤ K, else 0
  • Averaged over all queries

Recall@K:

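  • Per query, the fraction of ground-truth targets found in the top-K, averaged over all queries (with a single target per query, this is the fraction of queries whose target appears in the top-K)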
  • Measures retrieval coverage

Standard Cutoffs: K ∈ {1, 5, 10, 20, 50, 100}

Evaluation Pipeline

Step 1: Model Initialization

model = Flag_mmret(
    model_name=args.model_name,
    normlized=True,
    image_dir=args.image_dir,
    use_fp16=False
)

Step 2: Index Image Corpus

For each category (shirt, dress, toptee):

  1. Load the image corpus dataset
  2. Encode images using model.encode_corpus(corpus_type='image')
  3. Build the FAISS index (Flat for exact search)
  4. Optionally save embeddings to disk

Step 3: Encode Queries

  1. Load the query dataset with q_img, q_text, positive_key
  2. Encode multi-modal queries: model.encode_queries([q_text, q_img], query_type='mm_it')
  3. Multi-modal encoding combines image and text representations

Step 4: Search

  1. Query the FAISS index with the encoded queries
  2. Retrieve the top-k candidates per query
  3. Return scores and indices
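
To make the search step concrete without requiring FAISS, the sketch below reproduces what an exact (Flat) index computes, using plain NumPy. It assumes inner-product scoring over L2-normalized embeddings, which is an assumption here rather than something this page confirms; exact_search is a hypothetical helper, and the random vectors stand in for real embeddings.

```python
import numpy as np


def exact_search(queries: np.ndarray, corpus: np.ndarray, k: int):
    """Brute-force top-k by inner product; returns (scores, indices), one row per query."""
    k = min(k, corpus.shape[0])                       # this sketch clamps k instead of padding
    sims = queries @ corpus.T                         # (n_queries, n_corpus) similarity matrix
    order = np.argsort(-sims, axis=1)[:, :k]          # best-first corpus indices
    scores = np.take_along_axis(sims, order, axis=1)  # scores aligned with those indices
    return scores, order


rng = np.random.default_rng(0)
corpus = rng.normal(size=(50, 8)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # L2-normalize each row
queries = corpus[:3].copy()                              # each query equals a corpus item

scores, indices = exact_search(queries, corpus, k=5)
print(indices[:, 0])  # the top hit for each query is the item it was copied from
```

An approximate index trades some of this exactness for speed, which is why Flat serves as the accuracy reference.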

Step 5: Evaluation

  1. Convert indices to image filenames
  2. Compare with the ground-truth positive_key
  3. Compute MRR and Recall at all cutoffs
  4. Report per-category and overall metrics

FAISS Index

Index Types

  • Flat: Exact search, exhaustive comparison (default)
  • IVF: Inverted file index, faster but approximate
  • HNSW: Hierarchical navigable small world graph
  • PQ: Product quantization for compression

GPU Acceleration

if model.device == torch.device("cuda"):
    co = faiss.GpuMultipleClonerOptions()
    co.useFloat16 = True
    faiss_index = faiss.index_cpu_to_all_gpus(faiss_index, co)

Clones the index onto all visible GPUs for faster search. By default each GPU holds a full replica; setting co.shard = True splits the index across GPUs instead.

Memory Management

  • Embeddings stored in float32 for FAISS compatibility
  • Optional memory-mapped storage for large corpora
  • Batch processing to avoid OOM errors

Embedding Caching

Saving Embeddings

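import numpy as np

length = corpus_embeddings.shape[0]  # number of corpus rows
memmap = np.memmap(
    save_path,
    shape=corpus_embeddings.shape,
    mode="w+",
    dtype=corpus_embeddings.dtype
)
# Save in batches of 10000 rows
for i in range(0, length, 10000):
    memmap[i:i+10000] = corpus_embeddings[i:i+10000]
memmap.flush()  # make sure all rows reach disk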

Loading Embeddings

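# The raw memmap file stores neither dtype nor shape, so dtype and dim
# must match the values used when the embeddings were saved.
corpus_embeddings = np.memmap(
    save_path,
    mode="r",
    dtype=dtype
).reshape(-1, dim)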

Avoids re-encoding when testing different search parameters.
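
The save and load snippets can be combined into a small round trip. The following is a sketch under stated assumptions: the tiny array stands in for real corpus embeddings, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

import numpy as np

# Stand-in embeddings; in the real pipeline these come from the model.
corpus_embeddings = np.arange(12, dtype=np.float32).reshape(4, 3)
dim = corpus_embeddings.shape[1]

with tempfile.TemporaryDirectory() as tmp:
    save_path = os.path.join(tmp, "embeddings.memmap")

    # Write: a raw memmap file has no header, so shape and dtype are not stored.
    out = np.memmap(save_path, shape=corpus_embeddings.shape, mode="w+",
                    dtype=corpus_embeddings.dtype)
    out[:] = corpus_embeddings
    out.flush()
    del out  # release the file before reopening it

    # Read: the caller must supply the same dtype and embedding dimension.
    loaded = np.memmap(save_path, mode="r", dtype=np.float32).reshape(-1, dim)
    restored = np.array(loaded)  # copy into memory so the file can be removed
    del loaded

print(restored.shape)
```

Because the file carries no metadata, loading with a different dtype or dim would silently produce wrongly shaped or wrongly valued embeddings, so it is worth recording both alongside the cache.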

Usage Examples

Basic Evaluation

python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --batch_size 256 \
    --k 100

With Embedding Cache

# First run: save embeddings
python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --save_embedding \
    --save_path ./fashioniq_embeddings.memmap

# Subsequent runs: load embeddings
python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --load_embedding \
    --save_path ./fashioniq_embeddings.memmap \
    --k 50  # Test different k values quickly

FP16 Inference

python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --fp16 \
    --batch_size 512  # Larger batch with FP16

Approximate Search

python eval_fashioniq.py \
    --model_name BAAI/BGE-VL-large \
    --image_dir /path/to/fashioniq/images \
    --index_factory "IVF1024,Flat" \
    --k 100

Output Format

Console Output

FashionIQ tasks (shirt):
{'MRR@1': 0.1234, 'MRR@5': 0.2345, 'MRR@10': 0.2789, 'MRR@20': 0.3012, ...
 'Recall@1': 0.1234, 'Recall@5': 0.3456, 'Recall@10': 0.4567, ...}

FashionIQ tasks (dress):
{'MRR@1': 0.1345, 'MRR@5': 0.2456, ...}

FashionIQ tasks (toptee):
{'MRR@1': 0.1456, 'MRR@5': 0.2567, ...}

shirt: 45.67 / 67.89
dress: 46.78 / 68.90
toptee: 47.89 / 69.01
overall: 46.78 / 68.60

Format: category: Recall@10 / Recall@50, both in percent
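
The summary lines can be derived from the per-category metric dictionaries. This is a minimal sketch, assuming the overall score is the unweighted mean over the three categories (the page only says "average"); summarize is a hypothetical helper, and the numbers are the illustrative values from the console output above.

```python
# Per-category metrics as fractions (illustrative values, not measured results).
metrics = {
    "shirt": {"Recall@10": 0.4567, "Recall@50": 0.6789},
    "dress": {"Recall@10": 0.4678, "Recall@50": 0.6890},
    "toptee": {"Recall@10": 0.4789, "Recall@50": 0.6901},
}


def summarize(metrics: dict) -> list:
    """Build 'category: Recall@10 / Recall@50' lines in percent, plus an overall mean."""
    lines = []
    r10_values, r50_values = [], []
    for category, scores in metrics.items():
        r10, r50 = scores["Recall@10"], scores["Recall@50"]
        r10_values.append(r10)
        r50_values.append(r50)
        lines.append(f"{category}: {100 * r10:.2f} / {100 * r50:.2f}")
    # Assumption: unweighted mean across categories.
    overall_r10 = sum(r10_values) / len(r10_values)
    overall_r50 = sum(r50_values) / len(r50_values)
    lines.append(f"overall: {100 * overall_r10:.2f} / {100 * overall_r50:.2f}")
    return lines


lines = summarize(metrics)
print("\n".join(lines))
```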

Metrics Dictionary

{
    'MRR@1': float,   # Mean reciprocal rank at top-1
    'MRR@5': float,   # Mean reciprocal rank at top-5
    'MRR@10': float,
    'MRR@20': float,
    'MRR@50': float,
    'MRR@100': float,
    'Recall@1': float,   # Recall at top-1
    'Recall@5': float,   # Recall at top-5
    'Recall@10': float,
    'Recall@20': float,
    'Recall@50': float,
    'Recall@100': float
}

Implementation Details

Metric Computation

MRR Calculation:

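import numpy as np

mrrs = np.zeros(len(cutoffs))  # one accumulator per cutoff
for pred, label in zip(preds, labels):
    for i, x in enumerate(pred, 1):  # Start from rank 1
        if x in label:
            # Credit every cutoff that reaches the rank of the first hit
            for k, cutoff in enumerate(cutoffs):
                if i <= cutoff:
                    mrrs[k] += 1 / i
            break  # only the first correct result counts
mrrs /= len(preds)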

Recall Calculation:

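import numpy as np

recalls = np.zeros(len(cutoffs))  # one accumulator per cutoff
for pred, label in zip(preds, labels):
    for k, cutoff in enumerate(cutoffs):
        # Fraction of this query's targets found in the top `cutoff` results
        recall = len(set(pred[:cutoff]) & set(label)) / len(label)
        recalls[k] += recall
recalls /= len(preds)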
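
Putting the two loops together, a self-contained version of the metric computation might look like the sketch below. It follows the logic shown above but uses plain Python lists, and the toy predictions and labels are invented for illustration; it is not the script's exact evaluate function.

```python
from typing import Dict, List, Sequence


def evaluate(preds: List[List[str]], labels: List[List[str]],
             cutoffs: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """Compute MRR@K and Recall@K, averaged over all queries."""
    mrrs = [0.0] * len(cutoffs)
    recalls = [0.0] * len(cutoffs)
    for pred, label in zip(preds, labels):
        positives = set(label)
        # MRR: only the first relevant hit counts, and only for cutoffs that reach its rank.
        for rank, item in enumerate(pred, 1):
            if item in positives:
                for k, cutoff in enumerate(cutoffs):
                    if rank <= cutoff:
                        mrrs[k] += 1 / rank
                break
        # Recall: share of this query's positives found within the top `cutoff` results.
        for k, cutoff in enumerate(cutoffs):
            recalls[k] += len(set(pred[:cutoff]) & positives) / len(positives)
    n = len(preds)
    metrics = {f"MRR@{c}": mrrs[k] / n for k, c in enumerate(cutoffs)}
    metrics.update({f"Recall@{c}": recalls[k] / n for k, c in enumerate(cutoffs)})
    return metrics


# Toy data: query 1 finds its target at rank 2, query 2 at rank 1.
preds = [["a", "b", "c"], ["x", "y", "z"]]
labels = [["b"], ["x"]]
result = evaluate(preds, labels, cutoffs=(1, 2))
print(result)  # {'MRR@1': 0.5, 'MRR@2': 0.75, 'Recall@1': 0.5, 'Recall@2': 1.0}
```

In the toy run, the rank-2 hit contributes nothing at K=1 and 1/2 at K=2, which is why MRR@2 (0.75) sits below Recall@2 (1.0).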

Invalid Index Handling

# Filter out invalid FAISS results
indice = indice[indice != -1].tolist()
retrieval_results.append(image_corpus[indice]["content"])

FAISS pads the result with -1 when it finds fewer than k neighbors, for example when k > corpus_size.
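
The filtering matters because, with plain Python sequences, an index of -1 would silently select the last corpus item instead of raising an error. The sketch below shows the index-to-filename conversion with the padding removed; indices_to_filenames is a hypothetical helper and the filenames are made up.

```python
from typing import List, Sequence


def indices_to_filenames(indices: Sequence[Sequence[int]],
                         filenames: Sequence[str]) -> List[List[str]]:
    """Map each row of retrieved indices to filenames, skipping -1 padding."""
    return [[filenames[i] for i in row if i != -1] for row in indices]


# Hypothetical corpus of three images searched with k=5: two slots per row are padding.
filenames = ["img_a.png", "img_b.png", "img_c.png"]
indices = [[2, 0, 1, -1, -1], [1, 2, 0, -1, -1]]

results = indices_to_filenames(indices, filenames)
print(results[0])
```

Each result row then contains only real retrievals, so the rank positions used by the metrics stay meaningful.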

Multi-Category Processing

Each category is evaluated independently:

  1. Load the category-specific corpus and queries
  2. Build a separate FAISS index
  3. Perform retrieval
  4. Compute metrics
  5. Aggregate results

Performance Considerations

Batch Size

  • Larger batch size → faster encoding
  • Limited by GPU memory
  • Typical values: 128-512

Index Type

  • Flat: Exact search, slowest on large corpora, best accuracy
  • IVF: Faster approximate search with a minor accuracy loss
  • HNSW: Fast approximate search with good accuracy on large corpora

GPU Utilization

  • Multi-GPU supported via faiss.index_cpu_to_all_gpus
  • FP16 reduces memory and increases speed
  • Batch processing saturates GPU

Caching Strategy

  • Save embeddings for large corpus (100k+ images)
  • Reuse across multiple evaluation runs
  • Trade disk space for computation time

Expected Results

Typical BGE-VL performance on FashionIQ validation:

  • Recall@10: 40-50%
  • Recall@50: 65-75%
  • MRR@10: 25-35%

State-of-the-art models achieve:

  • Recall@10: 45-55%
  • Recall@50: 70-80%

Troubleshooting

CUDA Out of Memory:

  • Reduce batch_size
  • Enable fp16
  • Use smaller model

Slow Indexing:

  • Use GPU acceleration
  • Save embeddings for reuse
  • Use approximate index

Poor Performance:

  • Check image_dir path
  • Verify data files format
  • Ensure model loaded correctly
