Implementation:FlagOpen FlagEmbedding BGE Coder CoIR Eval
| Knowledge Sources | |
|---|---|
| Domains | Code Retrieval, Benchmark Evaluation, Information Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Evaluation script for running CoIR (Code Information Retrieval) benchmark tasks with FlagEmbedding models.
Description
This module evaluates code retrieval models on the CoIR benchmark. It supports encoder-only and decoder-only embedding models through the FlagModel and FlagLLMModel wrappers, encodes queries and corpus documents with custom logic that handles special instructions, and computes NDCG@10 across multiple code retrieval tasks, including the CodeSearchNet variants. The script applies task-specific instructions to improve retrieval performance, aggregates results across the language-specific CodeSearchNet tasks, and saves per-task and overall results as JSON.
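For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@10 for a single query, using the textbook definition with linear gain. It is not taken from the script, which may well delegate metric computation to the benchmark library; the function name `ndcg_at_k` and its arguments are illustrative assumptions.

```python
import math
from typing import Dict, List


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """Textbook NDCG@k with linear gain for one query (illustrative, not the script's code).

    ranked_doc_ids: document ids ordered by descending retrieval score.
    relevance: graded relevance judgments for this query (missing ids count as 0).
    """
    # Discounted cumulative gain of the top-k retrieved documents.
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
    )
    # Ideal DCG: the same sum over the best possible ordering of the judged documents.
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0
```

A benchmark-level score is then typically the mean of this value over all queries in a task.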
Usage
Use this script to evaluate embedding models on the CoIR code retrieval benchmark, to compare encoder-only and decoder-only architectures on code search tasks, or to generate evaluation reports with NDCG metrics. It is designed for the BGE-Coder model evaluation pipeline.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_Coder/evaluation/coir_eval/main.py
- Lines: 1-167
Signature
def get_model(model_args: COIREvalModelArgs):
    """Initialize FlagModel or FlagLLMModel based on configuration"""

def main(
    eval_args: COIREvalArgs,
    model_args: COIREvalModelArgs
):
    """Run CoIR evaluation on specified tasks"""

class CustomFlagModel:
    """Wrapper for FlagModel with dict-based input handling"""
    def encode_queries(self, queries, show_progress_bar, convert_to_tensor, **kwargs):
        pass

    def encode_corpus(self, corpus, show_progress_bar, convert_to_tensor, **kwargs):
        pass
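The signature above elides the method bodies. As a rough illustration of what "dict-based input handling" could look like, the sketch below wraps a stand-in embedder and flattens corpus entries before encoding. Everything here is an assumption made for illustration: `StubEmbedder`, `DictInputWrapper`, the `title`/`text` keys, and the title-plus-text concatenation are hypothetical and not confirmed by the source.

```python
from typing import Dict, List


class StubEmbedder:
    """Hypothetical stand-in for a real embedding model: returns a toy 2-d vector per text."""

    def encode(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t)), float(t.count(" "))] for t in texts]


class DictInputWrapper:
    """Illustrative wrapper (not the script's CustomFlagModel) exposing the two encode methods."""

    def __init__(self, model: StubEmbedder):
        self.model = model

    def encode_queries(self, queries: List[str], show_progress_bar: bool = False,
                       convert_to_tensor: bool = False, **kwargs) -> List[List[float]]:
        # Queries are assumed to arrive as plain strings.
        return self.model.encode(list(queries))

    def encode_corpus(self, corpus: List[Dict[str, str]], show_progress_bar: bool = False,
                      convert_to_tensor: bool = False, **kwargs) -> List[List[float]]:
        # Assumption: each corpus entry is a dict with an optional "title" and a "text" field.
        texts = [f"{doc.get('title', '')} {doc['text']}".strip() for doc in corpus]
        return self.model.encode(texts)


wrapper = DictInputWrapper(StubEmbedder())
corpus_embeddings = wrapper.encode_corpus([
    {"title": "add", "text": "def add(a, b): return a + b"},
    {"text": "x"},
])
query_embeddings = wrapper.encode_queries(["sum two numbers"])
```

Keeping the retrieval-facing interface on a thin wrapper like this lets the underlying model be swapped without touching the evaluation loop.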
Import
from main import get_model, main
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| embedder_name_or_path | str | Yes | Model name or path for embedding |
| embedder_model_class | str | Yes | "encoder-only-base" or "decoder-only-base" |
| tasks | List[str] or str | Yes | CoIR task name(s) to evaluate |
| output_dir | str | Yes | Directory to save evaluation results |
| use_special_instructions | bool | No | Use task-specific instructions |
| embedder_batch_size | int | No | Batch size for encoding |
| normalize_embeddings | bool | No | Whether to normalize embeddings |
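Because `tasks` accepts either a single string or a list, and `embedder_model_class` accepts only the two values listed, callers may want to normalize and check these inputs before running an evaluation. The helpers below are a hypothetical sketch of such checks; `normalize_tasks` and `check_model_class` are not functions from the script, and the script's own argument handling may differ.

```python
from typing import List, Union

# The two model classes listed in the inputs table.
SUPPORTED_MODEL_CLASSES = ("encoder-only-base", "decoder-only-base")


def normalize_tasks(tasks: Union[str, List[str]]) -> List[str]:
    """Return tasks as a list, wrapping a single task name (hypothetical helper)."""
    if isinstance(tasks, str):
        return [tasks]
    return list(tasks)


def check_model_class(name: str) -> str:
    """Reject values other than the two documented model classes (hypothetical helper)."""
    if name not in SUPPORTED_MODEL_CLASSES:
        raise ValueError(
            f"embedder_model_class must be one of {SUPPORTED_MODEL_CLASSES}, got {name!r}"
        )
    return name
```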
Outputs
| Name | Type | Description |
|---|---|---|
| task_results | dict | Per-task NDCG@10 scores saved as JSON files |
| overall_results | dict | Aggregated results including CodeSearchNet averages |
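The aggregation of language-specific CodeSearchNet scores can be pictured as a simple average added alongside the per-task scores. The sketch below assumes tasks are grouped by a `CodeSearchNet-` name prefix and that the average is stored under a `CodeSearchNet_avg` key; both the grouping rule and the key name are assumptions, and the scores in the example call are made-up placeholders rather than real results.

```python
from statistics import mean
from typing import Dict


def aggregate_results(task_scores: Dict[str, float],
                      prefix: str = "CodeSearchNet-") -> Dict[str, float]:
    """Copy per-task scores and add an average over tasks sharing a prefix (illustrative)."""
    overall = dict(task_scores)
    grouped = [score for task, score in task_scores.items() if task.startswith(prefix)]
    if grouped:
        # Hypothetical key name; the script's actual output keys are not shown in the source.
        overall["CodeSearchNet_avg"] = mean(grouped)
    return overall


# Placeholder scores for illustration only.
overall = aggregate_results({
    "CodeSearchNet-python": 0.5,
    "CodeSearchNet-java": 0.7,
    "apps": 0.2,
})
```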
Usage Examples
# Example: Run CoIR evaluation from command line
# python main.py \
# --embedder_name_or_path BAAI/bge-base-en-v1.5 \
# --embedder_model_class encoder-only-base \
# --tasks CodeSearchNet-python CodeSearchNet-java \
# --output_dir ./results \
# --embedder_batch_size 256 \
# --use_special_instructions
# Example: Programmatic usage
from arguments import COIREvalArgs, COIREvalModelArgs
from main import main
# Setup arguments
eval_args = COIREvalArgs(
    tasks=["CodeSearchNet-python"],
    output_dir="./eval_results",
    use_special_instructions=True
)
model_args = COIREvalModelArgs(
    embedder_name_or_path="BAAI/bge-base-en-v1.5",
    embedder_model_class="encoder-only-base",
    embedder_batch_size=256,
    normalize_embeddings=True
)
# Run evaluation
main(eval_args, model_args)
# Results will be saved to:
# - ./eval_results/bge-base-en-v1.5/CodeSearchNet-python.json
# - ./eval_results/bge-base-en-v1.5/OVERALL-results.json