Workflow:FlagOpen FlagEmbedding Embedder Finetuning

From Leeroopedia
Revision as of 11:04, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/FlagOpen_FlagEmbedding_Embedder_Finetuning.md)


Knowledge Sources
Domains: Text_Embeddings, Fine_Tuning, Information_Retrieval
Last Updated: 2026-02-09 21:30 GMT

Overview

End-to-end process for fine-tuning a BGE embedding model on custom data, from data preparation through hard negative mining, optional knowledge distillation, and distributed training.

Description

This workflow covers the complete pipeline for adapting BGE embedding models to domain-specific retrieval tasks. It supports four model families: encoder-only base models (bge-large-en-v1.5), M3 multi-functional models (bge-m3 with dense+sparse+ColBERT), LLM-based decoder-only models (bge-multilingual-gemma2 with LoRA), and in-context learning models (bge-en-icl with LoRA). The pipeline includes formatting training data as JSONL with query/pos/neg triplets, mining hard negatives using FAISS retrieval, optionally generating teacher scores for knowledge distillation, and running distributed training with DeepSpeed ZeRO.

Usage

Execute this workflow when you have a domain-specific dataset of queries and relevant passages, and need to adapt a pre-trained BGE embedder for better retrieval performance in your domain. Common scenarios include vertical search engines, enterprise document retrieval, and custom RAG pipelines where general-purpose embeddings underperform.

Execution Steps

Step 1: Install FlagEmbedding with Finetune Dependencies

Install the FlagEmbedding package with the finetune extras, which include DeepSpeed, flash-attention, and training utilities.

Key considerations:

  • Use pip install -U "FlagEmbedding[finetune]" (the quotes keep shells such as zsh from expanding the brackets)
  • Install deepspeed and flash-attn separately if the extras install fails
  • Multi-GPU training requires the NCCL backend
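A quick way to confirm the install is to check that the expected modules can be found. The sketch below uses only the standard library; the list of import names is an assumption made for illustration (for instance, flash_attn as the import name of the flash-attn package) and may need adjusting for your environment.

```python
import importlib.util

# Assumed top-level import names for the finetune dependencies (illustrative).
REQUIRED_MODULES = ["FlagEmbedding", "deepspeed", "flash_attn", "torch"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed finetune dependencies are importable.")
```

The check only locates the modules; it does not import them, so it says nothing about whether CUDA or NCCL are usable at runtime.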

Step 2: Prepare Training Data

Format training data as JSONL files where each line contains a query string, a list of positive passages, and a list of negative passages. Optionally include teacher scores for knowledge distillation and prompt/type fields for ICL models.

Data format: each line is a JSON object of the form {"query": str, "pos": List[str], "neg": List[str]}

Key considerations:

  • The neg field can be omitted initially if hard negatives will be mined
  • pos_scores and neg_scores are needed only for knowledge distillation
  • The prompt field overrides query_instruction_for_retrieval per-example
  • The type field (normal, symmetric_class, etc.) is used for ICL models
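The record layout above can be checked before training with a few lines of standard-library Python. This is a minimal sketch, not part of FlagEmbedding: validate_record and to_jsonl are hypothetical helper names, and the sample record is invented for illustration.

```python
import json


def validate_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    query = record.get("query")
    if not isinstance(query, str) or not query:
        problems.append("query must be a non-empty string")
    pos = record.get("pos")
    if not isinstance(pos, list) or not pos or not all(isinstance(p, str) for p in pos):
        problems.append("pos must be a non-empty list of strings")
        pos = []
    neg = record.get("neg", [])  # may be absent before hard negatives are mined
    if not isinstance(neg, list) or not all(isinstance(n, str) for n in neg):
        problems.append("neg must be a list of strings")
        neg = []
    # Teacher scores, when present, should align one-to-one with the passages.
    if "pos_scores" in record and len(record["pos_scores"]) != len(pos):
        problems.append("pos_scores length must match pos")
    if "neg_scores" in record and len(record["neg_scores"]) != len(neg):
        problems.append("neg_scores length must match neg")
    return problems


def to_jsonl(records):
    """Serialize records as JSONL text, one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


sample = {
    "query": "how do I reset my password",
    "pos": ["Open Settings, choose Security, then select Reset password."],
    "neg": ["Our pricing plans start at the basic tier."],
}
print(validate_record(sample))
print(to_jsonl([sample]), end="")
```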

Step 3: Mine Hard Negatives

Use the hn_mine.py script to retrieve the top-k passages for each query with an existing embedder, then sample negatives from the retrieved set (excluding positives). This produces challenging negatives that improve the training signal.

Key considerations:

  • range_for_sampling selects the rank window of the retrieved list to sample from, which controls negative difficulty (e.g., 2-200 for harder negatives, 60-300 for easier ones)
  • negative_number sets how many negatives to sample per query
  • A candidate_pool can be provided as an alternative retrieval corpus
  • GPU-accelerated FAISS search is supported via use_gpu_for_searching
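To make the sampling step concrete, here is a minimal sketch of drawing negatives from a ranked retrieval list. It is an illustration rather than the code of hn_mine.py: the function names are hypothetical, the ranked list is assumed to be already retrieved, and treating the range as a slice over zero-based ranks is an assumption about how the script interprets it.

```python
import random


def parse_range(spec):
    """Parse a 'start-end' string such as '2-200' into two integers."""
    start, end = spec.split("-")
    return int(start), int(end)


def sample_negatives(ranked_passages, positives, range_for_sampling,
                     negative_number, seed=0):
    """Sample negatives from a rank window, skipping known positives."""
    start, end = parse_range(range_for_sampling)
    window = ranked_passages[start:end]  # assumed slice semantics
    positive_set = set(positives)
    candidates = [p for p in window if p not in positive_set]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.sample(candidates, min(negative_number, len(candidates)))


# Toy ranked list standing in for the retrieved top-k passages.
ranked = [f"passage-{i}" for i in range(10)]
negatives = sample_negatives(ranked, ["passage-3"], "2-6", negative_number=2)
print(negatives)
```

Starting the window a few ranks below the top is one way to lower the chance of sampling unlabeled positives, at the cost of slightly easier negatives.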

Step 4: Generate Teacher Scores (Optional)

Use the add_reranker_score.py script to annotate each query-passage pair with a cross-encoder relevance score from a teacher reranker model. These scores enable knowledge distillation during training, where the student embedder learns to match the teacher's ranking.

Key considerations:

  • Uses a BGE reranker (e.g., bge-reranker-v2-m3) as the teacher
  • Scores are added as pos_scores and neg_scores in the JSONL
  • Set knowledge_distillation=True during training to use these scores
  • Multi-device inference is supported for faster scoring
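The annotation step amounts to attaching one score per passage to each record. The sketch below shows that shape only. The scorer is a deliberately crude, hypothetical stand-in (token overlap), not a reranker; in practice the score function would wrap the teacher cross-encoder, and that wiring is not shown here.

```python
def overlap_score(query, passage):
    """Hypothetical stand-in teacher: fraction of query tokens found in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


def add_teacher_scores(record, score_fn):
    """Return a copy of the record with pos_scores and neg_scores attached."""
    scored = dict(record)
    scored["pos_scores"] = [score_fn(record["query"], p) for p in record["pos"]]
    scored["neg_scores"] = [score_fn(record["query"], n) for n in record.get("neg", [])]
    return scored


record = {
    "query": "reset password",
    "pos": ["how to reset your password"],
    "neg": ["pricing plans"],
}
scored = add_teacher_scores(record, overlap_score)
print(scored["pos_scores"], scored["neg_scores"])
```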

Step 5: Configure Training

Set up training parameters including model selection, DeepSpeed stage, learning rate, batch size, and embedding-specific settings like pooling method, temperature, and cross-device negative sharing.

Key considerations:

  • Select the appropriate module: encoder_only.base, encoder_only.m3, decoder_only.base, or decoder_only.icl
  • DeepSpeed ZeRO Stage 0 suffices for encoder-only models; Stage 1 is recommended for LLM-based models
  • For LLM-based models, enable LoRA (use_lora=True) with appropriate rank and alpha
  • same_dataset_within_batch ensures in-batch negatives come from the same domain
  • negatives_cross_device shares negatives across GPUs for larger effective batch size
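One way to keep a configuration reviewable is to assemble the launch command from a dictionary before running it. The sketch below builds an argument list for torchrun with the FlagEmbedding.finetune.embedder.{type} module naming and the option names mentioned in this workflow. The helper name is hypothetical, rendering booleans as "--name True" is an assumption about the argument parser, and required options such as the model path, training data, output directory, and DeepSpeed config file are omitted and must be added from the module's own argument list.

```python
# Module names as listed in this step; the rest of this sketch is illustrative.
EMBEDDER_MODULES = (
    "encoder_only.base",
    "encoder_only.m3",
    "decoder_only.base",
    "decoder_only.icl",
)


def build_launch_command(module, nproc_per_node, options):
    """Assemble a torchrun argument list; nothing is executed here."""
    if module not in EMBEDDER_MODULES:
        raise ValueError(f"unknown embedder module: {module}")
    command = [
        "torchrun", "--nproc_per_node", str(nproc_per_node),
        "-m", f"FlagEmbedding.finetune.embedder.{module}",
    ]
    for name, value in options.items():
        command.extend([f"--{name}", str(value)])
    return command


# Hypothetical settings for a LoRA fine-tune of an LLM-based embedder.
options = {
    "use_lora": True,
    "same_dataset_within_batch": True,
    "negatives_cross_device": True,
    "knowledge_distillation": False,
    "save_merged_lora_model": True,
}
command = build_launch_command("decoder_only.base", 2, options)
print(" ".join(command))
```

Printing the command first makes it easy to diff configurations between runs before handing the list to a process launcher.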

Step 6: Run Distributed Training

Launch training using torchrun with the FlagEmbedding finetune module. Training uses contrastive learning with in-batch negatives, hard negatives, and optional knowledge distillation loss. The trainer extends HuggingFace Trainer with custom loss computation and gradient checkpointing.

Key considerations:

  • Launch with torchrun --nproc_per_node N -m FlagEmbedding.finetune.embedder.{type}
  • The contrastive loss uses temperature-scaled softmax over query-passage similarities
  • The KD loss type can be kl_div (standard) or m3_kd_loss (for M3 unified finetuning)
  • For M3, enable unified_finetuning and use_self_distill for multi-retrieval training
  • For LoRA models, set save_merged_lora_model=True to save the merged weights

Step 7: Validate the Fine-tuned Model

Load the fine-tuned model checkpoint using FlagAutoModel.from_finetuned() and verify it produces reasonable embeddings on held-out queries and passages. Compare similarity distributions before and after fine-tuning.

Key considerations:

  • The output directory contains checkpoint folders with model weights
  • Use the same query_instruction_for_retrieval as during training
  • For LoRA models, use the merged model path if save_merged_lora_model was enabled
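Comparing similarity distributions before and after fine-tuning can be reduced to a single number, the average margin between positive and negative similarities on held-out triples. The sketch below assumes an embed callable that maps text to a vector; in practice it would wrap the encoder loaded from the checkpoint, which is not shown here. The function names and the toy vectors are hypothetical and exist only to make the example runnable.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_margin(embed, triples):
    """Average of sim(query, positive) - sim(query, negative) over held-out triples."""
    margins = []
    for query, positive, negative in triples:
        q, p, n = embed(query), embed(positive), embed(negative)
        margins.append(cosine(q, p) - cosine(q, n))
    return sum(margins) / len(margins)


# Toy lookup tables standing in for the base and fine-tuned encoders.
base_vectors = {"q": [1.0, 0.0], "pos": [0.6, 0.8], "neg": [0.8, 0.6]}
tuned_vectors = {"q": [1.0, 0.0], "pos": [0.9, 0.1], "neg": [0.1, 0.9]}
held_out = [("q", "pos", "neg")]

before = mean_margin(base_vectors.get, held_out)
after = mean_margin(tuned_vectors.get, held_out)
print(f"margin before: {before:.3f}, after: {after:.3f}")
```

A margin that grows after fine-tuning is a sanity check rather than a full evaluation; retrieval metrics on a held-out set give a more complete picture.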
