Implementation:FlagOpen FlagEmbedding BGE M3 Run
| Knowledge Sources | |
|---|---|
| Domains | Embedding Training, Multi-Vector Retrieval, Cross-Device Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Training script for the BGE-M3 model, supporting unified fine-tuning and cross-device negative sampling.
Description
This is the main training script for the BGE-M3 (Multi-Functionality, Multi-Linguality, Multi-Granularity) embedding model. It implements:
- distributed training with cross-device negatives for contrastive learning
- same-task batching for a consistent training signal
- unified fine-tuning across dense, sparse (lexical), and multi-vector (ColBERT) representations
- self-distillation from a teacher model
- dynamic data refresh at each epoch

The script uses a custom BiTrainer with support for sub-batch processing, integrates with the HuggingFace Transformers training infrastructure, and offers an option to freeze position embeddings for length extrapolation.
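To make the role of cross-device negatives concrete, here is a minimal sketch of a temperature-scaled contrastive (InfoNCE-style) loss in plain Python. It is an illustration under stated assumptions, not the script's implementation: the helper names `dot` and `info_nce_loss` and the toy vectors are hypothetical, and the cross-device gather is simulated by concatenating per-device passage lists.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(queries, passages, targets, temperature=0.02):
    """Mean cross-entropy of each query against all candidate passages.

    targets[i] is the index in `passages` of the positive for queries[i].
    """
    total = 0.0
    for q, target in zip(queries, targets):
        logits = [dot(q, p) / temperature for p in passages]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[target]
    return total / len(queries)

# Two simulated devices, each holding one query and its positive passage.
device_queries = [[[1.0, 0.0]], [[0.7, 0.3]]]
device_passages = [[[0.9, 0.1]], [[0.8, 0.2]]]

# Local only: the query on device 0 sees just its own positive.
local_loss = info_nce_loss(device_queries[0], device_passages[0], [0])

# Cross-device: passages from every device are pooled, so the passage
# held by device 1 becomes an extra negative for the query on device 0.
all_passages = [p for dev in device_passages for p in dev]
global_loss = info_nce_loss(device_queries[0], all_passages, [0])
```

With a single candidate the local loss is zero, while the pooled candidate set gives a non-zero loss. This is the sense in which sharing passages across devices strengthens the contrastive signal without enlarging the per-device batch.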
Usage
Use this script to train the BGE-M3 model from scratch or fine-tune it on new data, to build multi-representation retrieval systems (dense + sparse + multi-vector), or to run distributed training experiments with cross-device negative sampling. It is designed for large-scale embedding model training.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/BGE_M3/run.py
- Lines: 1-155
Signature
def main():
    """Main training function for BGE-M3 model"""

class TrainerCallbackForDataRefresh(TrainerCallback):
    def __init__(self, train_dataset):
        pass

    def on_epoch_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs):
        """Refresh dataset at end of each epoch"""
Import
# Run as a script (train_data is a required argument)
# python run.py --model_name_or_path BAAI/bge-m3 --train_data ./train_data --output_dir ./output
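The `TrainerCallbackForDataRefresh` signature above suggests a callback that holds the training dataset and refreshes it when an epoch ends. The following is a minimal sketch of that pattern, not the script's code: `ToyDataset`, its `refresh_epoch` method, and `DataRefreshCallback` are hypothetical names, and the refresh here simply reshuffles the sample order. The import falls back to `object` so the sketch also runs where `transformers` is not installed.

```python
import random

try:
    from transformers import TrainerCallback
except ImportError:  # fallback so the sketch runs without transformers
    TrainerCallback = object

class ToyDataset:
    """Hypothetical dataset whose sample order is redrawn every epoch."""

    def __init__(self, n, seed=0):
        self.order = list(range(n))
        self.epoch = 0
        self.seed = seed

    def refresh_epoch(self):
        # Assumed refresh behaviour: advance the epoch and reshuffle.
        self.epoch += 1
        random.Random(self.seed + self.epoch).shuffle(self.order)

class DataRefreshCallback(TrainerCallback):
    """Keeps a reference to the dataset and refreshes it after each epoch."""

    def __init__(self, train_dataset):
        self.train_dataset = train_dataset

    def on_epoch_end(self, args, state, control, **kwargs):
        self.train_dataset.refresh_epoch()

dataset = ToyDataset(8)
callback = DataRefreshCallback(dataset)
# The Trainer would invoke this hook itself; it is called by hand here.
callback.on_epoch_end(args=None, state=None, control=None)
```

In a real run the callback would be passed to the trainer through its `callbacks` argument, so the hook fires without manual calls.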
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | Yes | Pre-trained model name or path |
| train_data | str | Yes | Training data path (file or directory) |
| output_dir | str | Yes | Directory to save model checkpoints |
| negatives_cross_device | bool | No | Enable cross-device negative sampling |
| unified_finetuning | bool | No | Enable unified multi-representation training |
| use_self_distill | bool | No | Enable self-distillation |
| temperature | float | No | Temperature for contrastive loss |
| query_max_len | int | No | Maximum query length |
| passage_max_len | int | No | Maximum passage length |
| fix_position_embedding | bool | No | Freeze position embeddings |
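The inputs in the table can be mirrored with a small standard-library parser, which is handy for validating a command line before launching a long job. This is a stand-in sketch only: it does not reproduce how `run.py` parses its arguments, and the default values below are placeholders borrowed from the usage examples on this page, not the script's actual defaults.

```python
import argparse

def build_parser():
    """Stand-in parser covering the inputs listed in the table above."""
    parser = argparse.ArgumentParser(description="Sketch of run.py inputs")
    # Required inputs
    parser.add_argument("--model_name_or_path", required=True)
    parser.add_argument("--train_data", required=True)
    parser.add_argument("--output_dir", required=True)
    # Optional boolean switches, off unless the flag is present
    parser.add_argument("--negatives_cross_device", action="store_true")
    parser.add_argument("--unified_finetuning", action="store_true")
    parser.add_argument("--use_self_distill", action="store_true")
    parser.add_argument("--fix_position_embedding", action="store_true")
    # Optional numeric inputs (placeholder defaults, see lead-in)
    parser.add_argument("--temperature", type=float, default=0.02)
    parser.add_argument("--query_max_len", type=int, default=512)
    parser.add_argument("--passage_max_len", type=int, default=512)
    return parser

args = build_parser().parse_args([
    "--model_name_or_path", "BAAI/bge-m3",
    "--train_data", "./train_data",
    "--output_dir", "./output_m3",
    "--negatives_cross_device",
    "--unified_finetuning",
])
```

Omitting any of the three required inputs makes `parse_args` exit with an error, which matches the Required column of the table.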
Outputs
| Name | Type | Description |
|---|---|---|
| trained_model | Model | Fine-tuned BGE-M3 model saved to output_dir |
| tokenizer | Tokenizer | Saved tokenizer |
| training_logs | Logs | Training metrics and checkpoints |
Usage Examples
# Example 1: Basic training command
# python run.py \
#     --model_name_or_path BAAI/bge-m3 \
#     --train_data ./train_data \
#     --output_dir ./output_m3 \
#     --per_device_train_batch_size 32 \
#     --learning_rate 1e-5 \
#     --num_train_epochs 3 \
#     --query_max_len 512 \
#     --passage_max_len 512 \
#     --temperature 0.02 \
#     --negatives_cross_device \
#     --unified_finetuning
# Example 2: Advanced training with self-distillation
# python run.py \
#     --model_name_or_path BAAI/bge-m3 \
#     --train_data ./train_data \
#     --output_dir ./output_m3_distill \
#     --negatives_cross_device \
#     --unified_finetuning \
#     --use_self_distill \
#     --self_distill_start_step 1000 \
#     --colbert_dim 1024 \
#     --temperature 0.02 \
#     --fix_position_embedding \
#     --gradient_checkpointing
# Example 3: Multi-GPU distributed training
# torchrun --nproc_per_node 4 run.py \
#     --model_name_or_path BAAI/bge-m3 \
#     --train_data ./train_data \
#     --output_dir ./output_m3_dist \
#     --per_device_train_batch_size 64 \
#     --negatives_cross_device \
#     --unified_finetuning \
#     --fp16 \
#     --dataloader_num_workers 8
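As a rough guide to what `--negatives_cross_device` changes in Example 3, the sketch below counts how many passages each query is scored against. It rests on assumptions not stated on this page: every query contributes a fixed number of passages, and passages from all processes are pooled when the flag is set. The function and the `passages_per_query` parameter are hypothetical names used for illustration.

```python
def candidates_per_query(per_device_batch_size, world_size,
                         passages_per_query=1, cross_device=True):
    """Number of passages each query is compared against in one step.

    passages_per_query is an assumed count of passages (the positive plus
    any sampled negatives) that each query brings into the batch.
    """
    devices = world_size if cross_device else 1
    return per_device_batch_size * passages_per_query * devices

# Example 3 settings: 4 processes, 64 queries per device.
local = candidates_per_query(64, 4, cross_device=False)  # 64
shared = candidates_per_query(64, 4)                     # 256
```

Under these assumptions, pooling across four devices quadruples the candidate set per query while per-device memory for the forward pass stays the same.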