Principle:FlagOpen FlagEmbedding Embedder Training Configuration
Overview
A configuration system that defines model, data, and training hyperparameters for fine-tuning BGE embedding models using dataclass-based argument parsing.
Description
FlagEmbedding uses three dataclass groups:
- AbsEmbedderModelArguments: defines model_name_or_path and trust_remote_code, which specify the base model.
- AbsEmbedderDataArguments: defines train_data, train_group_size, query_max_len, passage_max_len, knowledge_distillation, and same_dataset_within_batch, which control data loading and preprocessing.
- AbsEmbedderTrainingArguments: extends HuggingFace TrainingArguments with temperature, negatives_cross_device, sentence_pooling_method, normalize_embeddings, and kd_loss_type.
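To make the three-way split concrete, here is a minimal sketch that uses only the Python standard library. The classes ModelArgs, DataArgs, and TrainArgs are hypothetical stand-ins, not the FlagEmbedding classes: the field names come from the list above, while every type and default value is an assumption chosen for illustration.

```python
from dataclasses import dataclass, asdict

# Hypothetical stand-ins for the three argument groups.
# Field names follow the list above; types and defaults are assumptions.


@dataclass
class ModelArgs:
    model_name_or_path: str          # hub ID or local path of the base model
    trust_remote_code: bool = False  # allow custom model code from the hub


@dataclass
class DataArgs:
    train_data: str                  # path to the training data (assumed to be one path)
    train_group_size: int = 8        # assumed default
    query_max_len: int = 32          # assumed default, in tokens
    passage_max_len: int = 128       # assumed default, in tokens
    knowledge_distillation: bool = False
    same_dataset_within_batch: bool = False


@dataclass
class TrainArgs:
    temperature: float = 0.02        # assumed default
    negatives_cross_device: bool = False
    sentence_pooling_method: str = "cls"
    normalize_embeddings: bool = True
    kd_loss_type: str = "kl_div"     # illustrative value only

    def __post_init__(self) -> None:
        # Fail early on values that cannot produce a meaningful run.
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if self.sentence_pooling_method not in ("cls", "mean", "last_token"):
            raise ValueError("unknown pooling method")


model_args = ModelArgs(model_name_or_path="BAAI/bge-base-en-v1.5")
data_args = DataArgs(train_data="./train.jsonl")  # placeholder path
train_args = TrainArgs(negatives_cross_device=True)
print(asdict(model_args), asdict(data_args), asdict(train_args), sep="\n")
```

Keeping the groups separate means each part of the pipeline (model loading, dataset construction, the training loop) receives only the settings it needs.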
DeepSpeed configuration files (ZeRO stage 0 or 1) can be supplied to enable distributed training.
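As an illustration of what such a file might contain, the sketch below builds a small ZeRO config and serializes it to JSON. The helper make_zero_config is hypothetical, and the "auto" values assume the HuggingFace Trainer integration, which fills them in from the training arguments. The configs shipped with FlagEmbedding may contain more settings than shown here.

```python
import json


def make_zero_config(stage: int) -> dict:
    """Build a minimal DeepSpeed config dict for ZeRO stage 0 or 1 (hypothetical helper)."""
    if stage not in (0, 1):
        raise ValueError("this sketch only covers ZeRO stage 0 and 1")
    return {
        # Stage 0 disables ZeRO partitioning; stage 1 partitions optimizer states.
        "zero_optimization": {"stage": stage},
        # "auto" assumes the HuggingFace Trainer resolves these values.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
    }


config_json = json.dumps(make_zero_config(1), indent=2)
print(config_json)
```

The printed JSON can be saved to a file and passed to the trainer through the deepspeed option inherited from HuggingFace TrainingArguments.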
Usage
Before running embedder fine-tuning, configure all three argument groups.
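One way to keep the three groups organized before launching a run is to hold each as a dictionary and flatten them into command-line flags. The sketch below assumes the training entry point accepts `--name value` pairs, as dataclass-based parsers commonly do. The helper to_cli_flags and all values are hypothetical, and the launcher command itself is deliberately left out.

```python
from typing import Any, Dict, List


def to_cli_flags(*groups: Dict[str, Any]) -> List[str]:
    """Flatten argument groups into ["--name", "value", ...] (hypothetical helper)."""
    flags: List[str] = []
    seen = set()
    for group in groups:
        for name, value in group.items():
            # The same flag in two groups would be ambiguous once flattened.
            if name in seen:
                raise ValueError(f"duplicate argument: {name}")
            seen.add(name)
            flags += [f"--{name}", str(value)]
    return flags


# Illustrative values only; paths are placeholders.
model_group = {"model_name_or_path": "BAAI/bge-base-en-v1.5"}
data_group = {"train_data": "./train.jsonl", "train_group_size": 8}
train_group = {"temperature": 0.02, "sentence_pooling_method": "cls"}

flags = to_cli_flags(model_group, data_group, train_group)
print(" ".join(flags))
```

Rejecting duplicate names up front catches the case where one setting is accidentally defined in two groups with different values.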
Theoretical Basis
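The contrastive learning temperature controls the sharpness of the similarity distribution: similarity scores are divided by the temperature before the softmax, so a lower value concentrates probability on the highest-scoring passages. Cross-device negatives share in-batch negatives across GPUs, so each query is contrasted against passages from every device and the effective batch size increases. The pooling method (cls, mean, or last_token) determines how the embedding vector is extracted from the transformer output.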
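These three ideas can be sketched in pure Python. This is a simplified illustration under stated assumptions, not FlagEmbedding's implementation: all function names are hypothetical, the loss is a plain softmax cross-entropy over one query's scores, cross-device sharing is modeled by concatenating score lists, and last_token pooling assumes right-padded sequences.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(scores: Sequence[float], temperature: float) -> List[float]:
    """Divide scores by the temperature, then apply a numerically stable softmax."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(scores: Sequence[float], positive_index: int, temperature: float) -> float:
    """Negative log-probability of the positive passage among all candidates."""
    probs = softmax_with_temperature(scores, temperature)
    return -math.log(probs[positive_index])


def pool(hidden_states: Sequence[Sequence[float]], attention_mask: Sequence[int], method: str) -> List[float]:
    """Reduce per-token vectors to one embedding (mask: 1 = real token, 0 = padding)."""
    if method == "cls":
        return list(hidden_states[0])
    if method == "mean":
        kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
        return [sum(column) / len(kept) for column in zip(*kept)]
    if method == "last_token":
        # Assumes right padding: the last real token has the highest unmasked index.
        last = max(i for i, keep in enumerate(attention_mask) if keep)
        return list(hidden_states[last])
    raise ValueError(f"unknown pooling method: {method}")


# Index 0 is the positive passage; the rest are in-batch negatives.
local_scores = [0.9, 0.3, 0.1]
sharp = softmax_with_temperature(local_scores, temperature=0.02)
flat = softmax_with_temperature(local_scores, temperature=1.0)

# Scores against passages gathered from another device add more negatives.
gathered_scores = local_scores + [0.5, 0.4]
loss_local = contrastive_loss(local_scores, 0, temperature=0.02)
loss_gathered = contrastive_loss(gathered_scores, 0, temperature=0.02)

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [1, 1, 0]
embeddings = {m: pool(tokens, mask, m) for m in ("cls", "mean", "last_token")}
```

With the lower temperature the positive passage receives nearly all of the probability mass, and adding gathered negatives can only raise the loss for a fixed positive score, which is what makes the training signal harder.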