Principle:FlagOpen FlagEmbedding Embedder Training Configuration

Overview

A dataclass-based configuration system that defines the model, data, and training arguments used to fine-tune BGE embedding models. The arguments are parsed from the command line into one dataclass per group.

Description

FlagEmbedding uses three dataclass groups:

AbsEmbedderModelArguments: Defines model_name_or_path and trust_remote_code for specifying the base model.

AbsEmbedderDataArguments: Defines train_data, train_group_size, query_max_len, passage_max_len, knowledge_distillation, and same_dataset_within_batch for controlling data loading and preprocessing.

AbsEmbedderTrainingArguments: Extends HuggingFace TrainingArguments with the embedder-specific fields temperature, negatives_cross_device, sentence_pooling_method, normalize_embeddings, and kd_loss_type.
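The three-group layout can be sketched with the standard library alone. The classes below are simplified stand-ins, not FlagEmbedding's actual definitions: the field names come from the description above, while every default value, the parse_groups helper, and the example train_data path are assumptions made for illustration. The real training group extends HuggingFace TrainingArguments, which this sketch leaves out.

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class ModelArgs:  # stand-in for AbsEmbedderModelArguments
    model_name_or_path: str
    trust_remote_code: bool = False


@dataclass
class DataArgs:  # stand-in for AbsEmbedderDataArguments; defaults are illustrative
    train_data: str
    train_group_size: int = 8
    query_max_len: int = 32
    passage_max_len: int = 128
    knowledge_distillation: bool = False
    same_dataset_within_batch: bool = False


@dataclass
class TrainArgs:  # stand-in for the embedder-specific training fields only
    temperature: float = 0.02
    negatives_cross_device: bool = False
    sentence_pooling_method: str = "cls"
    normalize_embeddings: bool = True
    kd_loss_type: str = "kl_div"  # placeholder value


def _to_bool(text):
    """Interpret common truthy strings; anything else is False."""
    return text.lower() in ("1", "true", "yes")


def parse_groups(argv, *groups):
    """Hypothetical helper: parse one flat argv into several dataclass instances."""
    parser = argparse.ArgumentParser()
    for group in groups:
        for f in fields(group):
            required = f.default is MISSING
            converter = _to_bool if f.type is bool else f.type
            parser.add_argument(
                f"--{f.name}",
                type=converter,
                required=required,
                default=None if required else f.default,
            )
    parsed = vars(parser.parse_args(argv))
    # Route each parsed value back to the dataclass that declared it.
    return tuple(g(**{f.name: parsed[f.name] for f in fields(g)}) for g in groups)


model_args, data_args, train_args = parse_groups(
    [
        "--model_name_or_path", "BAAI/bge-base-en-v1.5",
        "--train_data", "./train.jsonl",  # hypothetical path
        "--temperature", "0.05",
        "--negatives_cross_device", "true",
    ],
    ModelArgs, DataArgs, TrainArgs,
)
print(model_args, data_args, train_args, sep="\n")
```

Keeping the groups separate means the model-loading, dataset, and trainer code each receive only the fields they need, while the user still passes a single flat list of command-line flags.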

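DeepSpeed configuration files (ZeRO stage 0 or stage 1) enable distributed training. Stage 0 disables ZeRO partitioning and behaves like ordinary data parallelism; stage 1 partitions the optimizer states across devices.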
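As a rough illustration, a minimal ZeRO config of this kind could be generated as follows. This is a sketch rather than the config files shipped with FlagEmbedding: the make_zero_config helper is hypothetical, and the choice of keys is an assumption. The "auto" values rely on the HuggingFace Trainer's DeepSpeed integration, which fills them in from the training arguments.

```python
import json


def make_zero_config(stage):
    """Hypothetical helper: build a minimal DeepSpeed config for ZeRO stage 0 or 1."""
    if stage not in (0, 1):
        raise ValueError("this sketch only covers ZeRO stage 0 and stage 1")
    return {
        # Stage 0: no partitioning. Stage 1: optimizer states are partitioned.
        "zero_optimization": {"stage": stage},
        # "auto" defers each value to the HuggingFace Trainer arguments.
        "fp16": {"enabled": "auto"},
        "bf16": {"enabled": "auto"},
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


config_text = json.dumps(make_zero_config(1), indent=2)
print(config_text)
```

The resulting JSON would be saved to a file whose path is passed through the deepspeed field of TrainingArguments.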

Usage

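Configure all three argument groups before launching embedder fine-tuning: the model and data groups identify what to train and on which data, and the training group sets the optimization and loss behavior.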

Theoretical Basis

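The contrastive-learning temperature divides the similarity scores before the softmax and so controls the sharpness of the resulting distribution: lower values sharpen it. Cross-device negatives share in-batch negatives across GPUs, so each query is contrasted against passages from every device and the effective batch size grows. The pooling method (cls, mean, or last_token) determines how a single embedding vector is extracted from the transformer's token-level output.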
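A small pure-Python sketch of these ideas, under simplifying assumptions: vectors are plain lists, the positive passage is always first, padding is on the right, and the function names info_nce_loss, normalize, and pool are illustrative rather than FlagEmbedding's own.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce_loss(query, passages, temperature):
    """Cross-entropy over temperature-scaled scores; passages[0] is the positive."""
    scores = [
        sum(q * p for q, p in zip(query, passage)) / temperature
        for passage in passages
    ]
    peak = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_denominator - scores[0]


def pool(token_vectors, attention_mask, method):
    """Reduce per-token vectors to one embedding (assumes right padding)."""
    if method == "cls":
        return token_vectors[0]
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    if method == "mean":
        return [sum(column) / len(kept) for column in zip(*kept)]
    if method == "last_token":
        return kept[-1]
    raise ValueError(f"unknown pooling method: {method}")


query = normalize([1.0, 0.0])
passages = [normalize([0.9, 0.1]), normalize([0.1, 0.9])]  # positive, then negative

# The same scores give a much sharper distribution at a low temperature.
loss_soft = info_nce_loss(query, passages, temperature=1.0)
loss_sharp = info_nce_loss(query, passages, temperature=0.02)

# With cross-device negatives, `passages` would also hold passages gathered
# from the other GPUs, so each query is compared against more negatives.

tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # the last token is padding
mask = [1, 1, 0]
cls_vec = pool(tokens, mask, "cls")
mean_vec = pool(tokens, mask, "mean")
last_vec = pool(tokens, mask, "last_token")
print(loss_soft, loss_sharp, cls_vec, mean_vec, last_vec)
```

Because the positive already scores higher than the negative in this toy example, lowering the temperature drives the loss toward zero; with a hard negative that outscores the positive, the same sharpening would instead increase the loss.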
