Implementation:FlagOpen FlagEmbedding BGE Finetune Data
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Natural_Language_Processing, Information_Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Data loading and collation utilities for training BGE (BAAI General Embedding) models with contrastive learning.
Description
This module provides two key components for preparing training data for BGE embedding models:
TrainDatasetForEmbedding loads training data from JSON files in which each record holds a query together with its positive and negative passages. For each query, it samples one positive passage and enough negative passages to fill a group of train_group_size passages, then optionally prepends instruction prefixes to the query and the passages for retrieval-specific tasks. The dataset can be loaded from a single file or from a directory of files, with a configurable sampling limit per dataset.
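The sampling step described above can be sketched in plain Python. This is a minimal illustration, not the module's code: the record keys `query`, `pos`, and `neg`, the helper name `sample_group`, and the way a shortage of negatives is handled (repeating the pool before sampling) are all assumptions.

```python
import math
import random
from typing import Any, Dict, List, Optional, Tuple


def sample_group(
    record: Dict[str, Any],
    train_group_size: int,
    query_instruction: Optional[str] = None,
    passage_instruction: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[str]]:
    """Return (query, [positive, negative_1, ..., negative_{G-1}])."""
    rng = rng or random.Random()

    # Assumed record layout: {"query": str, "pos": [str, ...], "neg": [str, ...]}
    query = record["query"]
    if query_instruction is not None:
        query = query_instruction + query

    # One positive, drawn uniformly from the record's positives.
    passages = [rng.choice(record["pos"])]

    # G - 1 negatives. If the record has too few, repeat the pool first
    # (an assumption about how a shortfall is handled).
    n_neg = train_group_size - 1
    pool = list(record["neg"])
    if n_neg > 0 and not pool:
        raise ValueError("record has no negatives to sample from")
    if len(pool) < n_neg:
        pool = pool * math.ceil(n_neg / len(pool))
    passages.extend(rng.sample(pool, n_neg))

    if passage_instruction is not None:
        passages = [passage_instruction + p for p in passages]
    return query, passages
```

Passing a seeded `random.Random` makes the draw reproducible, which helps when debugging a data pipeline.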
EmbedCollator extends DataCollatorWithPadding to convert a list of (query, passages) tuples into two separately batched inputs, one for each side of the bi-encoder. It tokenizes queries and passages separately, each with its own maximum length, and also provides a score-padding helper for distillation scenarios.
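A rough sketch of that conversion follows. It is not the class itself: `collate_pairs` is a hypothetical free function, and the tokenizer is passed in as a plain callable so the example does not depend on any installed library. The keyword arguments (`padding`, `truncation`, `max_length`, `return_tensors`) are those accepted by a HuggingFace tokenizer call; that the collator uses exactly this combination is an assumption.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def collate_pairs(
    features: Sequence[Tuple[str, List[str]]],
    tokenizer: Callable[..., Any],
    query_max_len: int = 32,
    passage_max_len: int = 128,
) -> Dict[str, Any]:
    """Split (query, passages) tuples into two separately tokenized batches."""
    queries = [query for query, _ in features]
    # Flatten the per-query groups into one list, preserving group order.
    passages = [p for _, group in features for p in group]

    # Options shared by both calls; only max_length differs per side.
    shared = dict(padding=True, truncation=True, return_tensors="pt")
    return {
        "query": tokenizer(queries, max_length=query_max_len, **shared),
        "passage": tokenizer(passages, max_length=passage_max_len, **shared),
    }
```

Keeping separate length limits lets short queries be padded to a small width while passages get a larger budget, which avoids padding every query to the passage length.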
Usage
Use this module when fine-tuning embedding models on retrieval tasks with contrastive learning, where each training example consists of a query, at least one positive passage, and a set of negative passages.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/baai_general_embedding/finetune/data.py
- Lines: 1-114
Signature
class TrainDatasetForEmbedding(Dataset):
    def __init__(self, args: DataArguments, tokenizer: PreTrainedTokenizer)
    def __getitem__(self, item) -> Tuple[str, List[str]]

@dataclass
class EmbedCollator(DataCollatorWithPadding):
    query_max_len: int = 32
    passage_max_len: int = 128
    def __call__(self, features)
Import
from research.baai_general_embedding.finetune.data import TrainDatasetForEmbedding, EmbedCollator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | DataArguments | Yes | Dataset configuration, including the train_data path, the query and passage instructions, and train_group_size (passed to TrainDatasetForEmbedding) |
| tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer used to encode text |
| features | List[Tuple[str, List[str]]] | Yes | List of (query, passages) tuples produced by the dataset (passed to EmbedCollator.__call__) |
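To make the train_data input more concrete, the sketch below loads records from either a single file or a directory, with an optional cap per file. It uses only the standard library and is not the module's loader. The JSON Lines layout (one record per line), the function names, and the `max_per_dataset` parameter are assumptions made for illustration.

```python
import json
import random
from pathlib import Path
from typing import Dict, List, Optional


def read_jsonl(path: Path) -> List[Dict]:
    """Read one JSON object per non-empty line (assumed file layout)."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_training_records(
    train_data: str,
    max_per_dataset: Optional[int] = None,  # hypothetical name for the per-dataset limit
    seed: int = 0,
) -> List[Dict]:
    """Load records from a file, or from every file in a directory."""
    rng = random.Random(seed)
    root = Path(train_data)
    if root.is_dir():
        files = sorted(p for p in root.iterdir() if p.is_file())
    else:
        files = [root]

    records: List[Dict] = []
    for path in files:
        rows = read_jsonl(path)
        # Down-sample any file that exceeds the cap so no dataset dominates.
        if max_per_dataset is not None and len(rows) > max_per_dataset:
            rows = rng.sample(rows, max_per_dataset)
        records.extend(rows)
    return records
```

A per-file cap like this is one way to mix datasets of very different sizes without the largest one overwhelming the rest.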
Outputs
| Name | Type | Description |
|---|---|---|
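| query | str | Query text with optional instruction prefix (from TrainDatasetForEmbedding.__getitem__) |
| passages | List[str] | train_group_size passages (1 positive + train_group_size - 1 negatives), each with optional instruction prefix (from TrainDatasetForEmbedding.__getitem__) |
| batch | Dict | Collated batch with 'query' and 'passage' entries holding the tokenized tensors (from EmbedCollator.__call__) |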
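The shape of the collated batch follows from the table above: B queries produce B × train_group_size passages. The sketch below shows how a downstream loss could locate each query's positive, assuming the passages are flattened group by group with the positive first. That ordering and the helper name `positive_indices` are assumptions, not something this module exposes.

```python
from typing import List


def positive_indices(batch_size: int, train_group_size: int) -> List[int]:
    """Index of each query's positive in the flattened passage list."""
    return [i * train_group_size for i in range(batch_size)]


# Two queries, each with a group of three passages (positive listed first).
groups = [
    ("q0", ["pos0", "neg0a", "neg0b"]),
    ("q1", ["pos1", "neg1a", "neg1b"]),
]
flat = [p for _, group in groups for p in group]
targets = positive_indices(len(groups), train_group_size=3)
print([flat[t] for t in targets])  # ['pos0', 'pos1']
```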
Usage Examples
from transformers import AutoTokenizer
from research.baai_general_embedding.finetune.data import TrainDatasetForEmbedding, EmbedCollator
from research.baai_general_embedding.finetune.arguments import DataArguments

# Initialize dataset
args = DataArguments(
    train_data="path/to/train.json",
    train_group_size=8,
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: "
)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
dataset = TrainDatasetForEmbedding(args, tokenizer)

# Initialize collator
collator = EmbedCollator(
    tokenizer=tokenizer,
    query_max_len=32,
    passage_max_len=128
)

# Get a sample
query, passages = dataset[0]
# query: "Represent this sentence for searching relevant passages: what is machine learning?"
# passages: [positive_passage, neg1, neg2, ...]