Implementation:FlagOpen FlagEmbedding BGE Finetune Data

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Natural_Language_Processing, Information_Retrieval
Last Updated: 2026-02-09 00:00 GMT

Overview

Data loading and collation utilities for training BGE (BAAI General Embedding) models with contrastive learning.

Description

This module provides two key components for preparing training data for BGE embedding models:

TrainDatasetForEmbedding loads training data from JSON files containing query-passage pairs with positive and negative examples. For each query it samples one positive passage and train_group_size - 1 negative passages, optionally adding instruction prefixes to the query and passages for retrieval tasks. The dataset can load from a single file or from a directory of files, with a configurable sampling limit per dataset.
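The sampling step described above can be sketched in plain Python. This is a minimal illustration, not the module's code: the record keys ("query", "pos", "neg"), the helper name sample_group, and the choice to repeat the negative pool when it is too small are all assumptions made for the example.

```python
import math
import random
from typing import Dict, List, Optional, Tuple


def sample_group(
    record: Dict[str, object],
    train_group_size: int,
    query_instruction: Optional[str] = None,
    passage_instruction: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[str]]:
    """Return (query, [positive, negative_1, ..., negative_{group_size - 1}])."""
    rng = rng or random.Random()

    query = str(record["query"])
    if query_instruction is not None:
        query = query_instruction + query

    positives = list(record["pos"])  # assumed key name
    negatives = list(record["neg"])  # assumed key name
    num_negatives = train_group_size - 1

    # One positive is drawn at random and always placed first.
    passages = [rng.choice(positives)]

    if num_negatives > 0:
        if not negatives:
            raise ValueError("record has no negatives to sample from")
        # Assumption: if the pool is too small, repeat it until it is
        # large enough to sample the required number of negatives.
        if len(negatives) < num_negatives:
            negatives = negatives * math.ceil(num_negatives / len(negatives))
        passages.extend(rng.sample(negatives, num_negatives))

    if passage_instruction is not None:
        passages = [passage_instruction + p for p in passages]
    return query, passages


# Hypothetical record, for illustration only.
record = {
    "query": "what is machine learning?",
    "pos": ["Machine learning is a field of AI."],
    "neg": ["Paris is the capital of France.", "Cats sleep most of the day."],
}
query, passages = sample_group(
    record,
    train_group_size=4,
    query_instruction="Represent this sentence for searching relevant passages: ",
    rng=random.Random(0),
)
```

With train_group_size=4 the returned list has four entries: the positive first, followed by three negatives drawn from the (repeated) negative pool.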

EmbedCollator extends DataCollatorWithPadding to convert the (query, passages) tuples produced by the dataset into separate batched inputs for the bi-encoder architecture. It tokenizes queries and passages separately, each with its own maximum length (query_max_len and passage_max_len), and provides score-padding functionality for distillation scenarios.
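The collation idea (split the tuples, flatten the passage groups, then truncate and pad each side independently) can be sketched without any dependencies. The whitespace tokenizer, the PAD_ID value, and the function names below are hypothetical stand-ins for a real HuggingFace tokenizer; only the overall shape of the output, a dict with 'query' and 'passage' entries, follows this page.

```python
from typing import Dict, List, Sequence, Tuple

PAD_ID = 0  # hypothetical padding id; a real tokenizer defines its own


def toy_encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Hypothetical whitespace tokenizer: ids start at 1, truncated to max_len."""
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.lower().split()]
    return ids[:max_len]


def pad_batch(
    texts: Sequence[str], vocab: Dict[str, int], max_len: int
) -> Dict[str, List[List[int]]]:
    """Encode texts and pad them to the longest sequence in this batch."""
    encoded = [toy_encode(t, vocab, max_len) for t in texts]
    width = max(len(ids) for ids in encoded)
    return {
        "input_ids": [ids + [PAD_ID] * (width - len(ids)) for ids in encoded],
        "attention_mask": [[1] * len(ids) + [0] * (width - len(ids)) for ids in encoded],
    }


def collate(
    features: Sequence[Tuple[str, List[str]]],
    query_max_len: int = 32,
    passage_max_len: int = 128,
) -> Dict[str, Dict[str, List[List[int]]]]:
    vocab: Dict[str, int] = {}
    queries = [query for query, _ in features]
    # Flatten the per-query passage groups into one list; group order is kept.
    passages = [p for _, group in features for p in group]
    return {
        "query": pad_batch(queries, vocab, query_max_len),
        "passage": pad_batch(passages, vocab, passage_max_len),
    }


features = [
    ("what is ml", ["ml is a field of ai", "cats sleep a lot"]),
    ("capital of france", ["paris", "berlin is in germany"]),
]
batch = collate(features, query_max_len=4, passage_max_len=3)
```

Because the passage groups are flattened in order, a batch of B queries with group size G yields B query rows and B * G passage rows, and the passages for query i sit at rows i * G through (i + 1) * G - 1.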

Usage

Use this for fine-tuning embedding models on retrieval tasks with contrastive learning, where each training example consists of a query with positive and negative passage examples.
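To show why each example is laid out as one positive followed by negatives, here is a minimal sketch of a contrastive loss over a single group, assuming the positive sits at index 0. The function name, the temperature default, and the use of softmax cross-entropy are illustrative assumptions; the loss lives in the training code, not in this data module, and training may also draw additional negatives from other queries in the batch.

```python
import math
from typing import Sequence


def group_contrastive_loss(scores: Sequence[float], temperature: float = 1.0) -> float:
    """Softmax cross-entropy over one group's query-passage similarity scores.

    scores[0] is assumed to be the positive passage; the rest are negatives.
    """
    logits = [s / temperature for s in scores]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[0]


# The loss is small when the positive scores well above the negatives,
# and large when a negative outranks it.
good = group_contrastive_loss([0.9, 0.1, 0.2, 0.0], temperature=0.05)
bad = group_contrastive_loss([0.1, 0.9, 0.2, 0.0], temperature=0.05)
```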

Code Reference

Source Location

Signature

class TrainDatasetForEmbedding(Dataset):
    def __init__(self, args: DataArguments, tokenizer: PreTrainedTokenizer)
    def __getitem__(self, item) -> Tuple[str, List[str]]

@dataclass
class EmbedCollator(DataCollatorWithPadding):
    query_max_len: int = 32
    passage_max_len: int = 128
    def __call__(self, features)

Import

from research.baai_general_embedding.finetune.data import TrainDatasetForEmbedding, EmbedCollator

I/O Contract

Inputs

Name | Type | Required | Description
args | DataArguments | Yes | Configuration including train_data path, query/passage instructions, train_group_size
tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer for encoding text
features | List[Tuple] | Yes | List of (query, passages) tuples from dataset

Outputs

Name | Type | Description
query | str | Query text with optional instruction prefix
passages | List[str] | List of passages (1 positive + N-1 negatives) with optional instruction prefix
batch | Dict | Collated batch with 'query' and 'passage' tokenized tensors

Usage Examples

from transformers import AutoTokenizer
from research.baai_general_embedding.finetune.data import TrainDatasetForEmbedding, EmbedCollator
from research.baai_general_embedding.finetune.arguments import DataArguments

# Initialize dataset
args = DataArguments(
    train_data="path/to/train.json",
    train_group_size=8,
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: "
)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
dataset = TrainDatasetForEmbedding(args, tokenizer)

# Initialize collator
collator = EmbedCollator(
    tokenizer=tokenizer,
    query_max_len=32,
    passage_max_len=128
)

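# Get a sample
query, passages = dataset[0]
# query: "Represent this sentence for searching relevant passages: what is machine learning?"
# passages: [positive_passage, neg1, neg2, ...]  (train_group_size entries in total)

# Collate a few samples into a batch
batch = collator([dataset[0], dataset[1]])
# batch["query"] and batch["passage"] hold the separately tokenized inputs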
