Principle:FlagOpen FlagEmbedding Training Data Preparation
Overview
A data formatting standard for preparing contrastive training data with query-positive-negative triplets in JSONL format for embedding and reranker fine-tuning.
Description
FlagEmbedding uses JSONL files where each line is a JSON object with:
query (str): the query text
pos (List[str]): list of positive passages
neg (List[str]): list of negative passages
Optional fields:
pos_scores and neg_scores (List[float]): for knowledge distillation
prompt (str): for ICL embedders
This format is consumed by AbsEmbedderTrainDataset and AbsRerankerTrainDataset.
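The record layout described above can be sketched with the Python standard library alone. This is a minimal illustration, not FlagEmbedding code: validate_record, write_jsonl, read_jsonl and the sample records are hypothetical names and data introduced here, and the check that each score list has one entry per passage is an assumption rather than a documented requirement.

```python
import json
import os
import tempfile


def validate_record(record):
    """Return a list of problems with one record (an empty list means it passed)."""
    problems = []
    if not isinstance(record.get("query"), str):
        problems.append("query must be a string")
    for key in ("pos", "neg"):
        value = record.get(key)
        if not isinstance(value, list) or not all(isinstance(p, str) for p in value):
            problems.append(f"{key} must be a list of strings")
    # Assumption: when scores are present, there is exactly one score per passage.
    for key, score_key in (("pos", "pos_scores"), ("neg", "neg_scores")):
        if score_key in record:
            scores = record[score_key]
            passages = record.get(key)
            ok = (
                isinstance(scores, list)
                and isinstance(passages, list)
                and len(scores) == len(passages)
                and all(
                    isinstance(s, (int, float)) and not isinstance(s, bool)
                    for s in scores
                )
            )
            if not ok:
                problems.append(f"{score_key} must hold one number per {key} passage")
    if "prompt" in record and not isinstance(record["prompt"], str):
        problems.append("prompt must be a string")
    return problems


def write_jsonl(records, path):
    """Write one JSON object per line; raise ValueError on the first invalid record."""
    with open(path, "w", encoding="utf-8") as f:
        for i, record in enumerate(records):
            problems = validate_record(record)
            if problems:
                raise ValueError(f"record {i}: {'; '.join(problems)}")
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records only; the text and scores are made up for this sketch.
example_records = [
    {
        "query": "What is the capital of France?",
        "pos": ["Paris is the capital and largest city of France."],
        "neg": ["Berlin is the capital of Germany.", "Lyon is a city in France."],
    },
    {
        "query": "How do plants make food?",
        "pos": ["Plants produce glucose through photosynthesis."],
        "neg": ["Fungi absorb nutrients from their surroundings."],
        "pos_scores": [0.92],
        "neg_scores": [0.11],
        "prompt": "Given a question, retrieve passages that answer it.",
    },
]

output_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(example_records, output_path)
```

Validating before writing keeps malformed records out of the training file, so a type error surfaces at preparation time rather than partway through a fine-tuning run.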
Usage
Prepare data in this format before fine-tuning any BGE embedder or reranker; it is the required first step of the data pipeline.
Theoretical Basis
Contrastive learning requires positive and negative examples per query. The training loss (InfoNCE/contrastive) pulls query embeddings toward positives and pushes them away from negatives. Knowledge distillation scores provide soft labels from a teacher model.
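To make the roles of pos, neg and the score fields concrete, here is a small pure-Python sketch of an InfoNCE-style loss for one query, plus a soft-label loss driven by teacher scores. It is an illustration under assumptions, not FlagEmbedding's implementation: dot-product similarity, the temperature value of 0.05, the cross-entropy form of the distillation loss, and the function names are all choices made for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_softmax(scores):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Negative log-probability of the positive among [positive] + negatives.

    Assumption: similarity is a dot product over (ideally normalized) vectors.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    return -log_softmax(logits)[0]


def distillation_loss(student_scores, teacher_scores):
    """Cross-entropy between teacher soft labels and the student distribution.

    teacher_scores plays the role of pos_scores + neg_scores for one query.
    """
    target = [math.exp(v) for v in log_softmax(teacher_scores)]
    student_log_probs = log_softmax(student_scores)
    return -sum(t * lp for t, lp in zip(target, student_log_probs))


# Toy 2-d embeddings: the query points along the first axis.
query = [1.0, 0.0]
aligned_loss = info_nce_loss(query, [1.0, 0.0], [[0.0, 1.0]])
misaligned_loss = info_nce_loss(query, [0.0, 1.0], [[1.0, 0.0]])
```

In this toy setup aligned_loss is smaller than misaligned_loss, because the loss falls as the positive's similarity to the query rises relative to the negatives. The distillation term is smallest when the student's score distribution matches the teacher's.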
Related Pages
Implementation:FlagOpen_FlagEmbedding_Training_Data_JSONL_Format