Principle:FlagOpen FlagEmbedding Training Data Preparation

Overview

A data formatting standard for preparing contrastive training data with query-positive-negative triplets in JSONL format for embedding and reranker fine-tuning.

Description

FlagEmbedding uses JSONL files where each line is a JSON object with:

query (str) — the query text
pos (List[str]) — list of positive passages
neg (List[str]) — list of negative passages

Optional fields:

pos_scores and neg_scores (List[float]) — for knowledge distillation
prompt (str) — for ICL embedders
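
As an illustration, the sketch below writes two records to a JSONL file using only the Python standard library. The file name train.jsonl, the helper write_jsonl, the sample texts, and the score values are all made up for this example; only the field names come from the format described above.

```python
import json

# Hypothetical sample records; only the field names follow the format above.
records = [
    {
        "query": "What is the capital of France?",
        "pos": ["Paris is the capital and largest city of France."],
        "neg": [
            "Berlin is the capital of Germany.",
            "France is known for its wine and cheese.",
        ],
    },
    {
        "query": "How do plants make food?",
        "pos": ["Plants convert sunlight into chemical energy through photosynthesis."],
        "neg": ["Mushrooms absorb nutrients from decaying matter."],
        # Optional teacher scores, one per passage (illustrative values only).
        "pos_scores": [0.95],
        "neg_scores": [0.10],
    },
]


def write_jsonl(path, rows):
    """Write one JSON object per line, which is what JSONL means."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


write_jsonl("train.jsonl", records)
```

Passing ensure_ascii=False keeps non-English passages human-readable in the file; json.dumps never emits raw newlines inside a string, so each record stays on a single line.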

This format is consumed by AbsEmbedderTrainDataset and AbsRerankerTrainDataset.
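
Because malformed lines are easiest to catch before training starts, a small pre-flight check can help. The following is a minimal sketch, not part of FlagEmbedding: validate_record and validate_jsonl_lines are hypothetical helpers, and the rule that each score list must have the same length as its passage list is an assumption made for this example.

```python
import json


def validate_record(record):
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    if not isinstance(record.get("query"), str):
        problems.append("query must be a string")
    for key in ("pos", "neg"):
        value = record.get(key)
        if not isinstance(value, list) or not all(isinstance(p, str) for p in value):
            problems.append(f"{key} must be a list of strings")
    # Score lists are optional; when present we assume one score per passage.
    for passages_key, scores_key in (("pos", "pos_scores"), ("neg", "neg_scores")):
        if scores_key not in record:
            continue
        scores = record[scores_key]
        passages = record.get(passages_key)
        if not isinstance(scores, list) or not all(
            isinstance(s, (int, float)) for s in scores
        ):
            problems.append(f"{scores_key} must be a list of numbers")
        elif isinstance(passages, list) and len(scores) != len(passages):
            problems.append(f"{scores_key} must have one score per {passages_key} passage")
    if "prompt" in record and not isinstance(record["prompt"], str):
        problems.append("prompt must be a string")
    return problems


def validate_jsonl_lines(lines):
    """Check an iterable of JSONL lines; return {line_number: [problems]}."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[number] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(record, dict):
            report[number] = ["line must be a JSON object"]
            continue
        problems = validate_record(record)
        if problems:
            report[number] = problems
    return report
```

Since an open file is an iterable of lines, the checker can be run directly on a data file, for example with open("train.jsonl", encoding="utf-8") as f: print(validate_jsonl_lines(f)). An empty dictionary means no problems were found.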

Usage

Prepare data in this format before fine-tuning any BGE embedder or reranker. It is the required first step of the data pipeline.

Theoretical Basis

Contrastive learning requires positive and negative examples for each query. The training loss (InfoNCE/contrastive) pulls each query embedding toward its positives and pushes it away from its negatives. When pos_scores and neg_scores are present, they act as soft labels from a teacher model for knowledge distillation.
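
To make the loss concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss for one query, together with a soft-label cross-entropy of the kind used in distillation. It is an illustration under stated assumptions, not FlagEmbedding's implementation: the function names, the use of cosine similarity, and the temperature of 0.05 are all choices made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Cross-entropy over [positive] + negatives with the positive as the target.

    Similarity is cosine similarity divided by an assumed temperature.
    """
    q = normalize(query)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(q, c) / temperature for c in candidates]
    return -log_softmax(logits)[0]


def distillation_loss(teacher_scores, student_logits):
    """Soft-label cross-entropy: softmax of teacher scores is the target."""
    target = [math.exp(t) for t in log_softmax(teacher_scores)]
    student_log_probs = log_softmax(student_logits)
    return -sum(t * s for t, s in zip(target, student_log_probs))


# Toy 2-d embeddings: the loss is small when the positive is nearest the query.
aligned = info_nce_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(aligned < misaligned)
```

In the toy call, the loss is lower when the positive passage lies closest to the query, which is the behaviour the contrastive objective rewards. The distillation term is lowest when the student ranks the passages the same way the teacher scores do.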
