Principle:FlagOpen FlagEmbedding Training Data Preparation

From Leeroopedia
Revision as of 17:36, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/FlagOpen_FlagEmbedding_Training_Data_Preparation.md)

Overview

A data formatting standard for preparing contrastive training data with query-positive-negative triplets in JSONL format for embedding and reranker fine-tuning.

Description

FlagEmbedding uses JSONL files where each line is a JSON object with the following required fields:

  • query (str) — the query text
  • pos (List[str]) — list of positive passages
  • neg (List[str]) — list of negative passages

Optional fields:

  • pos_scores and neg_scores (List[float]) — for knowledge distillation
  • prompt (str) — for ICL embedders

This format is consumed by AbsEmbedderTrainDataset and AbsRerankerTrainDataset.
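As an illustration of the format described above, the following is a minimal sketch that builds one record, checks its field types, and serializes it as a single JSONL line. The helper `validate_record` is hypothetical and not part of FlagEmbedding. The rule that each score list has the same length as its passage list is an assumption, not something this page states.

```python
import json
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one training record (empty if none).

    Hypothetical helper for illustration; not a FlagEmbedding API.
    """
    problems: List[str] = []
    if not isinstance(record.get("query"), str):
        problems.append("query must be a string")
    for key in ("pos", "neg"):
        value = record.get(key)
        if not isinstance(value, list) or not all(isinstance(p, str) for p in value):
            problems.append(f"{key} must be a list of strings")
    # Assumption: each score list has one entry per passage in its partner list.
    for passages_key, scores_key in (("pos", "pos_scores"), ("neg", "neg_scores")):
        if scores_key in record:
            passages = record.get(passages_key)
            scores = record[scores_key]
            if (
                not isinstance(scores, list)
                or not isinstance(passages, list)
                or len(scores) != len(passages)
            ):
                problems.append(f"{scores_key} must match {passages_key} in length")
    if "prompt" in record and not isinstance(record["prompt"], str):
        problems.append("prompt must be a string")
    return problems


# Made-up example data: one query, one positive, two negatives, teacher scores.
example = {
    "query": "What is contrastive learning?",
    "pos": ["Contrastive learning trains a model to tell similar pairs from dissimilar ones."],
    "neg": ["The Eiffel Tower is in Paris.", "JSON is a text data format."],
    "pos_scores": [0.92],
    "neg_scores": [0.05, 0.11],
}

assert validate_record(example) == []

# One JSON object per line: json.dumps emits no newlines by default.
jsonl_line = json.dumps(example, ensure_ascii=False)
```

Writing `jsonl_line` followed by a newline for each record produces a file in the layout this page describes. Running a check of this kind before training can surface malformed records early.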

Usage

Prepare data in this format before fine-tuning any BGE embedder or reranker. It is required as the first step of the data pipeline.

Theoretical Basis

Contrastive learning requires both positive and negative examples for each query. The training loss (InfoNCE, a contrastive loss) pulls each query embedding toward its positives and pushes it away from its negatives. When pos_scores and neg_scores are present, they serve as soft labels from a teacher model for knowledge distillation.
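The two ideas in this section can be sketched in plain Python. This is the generic textbook form of InfoNCE and of a soft-label cross-entropy, not a confirmed copy of FlagEmbedding's loss code. The function names, the dot-product similarity, and the temperature value of 0.05 are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    peak = max(values)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_sum for v in values]


def softmax(values: Sequence[float]) -> List[float]:
    return [math.exp(v) for v in log_softmax(values)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, for illustration only
) -> float:
    """InfoNCE: negative log-probability of the positive among all candidates."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    return -log_softmax(logits)[0]


def soft_label_loss(
    teacher_scores: Sequence[float], student_logits: Sequence[float]
) -> float:
    """Cross-entropy between the teacher's softmax and the student's softmax."""
    targets = softmax(teacher_scores)
    log_probs = log_softmax(student_logits)
    return -sum(t * lp for t, lp in zip(targets, log_probs))


# Toy 2-d embeddings: the loss is small when the query aligns with the
# positive and large when it aligns with a negative instead.
aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this sketch, `teacher_scores` would be the concatenated pos_scores and neg_scores for a query, and `student_logits` would be the student model's similarity scores for the same passages in the same order.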

Related Pages

Implementation:FlagOpen_FlagEmbedding_Training_Data_JSONL_Format
