Principle:FlagOpen FlagEmbedding Contrastive Embedding Training

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Machine Learning, Information Retrieval, Contrastive Learning, Text Embeddings
Last Updated	2026-02-09 00:00 GMT

Overview

Bi-encoder contrastive learning for text embeddings: queries and documents are encoded independently (often by a single shared encoder), and the model is trained with in-batch negatives and hard negative mining to produce discriminative representations.

Description

This principle forms the foundation of the BGE (BAAI General Embedding) family of models. It employs a dual-encoder architecture where queries and documents are independently encoded into a shared embedding space. Training uses contrastive learning with InfoNCE loss, treating other examples in the batch as negatives. Hard negative mining identifies challenging negative examples that are semantically similar but incorrect, forcing the model to learn fine-grained distinctions. The approach scales efficiently to large datasets through in-batch sampling and supports various retrieval tasks (web search, question answering, semantic similarity) through multi-task training. The resulting embeddings can be compared with simple cosine similarity at inference time, enabling fast retrieval with approximate nearest neighbor search.
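As a rough illustration of the inference-time comparison described above, the following sketch ranks documents by cosine similarity with a brute-force scan. The vectors are hand-written toy values rather than real BGE embeddings, and the function names are hypothetical. A production system would typically replace the scan with an approximate nearest neighbor index.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-dimensional "embeddings" (illustrative values only).
docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
top = retrieve([0.9, 0.1], docs, k=2)
```

Because queries and documents are encoded independently, the document vectors can be computed once and reused for every query.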

Usage

Use this principle when:

Training general-purpose text embedding models
Building bi-encoder retrieval systems for semantic search
Creating embeddings for document clustering or similarity tasks
Developing foundational retrieval models that can be fine-tuned for specific domains

Theoretical Basis

The contrastive training framework consists of:

Dual Encoders:

- Query encoder: q = f_q(Q) where q ∈ R^d
- Document encoder: d = f_d(D) where d ∈ R^d
- Often f_q = f_d (shared encoder) for efficiency

InfoNCE Loss:

- Similarity score: s(q, d) = (q · d) / (||q|| ||d||)
- Loss: L = -log(exp(s(q, d+)/τ) / (exp(s(q, d+)/τ) + Σ_i exp(s(q, d_i^-)/τ)))
- Where d+ is positive document, d_i^- are negatives, τ is temperature
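The loss above can be sketched in plain Python as follows. This is a minimal illustration, not the FlagEmbedding implementation: the function names and the default temperature of 0.05 are assumptions chosen for the example, and a real training loop would compute the same quantity on batched tensors.

```python
import math

def cosine(u, v):
    """s(q, d) = (q . d) / (||q|| ||d||)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query; temperature default is an assumed value."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # -log(exp(l_pos) / sum_j exp(l_j)) = log(sum_j exp(l_j)) - l_pos
    return log_denominator - logits[0]

def in_batch_info_nce(queries, documents, temperature=0.05):
    """Mean loss over a batch in which documents[i] is the positive for
    queries[i] and every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(queries):
        negatives = [d for j, d in enumerate(documents) if j != i]
        losses.append(info_nce(q, documents[i], negatives, temperature))
    return sum(losses) / len(losses)
```

A lower temperature sharpens the softmax, so the loss penalizes negatives that score close to the positive more heavily.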

Hard Negative Mining:

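- Retrieve challenging negatives: d^- = argmax_{d∈corpus, d≠d+} s(q, d), or more generally the top-k highest-scoring non-positive documents
- Mix with random negatives for balanced training
- Improves discriminative power by forcing the model to separate near-miss documents from true positives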
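The mining step can be sketched as below, assuming the corpus has already been embedded and L2-normalized so that the dot product equals cosine similarity. The function name, the `num_hard` and `num_random` parameters, and the fixed seed are hypothetical choices for this example and do not reflect any particular FlagEmbedding script.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mine_negatives(query_vec, corpus_vecs, positive_ids, num_hard=2, num_random=1, seed=0):
    """Return (hard, randoms): the highest-scoring non-positive document ids,
    plus a few ids sampled at random from the remaining non-positives."""
    candidates = [i for i in range(len(corpus_vecs)) if i not in positive_ids]
    # Rank non-positive documents by similarity to the query, best first.
    ranked = sorted(candidates, key=lambda i: dot(query_vec, corpus_vecs[i]), reverse=True)
    hard = ranked[:num_hard]
    remaining = ranked[num_hard:]
    rng = random.Random(seed)
    randoms = rng.sample(remaining, min(num_random, len(remaining)))
    return hard, randoms

# Toy normalized vectors; document 0 is the labeled positive.
corpus = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
hard, randoms = mine_negatives([1.0, 0.0], corpus, positive_ids={0})
```

The top-ranked candidates may include unlabeled relevant documents (false negatives), which is one reason to mix in random negatives rather than rely on the top of the ranking alone.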

Training Strategy:

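- Use large batch sizes (hundreds to thousands of examples) for diverse negatives
- Gradient accumulation for memory efficiency (accumulation alone does not add in-batch negatives, since each micro-batch is scored only against its own documents)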
- Warm-up learning rate schedule
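A warm-up schedule can be sketched as a function of the training step. The source specifies only that warm-up is used, so the linear shape, the linear decay after warm-up, and every default value below are assumptions made for illustration.

```python
def learning_rate(step, peak_lr=1e-5, warmup_steps=100, total_steps=1000):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        # Ramp up from 0 so early updates do not destabilize the pretrained encoder.
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Learning rates at a few points of a hypothetical 1000-step run.
schedule = [learning_rate(s) for s in (0, 50, 100, 550, 1000)]
```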

Evaluation: Measure recall@k, MRR, and nDCG on retrieval benchmarks such as MS MARCO and BEIR.
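Two of these metrics can be sketched for a single ranked list as follows. This is a simplified illustration with hypothetical document ids. Benchmark toolkits apply their own conventions (for example, a rank cutoff for MRR), which are not reproduced here.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Toy example: one of two relevant documents is found, at rank 2.
ranking = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}
r_at_2 = recall_at_k(ranking, relevant, k=2)
rr = reciprocal_rank(ranking, relevant)
```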
