Principle:FlagOpen FlagEmbedding Contrastive Embedding Training
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Information Retrieval, Contrastive Learning, Text Embeddings |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Bi-encoder contrastive learning for text embeddings: queries and documents are encoded independently (often by a single shared encoder) and trained with in-batch negatives and mined hard negatives to produce discriminative representations.
Description
This principle forms the foundation of the BGE (BAAI General Embedding) family of models. It employs a dual-encoder architecture in which queries and documents are independently encoded into a shared embedding space. Training uses contrastive learning with the InfoNCE loss, treating the documents paired with other queries in the same batch as negatives. Hard negative mining adds challenging negatives that are semantically close to the query but not relevant, forcing the model to learn fine-grained distinctions. The approach scales efficiently to large datasets through in-batch sampling and supports various retrieval tasks (web search, question answering, semantic similarity) through multi-task training. The resulting embeddings can be compared with simple cosine similarity at inference time, enabling fast retrieval with approximate nearest neighbor search.
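The in-batch objective described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the FlagEmbedding implementation: the function names (`l2_normalize`, `cosine_matrix`, `in_batch_info_nce`) and the default temperature of 0.05 are assumptions chosen for the example, and real training would operate on encoder outputs in a tensor library.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_matrix(queries, docs):
    """Return sims[i][j] = cosine similarity between queries[i] and docs[j]."""
    q = [l2_normalize(v) for v in queries]
    d = [l2_normalize(v) for v in docs]
    return [[sum(a * b for a, b in zip(qi, dj)) for dj in d] for qi in q]


def in_batch_info_nce(queries, docs, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    docs[i] is assumed to be the positive for queries[i]; every other
    docs[j] in the batch serves as a negative for queries[i].
    The temperature default is illustrative only.
    """
    sims = cosine_matrix(queries, docs)
    losses = []
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax probability assigned to the positive document.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

In this sketch, a batch whose positives are correctly aligned yields a loss near zero, while a batch in which every document looks identical to each query yields a loss of log(batch size), the value of an uninformative model.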
Usage
Use this principle when:
- Training general-purpose text embedding models
- Building bi-encoder retrieval systems for semantic search
- Creating embeddings for document clustering or similarity tasks
- Developing foundational retrieval models that can be fine-tuned for specific domains
Theoretical Basis
The contrastive training framework consists of:
- Dual Encoders:
  - Query encoder: q = f_q(Q) where q ∈ R^d
  - Document encoder: d = f_d(D) where d ∈ R^d
  - Often f_q = f_d (shared encoder) for efficiency
- InfoNCE Loss:
  - Similarity score: s(q, d) = (q · d) / (||q|| ||d||)
  - Loss: L = -log(exp(s(q, d+)/τ) / (exp(s(q, d+)/τ) + Σ_i exp(s(q, d_i^-)/τ)))
  - Where d+ is the positive document, d_i^- are the negatives (in-batch and mined), and τ is the temperature
- Hard Negative Mining:
  - Retrieve challenging negatives: d^- = argmax_{d∈corpus, d≠d+} s(q, d)
  - In practice, negatives are often sampled from a top-k range rather than always taking the single top hit, which lowers the risk of picking unlabeled positives
  - Mix with random negatives for balanced training
  - Improves discriminative power
- Training Strategy:
  - Use large batch sizes (hundreds to thousands) for diverse negatives
  - Gradient accumulation for memory efficiency (accumulation alone does not enlarge the pool of in-batch negatives per step)
  - Warm-up learning rate schedule
- Evaluation: Measure recall@k, MRR, and nDCG on retrieval benchmarks such as MS MARCO and BEIR
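The mining and evaluation steps listed above can be sketched as follows. This is a hedged illustration using only the standard library: the helper names (`rank_corpus`, `mine_hard_negatives`, `recall_at_k`, `reciprocal_rank`) and the parameters `top_k`, `n_random`, and `seed` are hypothetical, and a real pipeline would rank with an approximate nearest neighbor index over encoder embeddings rather than a brute-force loop.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity s(q, d) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def rank_corpus(query_vec, corpus_vecs):
    """Return corpus indices sorted by descending similarity to the query."""
    scores = [cosine(query_vec, d) for d in corpus_vecs]
    return sorted(range(len(corpus_vecs)), key=lambda i: scores[i], reverse=True)


def mine_hard_negatives(query_vec, corpus_vecs, positive_ids,
                        top_k=2, n_random=1, seed=0):
    """Pick the top_k highest-scoring non-positive documents as hard
    negatives, plus n_random negatives drawn from the remaining documents."""
    rng = random.Random(seed)
    ranked = [i for i in rank_corpus(query_vec, corpus_vecs)
              if i not in positive_ids]
    hard = ranked[:top_k]
    rest = ranked[top_k:]
    rand = rng.sample(rest, min(n_random, len(rest)))
    return hard, rand


def recall_at_k(ranked, positive_ids, k):
    """Fraction of the positives that appear in the top k results."""
    hits = sum(1 for i in ranked[:k] if i in positive_ids)
    return hits / len(positive_ids)


def reciprocal_rank(ranked, positive_ids):
    """1 / rank of the first positive, or 0.0 if none is retrieved.
    Averaging this value over queries gives MRR."""
    for rank, i in enumerate(ranked, start=1):
        if i in positive_ids:
            return 1.0 / rank
    return 0.0


# Toy 2-D "embeddings"; index 0 is the labeled positive for the query.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]
hard, rand = mine_hard_negatives(query, corpus, positive_ids={0})
ranked = rank_corpus(query, corpus)
print(hard, rand, recall_at_k(ranked, {0}, 1), reciprocal_rank(ranked, {0}))
```

Excluding the labeled positives before taking the top-ranked documents mirrors the d ≠ d+ constraint in the mining rule; the random draw from the remaining documents corresponds to mixing in easier negatives.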