Principle: AnswerDotAI RAGatouille Hard Negative Mining
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Training, Negative_Sampling |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
A training data augmentation technique that uses dense embedding models and approximate nearest-neighbor search to identify challenging negative examples for contrastive learning in retrieval models.
Description
Hard Negative Mining improves retrieval model training by providing more informative negative examples. Instead of random negatives (which are trivially distinguishable from positives), hard negatives are documents that are semantically similar to the query but not actually relevant. These challenging examples force the model to learn finer-grained distinctions between relevant and irrelevant documents.
The approach uses a three-stage pipeline:
- Embedding: A pre-trained dense embedding model (e.g., BGE, GTE, E5) encodes all documents into fixed-size vectors
- ANN Search: A Voyager approximate nearest-neighbor index enables fast retrieval of similar documents for each query
- Rank Filtering: Documents ranked between min_rank (typically 10) and max_rank (~110) are selected as hard negatives, avoiding both trivially easy negatives and potential false negatives in the top ranks
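The three stages above can be sketched as follows. This is a brute-force, numpy-only stand-in for illustration: a real pipeline would encode documents with a pre-trained embedding model (e.g., BGE) and query a Voyager ANN index instead of computing the full similarity matrix, and the function and parameter names here are illustrative, not from RAGatouille's API.

```python
import numpy as np

def mine_hard_negatives(query_vecs, doc_vecs, min_rank=10, max_rank=110):
    """For each query, rank all documents by cosine similarity and keep
    those ranked in [min_rank, max_rank) as hard-negative candidates.

    Brute-force stand-in for the embedding + ANN-search stages.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                      # (n_queries, n_docs)
    ranked = np.argsort(-sims, axis=1)  # doc indices, best-first
    # Skip the top ranks (possible false negatives) and the tail
    # (trivially easy negatives).
    return ranked[:, min_rank:max_rank]

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))   # stand-ins for embedded queries
docs = rng.normal(size=(200, 16))    # stand-ins for embedded documents
negatives = mine_hard_negatives(queries, docs, min_rank=10, max_rank=110)
print(negatives.shape)  # (4, 100): 100 hard-negative candidates per query
```

With the defaults above, each query gets the documents ranked 10 through 109 as its candidate pool; a sample of these is typically drawn per training example.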
The technique supports multiple languages through language-specific embedding models and scales to large corpora via multi-process encoding and ANN indexing.
Usage
Use this principle when preparing training data for ColBERT fine-tuning. Hard negative mining is most beneficial when:
- Training data consists of positive-only pairs (no provided negatives)
- The document collection is large enough to contain semantically similar but irrelevant passages
- Higher retrieval quality is desired, at the cost of additional data preparation time
For very small datasets, random negative sampling may be sufficient.
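In the positive-only case, mined negatives turn each (query, positive) pair into (query, positive, negative) training triples. A minimal sketch, with a hypothetical helper name and toy strings:

```python
def build_triples(pairs, mined_negatives):
    """Combine positive-only (query, positive) pairs with mined
    negatives into (query, positive, negative) training triples."""
    triples = []
    for (query, positive), negatives in zip(pairs, mined_negatives):
        for negative in negatives:
            if negative != positive:  # guard against the labeled positive
                triples.append((query, positive, negative))
    return triples

pairs = [("what is colbert?", "ColBERT is a late-interaction retriever.")]
# Mined candidates for the query above; one is the positive itself.
mined = [["BERT is a language model.",
          "ColBERT is a late-interaction retriever."]]
triples = build_triples(pairs, mined)
print(len(triples))  # 1: the positive slipped into the pool and is dropped
```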
Theoretical Basis
Why Hard Negatives Matter:
In contrastive learning, the gradient signal a negative example contributes is proportional to the softmax weight the model assigns to it, which grows with the negative's similarity to the query.
Random negatives score low similarity and thus contribute weak, uninformative gradients. Hard negatives score higher and contribute stronger, more informative gradients.
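This can be made precise under a standard InfoNCE-style contrastive loss (an assumption; the source does not name the loss). With score $s(q, d)$ and temperature $\tau$:

```latex
% InfoNCE loss over one positive d^+ and a set of negatives N
\mathcal{L} = -\log \frac{\exp(s(q, d^+)/\tau)}
                         {\exp(s(q, d^+)/\tau) + \sum_{d^- \in N} \exp(s(q, d^-)/\tau)}

% Gradient w.r.t. a negative's score: its softmax probability
\frac{\partial \mathcal{L}}{\partial s(q, d^-)} = \frac{1}{\tau}\, p(d^-),
\qquad
p(d^-) = \frac{\exp(s(q, d^-)/\tau)}{\exp(s(q, d^+)/\tau) + \sum_{d' \in N} \exp(s(q, d')/\tau)}
```

Since $p(d^-)$ is monotonically increasing in $s(q, d^-)$, negatives that score similarly to the query dominate the gradient, while low-scoring random negatives contribute almost nothing.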
Rank-Based Selection:
Selecting negatives from ranks 10-110 (rather than the top ranks) avoids:
- False negatives: top-ranked documents may actually be relevant but unlabeled
- Trivial negatives: low-ranked documents are too easy to distinguish
This "sweet spot" provides negative examples that are challenging but unlikely to be mislabeled.
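The rank-window selection, plus a final guard against any labeled positives that land inside the window, can be written in a few lines (function and parameter names are illustrative):

```python
def select_hard_negatives(ranked_doc_ids, known_positive_ids,
                          min_rank=10, max_rank=110):
    """Take the rank window [min_rank, max_rank) from a best-first
    ranking and drop any labeled positives that slipped into it."""
    window = ranked_doc_ids[min_rank:max_rank]
    return [doc_id for doc_id in window
            if doc_id not in known_positive_ids]

ranked = list(range(200))  # doc ids in best-first order (toy data)
negatives = select_hard_negatives(ranked, known_positive_ids={0, 15})
print(len(negatives))  # 99: ranks 10..109, minus the labeled positive 15
```

Note that the gold positive at rank 0 is already excluded by the window itself; the explicit filter only matters for positives ranked inside 10-109.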
Multi-Language Support:
Language-specific embedding models are selected based on a language code:
- English: BAAI/bge (small/base/large)
- Chinese: thenlper/gte-zh
- French: OrdalieTech/Solon
- Other: intfloat/multilingual-e5
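The selection logic amounts to a lookup table with a multilingual fallback. A minimal sketch using the model families listed above; the exact checkpoint names (version suffixes, size variants) are assumptions, not confirmed by the source:

```python
# Language code -> embedding model. Checkpoint names are illustrative
# guesses within the families named above.
LANGUAGE_MODELS = {
    "en": "BAAI/bge-small-en-v1.5",
    "zh": "thenlper/gte-base-zh",
    "fr": "OrdalieTech/Solon-embeddings-base-0.1",
}
DEFAULT_MODEL = "intfloat/multilingual-e5-base"  # fallback for other languages

def model_for_language(lang_code: str) -> str:
    """Return the embedding model for a language, falling back to the
    multilingual default for unlisted languages."""
    return LANGUAGE_MODELS.get(lang_code, DEFAULT_MODEL)

print(model_for_language("de"))  # intfloat/multilingual-e5-base
```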