Principle: AnswerDotAI RAGatouille Hard Negative Mining
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Training, Negative_Sampling |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
A training data augmentation technique that uses dense embedding models and approximate nearest-neighbor search to identify challenging negative examples for contrastive learning in retrieval models.
Description
Hard Negative Mining improves retrieval model training by providing more informative negative examples. Instead of random negatives (which are trivially distinguishable from positives), hard negatives are documents that are semantically similar to the query but not actually relevant. These challenging examples force the model to learn finer-grained distinctions between relevant and irrelevant documents.
The approach uses a three-stage pipeline:
- Embedding: A pre-trained dense embedding model (e.g., BGE, GTE, E5) encodes all documents into fixed-size vectors
- ANN Search: A Voyager approximate nearest-neighbor index enables fast retrieval of similar documents for each query
- Rank Filtering: Documents ranked between min_rank (typically 10) and max_rank (~110) are selected as hard negatives, avoiding both trivially easy negatives and potential false negatives in the top ranks
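The three stages above can be sketched as follows. This is a brute-force, numpy-only stand-in for illustration: a real pipeline would encode documents with a pre-trained embedding model (e.g., BGE) and query a Voyager ANN index instead of computing the full similarity matrix, and the function and parameter names here are illustrative, not from RAGatouille's API.

```python
import numpy as np

def mine_hard_negatives(query_vecs, doc_vecs, min_rank=10, max_rank=110):
    """For each query, rank all documents by cosine similarity and keep
    those ranked in [min_rank, max_rank) as hard-negative candidates.

    Brute-force stand-in for the embedding + ANN-search stages.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                      # (n_queries, n_docs)
    ranked = np.argsort(-sims, axis=1)  # doc indices, best-first
    # Skip the top ranks (possible false negatives) and the tail
    # (trivially easy negatives).
    return ranked[:, min_rank:max_rank]

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))   # stand-ins for embedded queries
docs = rng.normal(size=(200, 16))    # stand-ins for embedded documents
negatives = mine_hard_negatives(queries, docs, min_rank=10, max_rank=110)
print(negatives.shape)  # (4, 100): 100 hard-negative candidates per query
```

With the defaults above, each query gets the documents ranked 10 through 109 as its candidate pool; a sample of these is typically drawn per training example.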
The technique supports multiple languages through language-specific embedding models and scales to large corpora via multi-process encoding and ANN indexing.
Usage
Use this principle when preparing training data for ColBERT fine-tuning. Hard negative mining is most beneficial when:
- Training data consists of positive-only pairs (no provided negatives)
- The document collection is large enough to contain semantically similar but irrelevant passages
- Higher retrieval quality is desired, at the cost of additional data preparation time
For very small datasets, random negative sampling may be sufficient.
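In the positive-only case, mined negatives turn each (query, positive) pair into (query, positive, negative) training triples. A minimal sketch, with a hypothetical helper name and toy strings:

```python
def build_triples(pairs, mined_negatives):
    """Combine positive-only (query, positive) pairs with mined
    negatives into (query, positive, negative) training triples."""
    triples = []
    for (query, positive), negatives in zip(pairs, mined_negatives):
        for negative in negatives:
            if negative != positive:  # guard against the labeled positive
                triples.append((query, positive, negative))
    return triples

pairs = [("what is colbert?", "ColBERT is a late-interaction retriever.")]
# Mined candidates for the query above; one is the positive itself.
mined = [["BERT is a language model.",
          "ColBERT is a late-interaction retriever."]]
triples = build_triples(pairs, mined)
print(len(triples))  # 1: the positive slipped into the pool and is dropped
```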
Theoretical Basis
Why Hard Negatives Matter:
In contrastive learning, the gradient signal a negative example contributes is proportional to the softmax weight the model assigns to it, which grows with the negative's similarity to the query.
Random negatives score low similarity and thus contribute weak, uninformative gradients. Hard negatives score higher and contribute stronger, more informative gradients.
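This can be made precise under a standard InfoNCE-style contrastive loss (an assumption; the source does not name the loss). With score $s(q, d)$ and temperature $\tau$:

```latex
% InfoNCE loss over one positive d^+ and a set of negatives N
\mathcal{L} = -\log \frac{\exp(s(q, d^+)/\tau)}
                         {\exp(s(q, d^+)/\tau) + \sum_{d^- \in N} \exp(s(q, d^-)/\tau)}

% Gradient w.r.t. a negative's score: its softmax probability
\frac{\partial \mathcal{L}}{\partial s(q, d^-)} = \frac{1}{\tau}\, p(d^-),
\qquad
p(d^-) = \frac{\exp(s(q, d^-)/\tau)}{\exp(s(q, d^+)/\tau) + \sum_{d' \in N} \exp(s(q, d')/\tau)}
```

Since $p(d^-)$ is monotonically increasing in $s(q, d^-)$, negatives that score similarly to the query dominate the gradient, while low-scoring random negatives contribute almost nothing.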
Rank-Based Selection:
Selecting negatives from ranks 10-110 (rather than the top ranks) avoids:
- False negatives: top-ranked documents may actually be relevant but unlabeled
- Trivial negatives: low-ranked documents are too easy to distinguish
This "sweet spot" provides negative examples that are challenging but unlikely to be mislabeled.
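The rank-window selection, plus a final guard against any labeled positives that land inside the window, can be written in a few lines (function and parameter names are illustrative):

```python
def select_hard_negatives(ranked_doc_ids, known_positive_ids,
                          min_rank=10, max_rank=110):
    """Take the rank window [min_rank, max_rank) from a best-first
    ranking and drop any labeled positives that slipped into it."""
    window = ranked_doc_ids[min_rank:max_rank]
    return [doc_id for doc_id in window
            if doc_id not in known_positive_ids]

ranked = list(range(200))  # doc ids in best-first order (toy data)
negatives = select_hard_negatives(ranked, known_positive_ids={0, 15})
print(len(negatives))  # 99: ranks 10..109, minus the labeled positive 15
```

Note that the gold positive at rank 0 is already excluded by the window itself; the explicit filter only matters for positives ranked inside 10-109.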
Multi-Language Support:
Language-specific embedding models are selected based on a language code:
- English: BAAI/bge (small/base/large)
- Chinese: thenlper/gte-zh
- French: OrdalieTech/Solon
- Other: intfloat/multilingual-e5
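The selection logic amounts to a lookup table with a multilingual fallback. A minimal sketch using the model families listed above; the exact checkpoint names (version suffixes, size variants) are assumptions, not confirmed by the source:

```python
# Language code -> embedding model. Checkpoint names are illustrative
# guesses within the families named above.
LANGUAGE_MODELS = {
    "en": "BAAI/bge-small-en-v1.5",
    "zh": "thenlper/gte-base-zh",
    "fr": "OrdalieTech/Solon-embeddings-base-0.1",
}
DEFAULT_MODEL = "intfloat/multilingual-e5-base"  # fallback for other languages

def model_for_language(lang_code: str) -> str:
    """Return the embedding model for a language, falling back to the
    multilingual default for unlisted languages."""
    return LANGUAGE_MODELS.get(lang_code, DEFAULT_MODEL)

print(model_for_language("de"))  # intfloat/multilingual-e5-base
```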