Principle:FlagOpen FlagEmbedding Hard Negative Mining
Overview
Hard negative mining identifies challenging negative examples by retrieving passages that score as similar to a query but are not relevant to it. Training on these negatives improves the discriminative ability of a model trained with contrastive learning.
Description
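Randomly sampled negatives are usually too easy for the model to distinguish from positives, so they contribute little to training. Hard negatives are passages that an embedding model ranks highly for a query but that are not in that query's positive set. The hn_mine.py script mines them in four steps: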
- Encodes all corpus passages with an embedder
- Builds a FAISS index
- Retrieves top-k candidates per query
- Filters out positives and samples from a specified rank range (e.g., 10-210) to get hard negatives
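Sampling from a rank range, rather than from the top of the list, skips the highest-ranked candidates, which are likely to be unlabeled positives (false negatives), as well as the lowest-ranked ones, which are too easy to be useful.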
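The steps above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the code of hn_mine.py: the function name `mine_hard_negatives`, its parameters, and the toy 2-D vectors are hypothetical, and a brute-force inner product stands in for both the embedder and the FAISS index.

```python
import random


def inner_product(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(query_vec, positives, corpus, corpus_vecs,
                        rank_range=(10, 210), negative_number=15, seed=0):
    """Sample hard negatives for one query (hypothetical helper).

    Assumes corpus_vecs[i] is the embedding of corpus[i]. The search is an
    exact brute-force inner product, standing in for a FAISS flat IP index.
    """
    # Score every passage against the query and rank by descending score.
    scores = [inner_product(query_vec, vec) for vec in corpus_vecs]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

    # Keep only the requested rank window (0-based, end exclusive), which
    # skips the very top ranks and everything below the upper bound.
    start, end = rank_range
    window = ranked[start:end]

    # Drop known positives, then sample from what remains.
    candidates = [corpus[i] for i in window if corpus[i] not in positives]
    rng = random.Random(seed)
    return rng.sample(candidates, min(negative_number, len(candidates)))


# Toy data: six passages with made-up 2-D embeddings.
corpus = ["p0", "p1", "p2", "p3", "p4", "p5"]
corpus_vecs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
               [0.5, 0.5], [0.1, 0.9], [0.0, 1.0]]
query_vec = [1.0, 0.0]

negatives = mine_hard_negatives(
    query_vec, positives={"p1"}, corpus=corpus, corpus_vecs=corpus_vecs,
    rank_range=(1, 4), negative_number=2,
)
print(sorted(negatives))
```

In this toy run the top-ranked passage (p0) is skipped as a possible false negative, the labeled positive (p1) is filtered out, and the two lowest-ranked passages fall outside the window, so only p2 and p3 remain as candidates.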
Usage
Run hard negative mining after the initial training data has been prepared and before training begins, to improve the quality of the negatives.
Theoretical Basis
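Hard negatives produce a stronger gradient signal in contrastive learning than easy negatives, which the model already separates from the positives. With a sampling range such as 10-210, the top 10 ranks are skipped because they are likely false negatives, and ranks beyond the upper bound are skipped because they are too easy. FAISS IndexFlatIP performs exact (brute-force) inner product search, with optional GPU acceleration.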
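The gradient claim can be checked numerically. The sketch below assumes an InfoNCE-style softmax loss with a temperature, which the text above does not specify; the function name `negative_gradients`, the similarity scores, and the temperature value are all hypothetical. Under that loss, the gradient with respect to a negative's score is its softmax probability divided by the temperature, so a negative the model already scores low contributes almost nothing.

```python
import math


def negative_gradients(pos_score, neg_scores, temperature=1.0):
    """Gradient of an InfoNCE-style loss w.r.t. each negative's score.

    Assumed loss: -log softmax over [positive, negatives] of score / temperature,
    evaluated at the positive. Its derivative w.r.t. negative j is
    softmax_probability_j / temperature.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [p / temperature for p in probs[1:]]


# One hard negative (score close to the positive) and one easy negative.
grads = negative_gradients(pos_score=0.8, neg_scores=[0.75, 0.1], temperature=0.05)
hard_grad, easy_grad = grads
print(f"hard negative gradient: {hard_grad:.4g}")
print(f"easy negative gradient: {easy_grad:.4g}")
```

With these made-up scores the hard negative's gradient is several orders of magnitude larger than the easy negative's, which is the effect the mining step is meant to exploit.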