Principle:FlagOpen FlagEmbedding Hard Negative Mining

Overview

A technique that identifies challenging negative examples by retrieving passages that are similar to the query but not relevant, improving the discriminative ability of contrastive learning.

Description

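Randomly sampled negatives are usually too easy for the model to distinguish from positives, so they add little training signal. Hard negatives are passages that an embedding model ranks highly for a query but that are not in that query's positive set. The hn_mine.py script: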

1. Encodes all corpus passages with an embedder
2. Builds a FAISS index
3. Retrieves top-k candidates per query
4. Filters out positives and samples from a specified rank range (e.g., 10-210) to get hard negatives

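Sampling from a rank range skips the top-ranked passages, which are likely false negatives (relevant passages that were never labeled as positives), and the low-ranked ones, which are too easy to be useful.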
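The retrieval, filtering, and sampling steps above can be sketched in plain Python. This is a minimal illustration, not the hn_mine.py implementation: a brute-force inner product stands in for the embedder and the FAISS index, the vectors are random, and the names mine_hard_negatives, rank_range, and num_negatives are hypothetical.

```python
import random

def mine_hard_negatives(query_vec, corpus_vecs, positive_ids,
                        rank_range=(10, 210), num_negatives=15, seed=0):
    """Sample passage ids from a rank window, excluding known positives.

    All names and defaults here are illustrative assumptions.
    """
    # Score every passage by inner product (brute force stands in for FAISS).
    scores = [sum(q * p for q, p in zip(query_vec, vec)) for vec in corpus_vecs]
    # Rank passage ids from most to least similar to the query.
    ranked = sorted(range(len(corpus_vecs)), key=lambda i: scores[i], reverse=True)
    # Keep only the requested rank window, then drop known positives.
    start, end = rank_range
    window = [i for i in ranked[start:end] if i not in positive_ids]
    # Sample without replacement; return fewer if the window is small.
    rng = random.Random(seed)
    return rng.sample(window, min(num_negatives, len(window)))

# Toy data: 300 random 8-dimensional "embeddings"; passage 0 is the positive.
data_rng = random.Random(42)
corpus = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(300)]
query = corpus[0]
negatives = mine_hard_negatives(query, corpus, positive_ids={0})
```

The window is sliced before positives are removed, so ranks refer to positions in the raw retrieval list. A positive that falls inside the window simply shrinks the candidate pool.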

Usage

Run hard negative mining after preparing the initial training data and before training, to improve the quality of the negatives.

Theoretical Basis

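Hard negatives increase the gradient signal in contrastive learning: a negative that scores close to the positive contributes more to the loss than one the model already separates easily. The sampling range avoids ranks 1-10 (likely false negatives) and very low ranks (too easy). FAISS IndexFlatIP enables exact inner product search with optional GPU acceleration.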
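The gradient claim can be checked numerically. The sketch below assumes an InfoNCE-style softmax loss, which this page does not specify, with the positive at index 0. The function name infonce_grads and the example scores are hypothetical. For such a loss, the gradient with respect to each negative's score equals that negative's softmax probability divided by the temperature, so higher-scoring negatives receive larger gradients.

```python
import math

def infonce_grads(scores, temperature=1.0):
    """Gradient of -log softmax(scores / T)[0] w.r.t. each score.

    Index 0 is the positive; the remaining entries are negatives.
    """
    scaled = [s / temperature for s in scores]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # d(loss)/d(score_i) = (p_i - 1[i == 0]) / T
    return [(p - (1.0 if i == 0 else 0.0)) / temperature
            for i, p in enumerate(probs)]

# Positive scored 0.9, a hard negative 0.8, an easy negative 0.1.
grads = infonce_grads([0.9, 0.8, 0.1])
```

Here the hard negative (index 1) receives a larger gradient than the easy one (index 2), and the positive's gradient is negative, which pushes its score up.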
