Implementation:FlagOpen FlagEmbedding BGE Hn Mine

Knowledge Sources	FlagOpen_FlagEmbedding
Domains	Machine_Learning, Information_Retrieval, Data_Mining
Last Updated	2026-02-09 00:00 GMT

Overview

Hard negative mining utility for improving embedding model training by finding challenging but incorrect passages.

Description

This implementation mines hard negatives for training data by using an embedding model to find passages that are semantically similar to queries but are not labeled as positive examples. The find_knn_neg() function:

1. Encodes all positive and existing negative passages along with a candidate pool into dense embeddings
2. Creates a FAISS index for efficient k-nearest neighbor search
3. For each query, retrieves top-k similar passages from a specified range (e.g., rank 10-210)
4. Filters out positives and the query itself to obtain hard negatives
5. Samples a fixed number of hard negatives per query, backfilling with random samples if needed
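The steps above can be sketched in a few lines. This is a minimal illustration, not the code in hn_mine.py: hand-written vectors stand in for the model's embeddings (step 1), a brute-force NumPy inner product stands in for the FAISS index (step 2), and the backfill logic is simplified. The function name mine_hard_negatives and all data are hypothetical.

```python
import random
import numpy as np


def mine_hard_negatives(q_vecs, p_vecs, queries, positives, corpus,
                        sample_range, negative_number, seed=0):
    """Toy version of the mining steps, with NumPy in place of FAISS."""
    rng = random.Random(seed)
    # Step 2: score every query against every passage (inner product),
    # then sort passage ids best-first for each query.
    scores = q_vecs @ p_vecs.T
    ranked = np.argsort(-scores, axis=1, kind="stable")
    start, end = sample_range
    mined = []
    for qi, query in enumerate(queries):
        # Step 3: keep only the requested rank window.
        window = ranked[qi, start:end]
        # Step 4: drop labeled positives and the query text itself.
        kept = [corpus[i] for i in window
                if corpus[i] not in positives[qi] and corpus[i] != query]
        # Step 5: downsample, or backfill with random non-positive passages.
        if len(kept) > negative_number:
            kept = rng.sample(kept, negative_number)
        elif len(kept) < negative_number:
            pool = [p for p in corpus
                    if p not in positives[qi] and p != query and p not in kept]
            kept += rng.sample(pool, min(negative_number - len(kept), len(pool)))
        mined.append(kept)
    return mined


# Hypothetical data: six passages with 2-d "embeddings", one query.
corpus = ["p0", "p1", "p2", "p3", "p4", "p5"]
p_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                   [0.5, 0.5], [0.2, 0.8], [0.0, 1.0]])
queries = ["q"]
q_vecs = np.array([[1.0, 0.0]])
positives = [["p0"]]

result = mine_hard_negatives(q_vecs, p_vecs, queries, positives, corpus,
                             sample_range=[1, 4], negative_number=2)
backfilled = mine_hard_negatives(q_vecs, p_vecs, queries, positives, corpus,
                                 sample_range=[0, 1], negative_number=2)
```

In the first call the window covers ranks 1-3 (p1, p2, p3), so two of those are sampled. In the second call the window holds only the positive p0, so both negatives come from the random backfill.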

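The range-based sampling (e.g., 10-210) skips the top-ranked results (here, the top 10), which are the passages most likely to be unlabeled positives, while still drawing from passages similar enough to the query to be challenging. Moving the range further down the ranking yields easier negatives. Training on negatives mined this way helps the model learn to distinguish truly relevant passages from deceptively similar ones.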
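To make the window concrete, the sketch below assumes sample_range is applied as an ordinary zero-based, half-open Python slice over a best-first ranking. That is an assumption about the implementation, not something this page states. The helper rank_window is hypothetical.

```python
def rank_window(ranked_ids, sample_range):
    """Return the part of a best-first ranking selected by [start, end]."""
    start, end = sample_range
    return ranked_ids[start:end]


# Hypothetical ranking of 300 passages: id 0 is the best match.
ranked_ids = list(range(300))
window = rank_window(ranked_ids, [10, 210])
print(len(window), window[0], window[-1])  # 200 10 209

# A ranking shorter than the window simply yields fewer candidates.
short = rank_window(list(range(50)), [10, 210])
print(len(short))  # 40
```

Under this reading, [10, 210] yields at most 200 candidates per query before filtering, which leaves plenty of headroom for a negative_number of 15.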

Usage

Use this during data preparation for embedding model training to enhance existing training datasets with hard negative examples mined from your corpus.

Code Reference

Source Location

Repository: FlagOpen_FlagEmbedding
File: research/baai_general_embedding/finetune/hn_mine.py
Lines: 1-122

Signature

def find_knn_neg(model, input_file, candidate_pool, output_file,
                 sample_range, negative_number, use_gpu)

Import

from research.baai_general_embedding.finetune.hn_mine import find_knn_neg

I/O Contract

Inputs

Name	Type	Required	Description
model	FlagModel	Yes	Embedding model for encoding queries and passages
input_file	str	Yes	Path to JSONL file with query, pos, and optional neg fields
candidate_pool	str/List	Yes	Path to candidate passages or list of passage texts
sample_range	List[int]	Yes	[start, end] range for sampling negatives (e.g., [10, 210])
negative_number	int	Yes	Number of hard negatives to sample per query
use_gpu	bool	Yes	Whether to use GPU for FAISS indexing and search

Outputs

Name	Type	Description
output_file	JSONL	File with each line containing query, pos, and mined neg fields
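A quick sanity check on the output file can confirm the contract described above. The following is a sketch only: check_mined_file is a hypothetical helper, not part of FlagEmbedding, and it assumes each line is a JSON object with query, pos, and neg fields as listed in the table.

```python
import json
import os
import tempfile


def check_mined_file(path, negative_number):
    """Return (line_number, problem) tuples for a mined JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            row = json.loads(line)
            neg = row.get("neg", [])
            if len(neg) != negative_number:
                problems.append(
                    (n, f"expected {negative_number} negatives, found {len(neg)}"))
            if set(neg) & set(row["pos"]):
                problems.append((n, "a positive passage appears in neg"))
            if row["query"] in neg:
                problems.append((n, "the query text appears in neg"))
    return problems


# Demo on a small hand-written file (hypothetical data).
rows = [
    {"query": "q1", "pos": ["a"], "neg": ["b", "c"]},
    {"query": "q2", "pos": ["d"], "neg": ["d", "e"]},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    for row in rows:
        tmp.write(json.dumps(row, ensure_ascii=False) + "\n")

problems = check_mined_file(tmp.name, negative_number=2)
os.remove(tmp.name)
print(problems)  # [(2, 'a positive passage appears in neg')]
```

An empty list means every line has the expected number of negatives and none of them collides with a positive or with the query text.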

Usage Examples

from FlagEmbedding import FlagModel
# Assumes the FlagEmbedding repository root is on PYTHONPATH (e.g., running from the repo root)
from research.baai_general_embedding.finetune.hn_mine import find_knn_neg

# Initialize model
model = FlagModel(
    "BAAI/bge-base-en",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: "
)

# Mine hard negatives
find_knn_neg(
    model=model,
    input_file="train_data.jsonl",
    candidate_pool="corpus.jsonl",
    output_file="train_with_hard_neg.jsonl",
    sample_range=[10, 210],  # Sample from rank 10 to 210
    negative_number=15,      # Mine 15 hard negatives per query
    use_gpu=True
)

# Input format (train_data.jsonl):
# {"query": "what is deep learning", "pos": ["Deep learning is..."]}

# Output format (train_with_hard_neg.jsonl):
# {"query": "what is deep learning", "pos": ["Deep learning is..."],
#  "neg": ["Machine learning...", "Neural networks...", ...]}
