Implementation:FlagOpen FlagEmbedding BGE Hn Mine
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Information_Retrieval, Data_Mining |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Hard negative mining utility for improving embedding model training by finding challenging but incorrect passages.
Description
This implementation mines hard negatives for training data by using an embedding model to find passages that are semantically similar to queries but are not labeled as positive examples. The find_knn_neg() function:
1. Encodes all positive and existing negative passages along with a candidate pool into dense embeddings
2. Creates a FAISS index for efficient k-nearest neighbor search
3. For each query, retrieves top-k similar passages from a specified range (e.g., rank 10-210)
4. Filters out positives and the query itself to obtain hard negatives
5. Samples a fixed number of hard negatives per query, backfilling with random samples if needed
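The steps above can be sketched on toy data. This is a minimal illustration, not the code in hn_mine.py: a brute-force NumPy inner-product search stands in for the FAISS index, the 2-D vectors are hand-written rather than model outputs, and the helper names `search_top_k` and `pick_negatives` are hypothetical.

```python
import random

import numpy as np


def search_top_k(passage_vecs, query_vecs, top_k):
    """Return, per query, the indices of the top_k passages by inner product."""
    scores = query_vecs @ passage_vecs.T  # shape: (num_queries, num_passages)
    order = np.argsort(-scores, axis=1, kind="stable")  # best match first
    return order[:, :top_k]


def pick_negatives(ranked_ids, corpus, query, positives,
                   sample_range, negative_number, rng):
    """Slice the rank window, drop positives and the query, sample, backfill."""
    start, end = sample_range
    candidates = [corpus[i] for i in ranked_ids[start:end]
                  if corpus[i] not in positives and corpus[i] != query]
    if len(candidates) > negative_number:
        candidates = rng.sample(candidates, negative_number)
    if len(candidates) < negative_number:
        # Backfill with random passages that are not positives or duplicates.
        pool = [p for p in corpus
                if p not in positives and p != query and p not in candidates]
        rng.shuffle(pool)
        candidates.extend(pool[:negative_number - len(candidates)])
    return candidates


# Toy corpus with hand-written 2-D "embeddings" (assumption: real vectors
# would come from the embedding model).
corpus = ["p0", "p1", "p2", "p3", "p4", "p5"]
passage_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                         [0.5, 0.5], [0.2, 0.8], [0.0, 1.0]])
query_vecs = np.array([[1.0, 0.0]])

ranked = search_top_k(passage_vecs, query_vecs, top_k=4)
negatives = pick_negatives(ranked[0], corpus, query="q", positives=["p0"],
                           sample_range=(1, 4), negative_number=2,
                           rng=random.Random(0))
print(negatives)
```

Passing a seeded `random.Random` keeps the sampling reproducible, which makes it easier to compare mining runs.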
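The range-based sampling (e.g., 10-210) skips the top-ranked results, which are the most likely to be labeled positives or unlabeled relevant passages (false negatives), and stops before passages so dissimilar that they are trivially easy to reject. The remaining window holds passages that are similar enough to confuse the model, so training on them helps it learn to distinguish truly relevant passages from deceptively similar ones.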
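To make the rank window concrete, the sketch below applies a `[start, end]` pair to a ranked result list. It assumes the range is applied as a zero-based, end-exclusive Python slice; the helper `rank_window` and the passage ids are hypothetical.

```python
def rank_window(ranked_ids, sample_range):
    """Keep only results whose zero-based rank is in [start, end)."""
    start, end = sample_range
    return ranked_ids[start:end]


# Hypothetical passage ids for one query, best match first.
ranked_ids = list(range(1000, 1300))

window = rank_window(ranked_ids, [10, 210])
print(len(window), window[0], window[-1])

# A result list shorter than the end of the range yields a shorter window.
short_window = rank_window(ranked_ids[:50], [10, 210])
print(len(short_window))
```

Under this assumption the search has to retrieve at least `end` results per query, and a small corpus leaves fewer candidates than the window size suggests.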
Usage
Use this during data preparation for embedding model training to enhance existing training datasets with hard negative examples mined from your corpus.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/baai_general_embedding/finetune/hn_mine.py
- Lines: 1-122
Signature
def find_knn_neg(model, input_file, candidate_pool, output_file,
                 sample_range, negative_number, use_gpu)
Import
from research.baai_general_embedding.finetune.hn_mine import find_knn_neg
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | FlagModel | Yes | Embedding model for encoding queries and passages |
| input_file | str | Yes | Path to JSONL file with query, pos, and optional neg fields |
| candidate_pool | str/List | Yes | Path to candidate passages or list of passage texts |
| sample_range | List[int] | Yes | [start, end] range for sampling negatives (e.g., [10, 210]) |
| negative_number | int | Yes | Number of hard negatives to sample per query |
| use_gpu | bool | Yes | Whether to use GPU for FAISS indexing and search |
Outputs
| Name | Type | Description |
|---|---|---|
| output_file | JSONL | File with each line containing query, pos, and mined neg fields |
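As a sketch of the input side of this contract, the snippet below writes a small JSONL file with query, pos, and optional neg fields and runs a basic shape check. The helpers `write_jsonl`, `read_jsonl`, and `check_input_row` are hypothetical, the sample texts are placeholders, and the rule that pos must be a non-empty list is an assumption rather than something the source states.

```python
import json
import os
import tempfile


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_input_row(row):
    """Return a list of problems found in one input record (empty if none)."""
    problems = []
    if not isinstance(row.get("query"), str):
        problems.append("query must be a string")
    # Assumption: every record needs at least one positive passage.
    if not isinstance(row.get("pos"), list) or not row.get("pos"):
        problems.append("pos must be a non-empty list")
    if "neg" in row and not isinstance(row["neg"], list):
        problems.append("neg, when present, must be a list")
    return problems


rows = [
    {"query": "what is deep learning", "pos": ["Deep learning is..."]},
    {"query": "example query", "pos": ["Relevant text"], "neg": ["Unrelated text"]},
]

path = os.path.join(tempfile.mkdtemp(), "train_data.jsonl")
write_jsonl(path, rows)
loaded = read_jsonl(path)
issues = [check_input_row(r) for r in loaded]
print(issues)
```

Running a check like this before mining surfaces malformed records early, before any time is spent on encoding.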
Usage Examples
from FlagEmbedding import FlagModel
from research.baai_general_embedding.finetune.hn_mine import find_knn_neg

# Initialize model
model = FlagModel(
    "BAAI/bge-base-en",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: "
)

# Mine hard negatives
find_knn_neg(
    model=model,
    input_file="train_data.jsonl",
    candidate_pool="corpus.jsonl",
    output_file="train_with_hard_neg.jsonl",
    sample_range=[10, 210],  # Sample from rank 10 to 210
    negative_number=15,      # Mine 15 hard negatives per query
    use_gpu=True
)

# Input format (train_data.jsonl):
# {"query": "what is deep learning", "pos": ["Deep learning is..."]}

# Output format (train_with_hard_neg.jsonl):
# {"query": "what is deep learning", "pos": ["Deep learning is..."],
#  "neg": ["Machine learning...", "Neural networks...", ...]}
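After mining, it can be worth auditing the output records before training. The sketch below is not part of hn_mine.py: the helper `audit_mined_rows` is hypothetical and runs on in-memory records shaped like the output format above, counting rows with fewer negatives than requested and rows whose negatives overlap their positives.

```python
def audit_mined_rows(rows, negative_number):
    """Count rows with too few negatives and rows whose negatives overlap pos."""
    short, overlapping = 0, 0
    for row in rows:
        negs = row.get("neg", [])
        if len(negs) < negative_number:
            short += 1
        if set(negs) & set(row["pos"]):
            overlapping += 1
    return {"rows": len(rows), "short": short, "overlapping": overlapping}


# Hypothetical mined records with placeholder passage texts.
mined = [
    {"query": "q1", "pos": ["a"], "neg": ["b", "c"]},
    {"query": "q2", "pos": ["d"], "neg": ["d", "e"]},  # negative overlaps pos
    {"query": "q3", "pos": ["f"], "neg": ["g"]},       # too few negatives
]

report = audit_mined_rows(mined, negative_number=2)
print(report)
```

A nonzero `short` count usually points to a candidate pool that is too small for the chosen range, while any overlap suggests the positives were not filtered as expected.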