Implementation:FlagOpen FlagEmbedding BGE Hn Mine

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Information_Retrieval, Data_Mining
Last Updated 2026-02-09 00:00 GMT

Overview

Hard negative mining utility for improving embedding model training by finding challenging but incorrect passages.

Description

This implementation mines hard negatives for training data by using an embedding model to find passages that are semantically similar to queries but are not labeled as positive examples. The find_knn_neg() function:

1. Encodes all positive and existing negative passages along with a candidate pool into dense embeddings
2. Creates a FAISS index for efficient k-nearest neighbor search
3. For each query, retrieves top-k similar passages from a specified range (e.g., rank 10-210)
4. Filters out positives and the query itself to obtain hard negatives
5. Samples a fixed number of hard negatives per query, backfilling with random samples if needed
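
The steps above can be sketched in a few lines. This is a minimal illustration, not the hn_mine source: it swaps the FAISS index for a brute-force NumPy inner-product ranking, uses random toy embeddings instead of a real model, and the function name mine_hard_negatives and its parameters are hypothetical.

```python
import random
import numpy as np

def mine_hard_negatives(query_emb, corpus_emb, corpus, queries, positives,
                        sample_range=(10, 210), negative_number=15, seed=0):
    """Hypothetical helper: brute-force NumPy stand-in for the FAISS search."""
    rng = random.Random(seed)
    start, end = sample_range
    scores = query_emb @ corpus_emb.T          # shape: (num_queries, num_passages)
    ranked = np.argsort(-scores, axis=1)       # most similar passage first
    mined = []
    for qi, query in enumerate(queries):
        banned = set(positives[qi]) | {query}  # never use positives or the query itself
        window = ranked[qi, start:end]         # keep only ranks start..end-1
        candidates = [corpus[j] for j in window if corpus[j] not in banned]
        if len(candidates) > negative_number:
            negs = rng.sample(candidates, negative_number)
        else:
            # Too few candidates in the window: backfill with random passages.
            negs = list(candidates)
            pool = [p for p in corpus if p not in banned and p not in negs]
            need = negative_number - len(negs)
            negs += rng.sample(pool, min(need, len(pool)))
        mined.append(negs)
    return mined

# Toy data: random vectors stand in for real model embeddings (assumption).
gen = np.random.default_rng(0)
corpus = [f"passage {i}" for i in range(50)]
corpus_emb = gen.normal(size=(50, 8)).astype("float32")
queries = ["query a", "query b"]
query_emb = gen.normal(size=(2, 8)).astype("float32")
positives = [["passage 3"], ["passage 7", "passage 9"]]

mined = mine_hard_negatives(query_emb, corpus_emb, corpus, queries, positives,
                            sample_range=(2, 20), negative_number=5)
# A very narrow window forces the random backfill path.
backfilled = mine_hard_negatives(query_emb, corpus_emb, corpus, queries, positives,
                                 sample_range=(2, 4), negative_number=5)
```

Brute-force ranking is fine for a toy corpus. For a large candidate pool, an approximate or GPU-backed index such as FAISS is the practical choice, which is what the description above refers to.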

The range-based sampling (e.g., 10-210) skips the highest-ranked passages (here, the top 10), which are the most likely to be unlabeled positives, while still drawing from passages similar enough to confuse the model. Training on these negatives helps the model learn to distinguish truly relevant passages from deceptively similar ones.

Usage

Use this during data preparation for embedding model training to enhance existing training datasets with hard negative examples mined from your corpus.

Code Reference

Source Location

Signature

def find_knn_neg(model, input_file, candidate_pool, output_file,
                 sample_range, negative_number, use_gpu)

Import

from research.baai_general_embedding.finetune.hn_mine import find_knn_neg

I/O Contract

Inputs

Name             Type       Required  Description
model            FlagModel  Yes       Embedding model for encoding queries and passages
input_file       str        Yes       Path to JSONL file with query, pos, and optional neg fields
candidate_pool   str/List   Yes       Path to candidate passages or list of passage texts
sample_range     List[int]  Yes       [start, end] range for sampling negatives (e.g., [10, 210])
negative_number  int        Yes       Number of hard negatives to sample per query
use_gpu          bool       Yes       Whether to use GPU for FAISS indexing and search

Outputs

Name         Type   Description
output_file  JSONL  File with each line containing query, pos, and mined neg fields
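
To make the file contract concrete, here is a small sketch that writes and reads JSONL records with the query, pos, and optional neg fields described above, using only the standard library. The helpers write_jsonl, read_jsonl, and check_record are hypothetical and not part of FlagEmbedding. The checks reflect only the field names listed in this section, so treat them as an assumption rather than the library's own validation.

```python
import json
import os
import tempfile

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def check_record(record):
    """Hypothetical helper: list the problems found in one training record."""
    problems = []
    if not isinstance(record.get("query"), str):
        problems.append("query must be a string")
    if not isinstance(record.get("pos"), list) or not record.get("pos"):
        problems.append("pos must be a non-empty list")
    if "neg" in record and not isinstance(record["neg"], list):
        problems.append("neg must be a list when present")
    return problems

# Round-trip a single input record through a temporary file.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "train_data.jsonl")
write_jsonl(path, [{"query": "what is deep learning", "pos": ["Deep learning is..."]}])
records = read_jsonl(path)
issues = [check_record(r) for r in records]
```

Running a check like this on the input file before mining, and again on the output file afterwards, is a cheap way to catch malformed lines before they reach training.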

Usage Examples

from FlagEmbedding import FlagModel
from research.baai_general_embedding.finetune.hn_mine import find_knn_neg

# Initialize model
model = FlagModel(
    "BAAI/bge-base-en",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: "
)

# Mine hard negatives
find_knn_neg(
    model=model,
    input_file="train_data.jsonl",
    candidate_pool="corpus.jsonl",
    output_file="train_with_hard_neg.jsonl",
    sample_range=[10, 210],  # Sample from rank 10 to 210
    negative_number=15,      # Mine 15 hard negatives per query
    use_gpu=True
)

# Input format (train_data.jsonl):
# {"query": "what is deep learning", "pos": ["Deep learning is..."]}

# Output format (train_with_hard_neg.jsonl):
# {"query": "what is deep learning", "pos": ["Deep learning is..."],
#  "neg": ["Machine learning...", "Neural networks...", ...]}
