
Implementation:FlagOpen FlagEmbedding Hn Mine Script

From Leeroopedia



Overview

scripts/hn_mine.py is a command-line tool that mines hard negatives for embedding training data. It parses its arguments with HfArgumentParser into two dataclasses, DataArgs and ModelArgs.
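To illustrate the "two dataclasses" pattern without depending on the transformers package, the sketch below uses argparse as a stdlib stand-in for HfArgumentParser. The helper name parse_into_dataclasses is hypothetical, and the way the fields are split between ModelArgs and DataArgs is an assumption based on the parameter table on this page, not a copy of the script.

```python
import argparse
from dataclasses import MISSING, dataclass, fields
from typing import Optional


@dataclass
class ModelArgs:
    # Assumed to live in ModelArgs; the page does not say which class owns which field.
    model_name_or_path: str


@dataclass
class DataArgs:
    input_file: str
    output_file: str
    candidate_pool: Optional[str] = None
    range_for_sampling: str = "10-210"
    negative_number: int = 15
    use_gpu_for_searching: bool = False
    search_batch_size: int = 64


def parse_into_dataclasses(argv, *classes):
    """Hypothetical stdlib stand-in for HfArgumentParser: one flag per dataclass field."""
    parser = argparse.ArgumentParser()
    for cls in classes:
        for f in fields(cls):
            flag = "--" + f.name
            if f.default is MISSING:
                # No default means the flag is required.
                parser.add_argument(flag, required=True)
            elif isinstance(f.default, bool):
                # Boolean fields become on/off switches (bool is checked before int).
                parser.add_argument(flag, action="store_true")
            elif isinstance(f.default, int):
                parser.add_argument(flag, type=int, default=f.default)
            else:
                parser.add_argument(flag, default=f.default)
    namespace = vars(parser.parse_args(argv))
    return tuple(
        cls(**{f.name: namespace[f.name] for f in fields(cls)}) for cls in classes
    )


model_args, data_args = parse_into_dataclasses(
    [
        "--model_name_or_path", "BAAI/bge-base-en-v1.5",
        "--input_file", "train_data.jsonl",
        "--output_file", "train_data_minedHN.jsonl",
        "--use_gpu_for_searching",
    ],
    ModelArgs,
    DataArgs,
)
```

Grouping flags into dataclasses keeps model settings and data settings separate while still exposing every field as a command-line flag.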

Signature (CLI)

python scripts/hn_mine.py \
    --model_name_or_path BAAI/bge-base-en-v1.5 \
    --input_file train_data.jsonl \
    --output_file train_data_minedHN.jsonl \
    --range_for_sampling 10-210 \
    --negative_number 15 \
    --use_gpu_for_searching

Key Parameters

Parameter              Type           Default     Description
model_name_or_path     str            (required)  Path or name of the embedding model to use
input_file             str            (required)  Path to the input JSONL training data
output_file            str            (required)  Path for the output JSONL with hard negatives
candidate_pool         Optional[str]  None        Optional separate corpus file for mining
range_for_sampling     str            "10-210"    Rank range to sample hard negatives from
negative_number        int            15          Number of hard negatives to mine per query
use_gpu_for_searching  bool           False       Whether to use GPU for FAISS search
search_batch_size      int            64          Batch size for FAISS search
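Because range_for_sampling is passed as a string such as "10-210", it has to be split into two integers before it can select ranks. The following is a minimal sketch of that step; the function name parse_sampling_range is hypothetical, and treating the two numbers as Python slice bounds (start inclusive, end exclusive) is an assumption rather than documented behavior.

```python
from typing import Tuple


def parse_sampling_range(spec: str) -> Tuple[int, int]:
    """Turn a 'start-end' string into a pair of integer rank bounds."""
    start_text, end_text = spec.split("-")
    start, end = int(start_text), int(end_text)
    if start < 0 or end <= start:
        raise ValueError(f"invalid rank range: {spec!r}")
    return start, end


start, end = parse_sampling_range("10-210")
# Under the slice-bound assumption, ranked_ids[start:end] keeps 200 candidates
# and skips the 10 highest-ranked passages.
window_size = end - start
```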

I/O

Input: JSONL training data (with query, pos, and optionally existing neg fields).

Output: Augmented JSONL with hard negatives populated in the neg field.
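As a rough illustration of the record shape, the sketch below reads one JSONL line, fills its neg field, and writes it back. The field names come from this page; representing pos and neg as lists of strings, the sample text, and the helper names read_jsonl and write_jsonl are assumptions for illustration.

```python
import io
import json
from typing import Dict, List


def read_jsonl(handle) -> List[Dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


def write_jsonl(records: List[Dict], handle) -> None:
    """Serialize each record as a single line of JSON."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# In-memory stand-ins for the input and output files.
source = io.StringIO(
    '{"query": "what is faiss?", '
    '"pos": ["FAISS is a similarity search library."], "neg": []}\n'
)
records = read_jsonl(source)

# A mining step would populate "neg"; here it is filled by hand.
records[0]["neg"] = ["An unrelated passage that looks similar."]

sink = io.StringIO()
write_jsonl(records, sink)
```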

Internal Behavior

  1. Loads embedder via FlagAutoModel.from_finetuned()
  2. Encodes all corpus passages and queries into dense vectors
  3. Builds a FAISS IndexFlatIP index over corpus embeddings
  4. Searches the index to retrieve top-k nearest neighbors per query
  5. Samples negatives from the specified rank range (e.g., ranks 10 through 210)
  6. Filters out any passages that appear in the positive set
  7. Writes the augmented data to the output file
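The search, sampling, and filtering steps above can be sketched in pure Python on toy data. This is not the script's code: a brute-force inner-product ranking stands in for the FAISS IndexFlatIP index, hand-written vectors stand in for model embeddings, and all function names are hypothetical. The sketch drops positives from the rank window before drawing the sample so that the requested count is not reduced afterwards; that ordering is a choice made here.

```python
import random
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_top_k(query_vec, corpus_vecs, k: int) -> List[int]:
    """Brute-force inner-product search; a toy stand-in for a flat IP index."""
    scores = [inner_product(query_vec, vec) for vec in corpus_vecs]
    ranked = sorted(range(len(corpus_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def mine_negatives(
    query_vec,
    positives: List[str],
    corpus: List[str],
    corpus_vecs,
    rank_range: Tuple[int, int],
    negative_number: int,
    rng: random.Random,
) -> List[str]:
    start, end = rank_range
    ranked = search_top_k(query_vec, corpus_vecs, end)
    window = ranked[start:end]  # keep only the requested rank range
    candidates = [corpus[i] for i in window if corpus[i] not in positives]
    if len(candidates) > negative_number:
        candidates = rng.sample(candidates, negative_number)
    return candidates


# Toy corpus with 2-D "embeddings", ordered from most to least similar to the query.
corpus = ["p0", "p1", "p2", "p3", "p4"]
corpus_vecs = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.0, 1.0]]
query_vec = [1.0, 0.0]

negatives = mine_negatives(
    query_vec,
    positives=["p0", "p2"],
    corpus=corpus,
    corpus_vecs=corpus_vecs,
    rank_range=(1, 4),
    negative_number=2,
    rng=random.Random(0),
)
```

Skipping the top ranks avoids passages that are likely to be unlabeled positives, while the upper bound keeps the negatives close enough to the query to remain hard.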

Related Pages

Principle:FlagOpen_FlagEmbedding_Hard_Negative_Mining
