Implementation:FlagOpen FlagEmbedding Hn Mine Script

Overview

The script is a command-line tool that mines hard negatives for embedding-model training data. It parses its arguments with HfArgumentParser into two dataclasses: DataArgs and ModelArgs.
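As an illustration of dataclass-driven argument parsing, here is a minimal sketch that uses the standard-library argparse as a stand-in for HfArgumentParser, so it runs without transformers installed. The split of fields between ModelArgs and DataArgs and the helper parse_into_dataclasses are assumptions made for this example; the script's actual definitions may differ.

```python
import argparse
from dataclasses import MISSING, dataclass, fields
from typing import Optional


@dataclass
class ModelArgs:
    # Assumed grouping: options that describe the embedding model.
    model_name_or_path: str


@dataclass
class DataArgs:
    # Assumed grouping: data and mining options, with the documented defaults.
    input_file: str
    output_file: str
    candidate_pool: Optional[str] = None
    range_for_sampling: str = "10-210"
    negative_number: int = 15
    use_gpu_for_searching: bool = False
    search_batch_size: int = 64


def parse_into_dataclasses(argv, *classes):
    """Hypothetical helper: build one parser from dataclass fields, then split the result."""
    parser = argparse.ArgumentParser()
    for cls in classes:
        for f in fields(cls):
            flag = "--" + f.name
            if f.type is bool:
                # Boolean fields become on/off switches.
                parser.add_argument(flag, action="store_true")
            elif f.default is MISSING:
                # Fields without a default are required on the command line.
                parser.add_argument(flag, type=f.type, required=True)
            else:
                arg_type = int if isinstance(f.default, int) else str
                parser.add_argument(flag, type=arg_type, default=f.default)
    namespace = vars(parser.parse_args(argv))
    return tuple(
        cls(**{f.name: namespace[f.name] for f in fields(cls)}) for cls in classes
    )


model_args, data_args = parse_into_dataclasses(
    [
        "--model_name_or_path", "BAAI/bge-base-en-v1.5",
        "--input_file", "train_data.jsonl",
        "--output_file", "train_data_minedHN.jsonl",
        "--use_gpu_for_searching",
    ],
    ModelArgs,
    DataArgs,
)
```

Grouping options into dataclasses keeps defaults and types in one place, which is the main convenience this parsing style offers.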

Signature (CLI)

python scripts/hn_mine.py \
    --model_name_or_path BAAI/bge-base-en-v1.5 \
    --input_file train_data.jsonl \
    --output_file train_data_minedHN.jsonl \
    --range_for_sampling 10-210 \
    --negative_number 15 \
    --use_gpu_for_searching

Key Parameters

Parameter	Type	Default	Description
`model_name_or_path`	str	(required)	Path or name of the embedding model to use
`input_file`	str	(required)	Path to the input JSONL training data
`output_file`	str	(required)	Path for the output JSONL with hard negatives
`candidate_pool`	Optional[str]	None	Optional separate corpus file for mining
`range_for_sampling`	str	"10-210"	Rank range to sample hard negatives from
`negative_number`	int	15	Number of hard negatives to mine per query
`use_gpu_for_searching`	bool	False	Whether to use GPU for FAISS search
`search_batch_size`	int	64	Batch size for FAISS search
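The range_for_sampling value is a string such as "10-210", so it has to be split into two integers before it can bound the ranked results. A minimal sketch follows; the function name parse_sample_range is hypothetical, and treating the two numbers as Python slice bounds (start inclusive, end exclusive) is an assumption of this example.

```python
from typing import Tuple


def parse_sample_range(spec: str) -> Tuple[int, int]:
    """Turn a string like "10-210" into integer bounds (start, end)."""
    start_text, end_text = spec.split("-")
    start, end = int(start_text), int(end_text)
    if start < 0 or end <= start:
        raise ValueError(f"invalid sampling range: {spec!r}")
    return start, end


start, end = parse_sample_range("10-210")

# With slice semantics, the search must return at least `end` neighbors per query.
ranked_ids = list(range(300))       # placeholder for one query's ranked corpus ids
window = ranked_ids[start:end]      # candidates eligible to become hard negatives
```

Skipping the first few ranks is the point of the lower bound: the very top results are the most likely to be unlabeled positives rather than true negatives.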

I/O

Input: JSONL training data (with query, pos, and optionally existing neg fields).

Output: Augmented JSONL with hard negatives populated in the neg field.
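To make the record shape concrete, here is a small sketch of reading and writing JSONL with the three fields named above. It assumes that pos and neg hold lists of passage strings; the helper names read_jsonl and write_jsonl and the sample texts are illustrative only.

```python
import io
import json


def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


def write_jsonl(records, handle):
    """Serialize each record as a single line of JSON."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# An in-memory stand-in for the input file: one record without a neg field.
source = io.StringIO(
    '{"query": "what is faiss", "pos": ["FAISS is a similarity search library."]}\n'
)
records = read_jsonl(source)

# After mining, each record carries its hard negatives in the neg field.
records[0]["neg"] = ["Pandas is a data analysis library."]

sink = io.StringIO()
write_jsonl(records, sink)
output_text = sink.getvalue()
```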

Internal Behavior

Loads embedder via FlagAutoModel.from_finetuned()
Encodes all corpus passages and queries into dense vectors
Builds a FAISS IndexFlatIP index over corpus embeddings
Searches the index to retrieve top-k nearest neighbors per query
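Restricts each query's results to the specified rank range (e.g., ranks 10 through 210)
Filters out any passages that appear in the query's positive set, then randomly samples up to negative_number of the remaining passages as hard negatives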
Writes the augmented data to the output file
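The search and sampling steps above can be sketched in pure Python. This is not the script's code: a brute-force inner-product ranking stands in for the FAISS IndexFlatIP search, toy vectors replace the embedder's output, and the function names are hypothetical. The sketch filters positives before sampling so that the requested number of negatives is met whenever enough candidates remain; that ordering is an assumption.

```python
import random
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_top_k(query_vecs, corpus_vecs, k: int) -> List[List[int]]:
    """Exact inner-product search (brute-force stand-in for a flat IP index)."""
    results = []
    for q in query_vecs:
        order = sorted(
            range(len(corpus_vecs)),
            key=lambda i: inner_product(q, corpus_vecs[i]),
            reverse=True,
        )
        results.append(order[:k])
    return results


def mine_negatives(
    records: List[Dict],
    corpus: List[str],
    ranked_ids: List[List[int]],
    sample_range: Tuple[int, int],
    negative_number: int,
    rng: random.Random,
) -> List[Dict]:
    start, end = sample_range
    for record, ids in zip(records, ranked_ids):
        window = ids[start:end]  # keep only the requested rank range
        candidates = [corpus[i] for i in window if corpus[i] not in record["pos"]]
        if len(candidates) > negative_number:
            candidates = rng.sample(candidates, negative_number)
        record["neg"] = candidates
    return records


# Toy data: four passages with 2-d vectors, one query whose positive is "a".
corpus = ["a", "b", "c", "d"]
corpus_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
records = [{"query": "q", "pos": ["a"]}]
query_vecs = [[1.0, 0.0]]

ranked = search_top_k(query_vecs, corpus_vecs, k=4)
mined = mine_negatives(records, corpus, ranked, (0, 3), 1, random.Random(0))
```

On real data the encoding and search would be batched and run through the embedder and FAISS; the filtering and sampling logic is independent of how the ranked ids were produced.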
