Implementation:FlagOpen FlagEmbedding Hn Mine Script
Overview
The script is a CLI tool using HfArgumentParser with two dataclasses: DataArgs and ModelArgs.
Signature (CLI)
python scripts/hn_mine.py \
--model_name_or_path BAAI/bge-base-en-v1.5 \
--input_file train_data.jsonl \
--output_file train_data_minedHN.jsonl \
--range_for_sampling 10-210 \
--negative_number 15 \
--use_gpu_for_searching
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name_or_path | str | (required) | Path or name of the embedding model to use |
| input_file | str | (required) | Path to the input JSONL training data |
| output_file | str | (required) | Path for the output JSONL with hard negatives |
| candidate_pool | Optional[str] | None | Optional separate corpus file for mining |
| range_for_sampling | str | "10-210" | Rank range to sample hard negatives from |
| negative_number | int | 15 | Number of hard negatives to mine per query |
| use_gpu_for_searching | bool | False | Whether to use GPU for FAISS search |
| search_batch_size | int | 64 | Batch size for FAISS search |
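The interaction between range_for_sampling and negative_number can be illustrated with a small sketch. The helper names `parse_rank_range` and `sample_from_range` are hypothetical and do not come from the script. The half-open slice `[start, end)` and the behavior when the window holds fewer items than requested are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple


def parse_rank_range(spec: str) -> Tuple[int, int]:
    """Parse a 'start-end' string such as '10-210' into two integers."""
    start_text, end_text = spec.split("-")
    start, end = int(start_text), int(end_text)
    if start < 0 or end <= start:
        raise ValueError(f"invalid rank range: {spec!r}")
    return start, end


def sample_from_range(
    ranked_ids: Sequence[int],
    spec: str,
    negative_number: int,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick up to `negative_number` ids from the rank window given by `spec`.

    Assumption: the window is the slice ranked_ids[start:end], so the
    `start` most similar results are skipped entirely.
    """
    rng = rng or random.Random()
    start, end = parse_rank_range(spec)
    window = list(ranked_ids[start:end])
    if len(window) <= negative_number:
        # Assumption: return the whole window if it is too small to sample from.
        return window
    return rng.sample(window, negative_number)


# Usage: 300 ranked results, keep 15 drawn from ranks 10 through 209.
ranked = list(range(300))
picked = sample_from_range(ranked, "10-210", 15, random.Random(0))
```

Skipping the very top ranks is a common way to reduce the chance of sampling unlabeled positives, while the upper bound keeps the negatives reasonably hard.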
I/O
Input: JSONL training data (with query, pos, and optionally existing neg fields).
Output: Augmented JSONL with hard negatives populated in the neg field.
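As a rough illustration of the record shape described above, the following sketch writes and reads a small JSONL file. It assumes that pos and neg hold lists of passage strings. The sample texts and the helpers `write_jsonl` and `read_jsonl` are hypothetical and are not part of the script.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical input records: the second one has no "neg" field yet.
examples = [
    {
        "query": "what is a hard negative?",
        "pos": ["A hard negative looks relevant to the query but is not."],
        "neg": ["Bananas are rich in potassium."],
    },
    {
        "query": "what does FAISS do?",
        "pos": ["FAISS is a library for similarity search over dense vectors."],
    },
]

path = os.path.join(tempfile.mkdtemp(), "train_data.jsonl")
write_jsonl(path, examples)
loaded = read_jsonl(path)
```

After mining, each output record would be expected to carry the same query and pos values, with neg filled in by the mined passages.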
Internal Behavior
- Loads the embedder via FlagAutoModel.from_finetuned()
- Encodes all corpus passages and queries into dense vectors
- Builds a FAISS IndexFlatIP index over corpus embeddings
- Searches the index to retrieve top-k nearest neighbors per query
- Samples negatives from the specified rank range (e.g., ranks 10 through 210)
- Filters out any passages that appear in the positive set
- Writes the augmented data to the output file
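The search, sample, and filter steps above can be sketched in plain Python. This is a toy stand-in, not the script's code. A brute-force inner-product ranking replaces the FAISS IndexFlatIP search, the vectors are hand-written instead of coming from an embedder, and the function names are hypothetical. The sketch filters positives out of the rank window before sampling, which is an assumption about the ordering of those two steps.

```python
import random
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_top_k(query_vec, corpus_vecs, k: int) -> List[int]:
    """Exact inner-product search: corpus indices sorted by score, best first."""
    scores = [inner_product(query_vec, vec) for vec in corpus_vecs]
    order = sorted(range(len(corpus_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def mine_hard_negatives(
    query_vec,
    positives: Sequence[str],
    corpus: Sequence[str],
    corpus_vecs,
    rank_range: Tuple[int, int],
    negative_number: int,
    rng: random.Random,
) -> List[str]:
    start, end = rank_range
    ranked = search_top_k(query_vec, corpus_vecs, end)
    window = ranked[start:end]
    # Drop any passage that is a labeled positive for this query.
    candidates = [corpus[i] for i in window if corpus[i] not in positives]
    if len(candidates) > negative_number:
        candidates = rng.sample(candidates, negative_number)
    return candidates


# Toy corpus with 2-D "embeddings"; p0 is the labeled positive.
corpus = ["p0", "p1", "p2", "p3", "p4", "p5"]
corpus_vecs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.1, 0.9], [0.0, 1.0]]
query_vec = [1.0, 0.0]

negatives = mine_hard_negatives(
    query_vec, ["p0"], corpus, corpus_vecs, (1, 4), 2, random.Random(0)
)
```

With a real embedder, normalized vectors make the inner product equal to cosine similarity, which is why a flat inner-product index is a natural fit for this kind of search.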