Overview
A concrete tool, provided by the RAGatouille library, for mining hard negative documents using dense embeddings and Voyager ANN search.
Description
The SimpleMiner class implements hard negative mining in three steps. First, __init__() loads a SentenceTransformer model suited to the target language. Next, build_index() encodes all documents and builds a Voyager approximate nearest-neighbor (ANN) index. Finally, mine_hard_negatives() queries the index for each training query and returns the documents ranked between min_rank and max_rank as hard negatives.
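The rank-window idea described above can be illustrated with a minimal sketch. It is not the library's implementation: it swaps the Voyager ANN index for an exact NumPy cosine search, uses toy 2-D vectors in place of SentenceTransformer embeddings, and treats the window as half-open (`[min_rank, max_rank)`), which is an assumption. The function name `mine_rank_window` is hypothetical.

```python
import numpy as np


def mine_rank_window(query_vec, doc_vecs, documents, min_rank, max_rank):
    """Return documents whose similarity rank falls in [min_rank, max_rank).

    Hypothetical helper: exact search stands in for the ANN index.
    """
    # Normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    # Rank every document from most to least similar to the query.
    order = np.argsort(-(d @ q), kind="stable")
    # Skip the top `min_rank` hits and keep everything up to `max_rank`.
    return [documents[i] for i in order[min_rank:max_rank]]


# Toy 2-D "embeddings" standing in for real model output (assumption).
docs = ["doc_a", "doc_b", "doc_c", "doc_d"]
doc_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]])
query_vec = np.array([1.0, 0.0])

hard_negatives = mine_rank_window(query_vec, doc_vecs, docs, min_rank=1, max_rank=3)
print(hard_negatives)  # ['doc_b', 'doc_c']
```

Skipping the closest hits is the point of the window: the nearest documents are the most likely to be true positives, while documents a little further down are similar enough to be informative negatives.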
The class supports:
- Multi-language models via the DenseModels enum (languages: en, zh, fr, other; sizes: small, base, large)
- Multi-process encoding for collections >1000 documents
- Cosine similarity-based ANN search via Voyager
- Configurable storage precision (Float32 default, E4M3 for >500k docs)
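The storage-precision rule in the last item can be sketched as a small decision function. This is an assumption about how force_fp32 and the 500k threshold interact, not code taken from the library; the function name `choose_storage_precision` and the `large_threshold` parameter are hypothetical, and the returned strings are chosen to match the precision names used in the list above.

```python
def choose_storage_precision(
    num_docs: int,
    force_fp32: bool = True,
    large_threshold: int = 500_000,  # hypothetical parameter for the ">500k docs" rule
) -> str:
    """Pick an index storage precision for a collection of `num_docs` documents."""
    # Float32 is kept whenever it is forced or the collection is not large.
    if force_fp32 or num_docs <= large_threshold:
        return "Float32"
    # Large collections fall back to the 8-bit E4M3 format to save memory.
    return "E4M3"


print(choose_storage_precision(10_000))                       # Float32
print(choose_storage_precision(1_000_000))                    # Float32 (forced by default)
print(choose_storage_precision(1_000_000, force_fp32=False))  # E4M3
```

Under this reading, a caller with a very large corpus would need to pass force_fp32=False to get the smaller index; whether the library behaves exactly this way should be checked against the source file.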
Usage
Used internally by RAGTrainer.prepare_training_data() when mine_hard_negatives=True. Can also be used standalone for custom negative mining pipelines.
Code Reference
Source Location
- Repository: RAGatouille
- File: ragatouille/negative_miners/simpleminer.py
- Lines: L28-163
Signature
class SimpleMiner(HardNegativeMiner):
    def __init__(
        self,
        language_code: str,
        model_size: Literal["small", "base", "large"] = "small",
    ) -> None:
        """
        Initialize the hard negative miner.

        Parameters:
            language_code: Target language ("en", "zh", "fr", or "other").
            model_size: Embedding model size ("small", "base", "large").
        """

    def build_index(
        self,
        collection: list,
        batch_size: int = 128,
        save_index: bool = False,
        save_path: Union[str, Path] = None,
        force_fp32: bool = True,
    ) -> None:
        """Build the ANN index over a document collection."""

    def mine_hard_negatives(
        self,
        queries: Union[list[str], str],
        collection: Optional[list[str]] = None,
        save_index: bool = False,
        save_path: Union[str, Path] = None,
        force_fp32: bool = True,
    ) -> Union[list[str], list[list[str]]]:
        """
        Mine hard negatives for queries.

        Returns:
            Hard negative document texts ranked between min_rank and max_rank.
        """
Import
from ragatouille.negative_miners import SimpleMiner
I/O Contract
Inputs (__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| language_code | str | Yes | Target language: "en", "zh", "fr", or "other" |
| model_size | Literal["small", "base", "large"] | No | Embedding model size (default "small") |
Inputs (build_index)
| Name | Type | Required | Description |
|---|---|---|---|
| collection | list | Yes | Documents to index for negative mining |
| batch_size | int | No | Encoding batch size (default 128) |
| save_index | bool | No | Whether to save the index to disk (default False) |
| save_path | Union[str, Path] | No | Path to save the index |
| force_fp32 | bool | No | Force Float32 storage (default True) |
Inputs (mine_hard_negatives)
| Name | Type | Required | Description |
|---|---|---|---|
| queries | Union[list[str], str] | Yes | Query string(s) to find hard negatives for |
| collection | Optional[list[str]] | No | Corpus (used if index not yet built) |
Outputs
| Name | Type | Description |
|---|---|---|
| build_index returns | None | Side effect: self.voyager_index built, self.corpus_map populated |
| mine_hard_negatives returns | Union[list[str], list[list[str]]] | Hard negative document texts (single list for a single query, list of lists for a batch) |
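Because mine_hard_negatives returns a flat list for a single query and a list of lists for a batch, downstream code may want a uniform shape. The helper below is a hypothetical convenience, not part of RAGatouille; it assumes only the return types listed in the table above.

```python
from typing import Union


def as_batches(negatives: Union[list[str], list[list[str]]]) -> list[list[str]]:
    """Wrap a single-query result so callers always get one list per query."""
    # A flat list of strings is the single-query shape: wrap it once.
    if negatives and isinstance(negatives[0], str):
        return [negatives]
    # Already one list per query (or empty): return a shallow copy unchanged.
    return list(negatives)


print(as_batches(["neg 1", "neg 2"]))      # [['neg 1', 'neg 2']]
print(as_batches([["neg 1"], ["neg 2"]]))  # [['neg 1'], ['neg 2']]
```

An empty result is ambiguous between the two shapes, so this sketch returns it as an empty list rather than guessing.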
Usage Examples
Basic Hard Negative Mining
from ragatouille.negative_miners import SimpleMiner

# Initialize miner with English small model
miner = SimpleMiner(language_code="en", model_size="small")

# Build index over document collection
documents = [
    "Python is a programming language.",
    "Java is compiled to bytecode.",
    "The weather is sunny today.",
    "Machine learning uses neural networks.",
]
miner.build_index(documents)

# Mine hard negatives for a query
negatives = miner.mine_hard_negatives("What programming languages exist?")
Batch Mining
queries = ["What is Python?", "How does Java work?"]
batch_negatives = miner.mine_hard_negatives(queries)
# Returns list of lists, one per query
Related Pages
Implements Principle
Requires Environment