Implementation:AnswerDotAI RAGatouille SimpleMiner Mine Hard Negatives

From Leeroopedia
Knowledge Sources
Domains: NLP, Information_Retrieval, Training, Negative_Sampling
Last Updated: 2026-02-12 12:00 GMT

Overview

A concrete tool, provided by the RAGatouille library, for mining hard negative documents using dense embeddings and Voyager ANN search.

Description

The SimpleMiner class implements hard negative mining in three steps. First, __init__() loads a SentenceTransformer model matched to the target language and model size. Next, build_index() encodes all documents and builds a Voyager approximate nearest-neighbor (ANN) index. Finally, mine_hard_negatives() queries the index for each training query and returns the documents ranked between min_rank and max_rank as hard negatives.
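
The rank window in the last step can be illustrated with a minimal, library-free sketch. The function name select_rank_window and the toy ranking are hypothetical, and treating the window as half-open (min_rank included, max_rank excluded) is an assumption rather than confirmed library behavior.

```python
def select_rank_window(ranked_docs, min_rank, max_rank):
    """Return documents ranked in [min_rank, max_rank) of a best-first list.

    Assumption: the window is half-open, like a Python slice.
    """
    if min_rank < 0 or max_rank < min_rank:
        raise ValueError("expected 0 <= min_rank <= max_rank")
    return ranked_docs[min_rank:max_rank]


# Hypothetical best-first search results for a single query.
ranked = [f"doc_{i}" for i in range(8)]
window = select_rank_window(ranked, min_rank=2, max_rank=5)
print(window)
```

Skipping the very top ranks is a common way to lower the chance that a mined "negative" is actually a relevant document, while staying close enough to the query that the negative remains hard.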

The class supports:

  • Multi-language models via the DenseModels enum (languages: en, zh, fr, other; sizes: small, base, large)
  • Multi-process encoding for collections of more than 1,000 documents
  • Cosine similarity-based ANN search via Voyager
  • Configurable storage precision (Float32 by default; E4M3 for collections of more than 500k documents when force_fp32 is disabled)
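
To make the cosine-similarity ranking concrete, here is a small brute-force sketch in plain Python. It is not the Voyager implementation: an ANN index approximates this exact ranking to avoid comparing the query against every document. The function names and the 2-D toy vectors are hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy 2-D vectors standing in for real document embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = rank_by_cosine([1.0, 0.1], doc_vecs)
print(order)
```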

Usage

SimpleMiner is used internally by RAGTrainer.prepare_training_data() when mine_hard_negatives=True. It can also be used standalone in custom negative mining pipelines.

Code Reference

Source Location

  • Repository: RAGatouille
  • File: ragatouille/negative_miners/simpleminer.py
  • Lines: L28-163

Signature

class SimpleMiner(HardNegativeMiner):
    def __init__(
        self,
        language_code: str,
        model_size: Literal["small", "base", "large"] = "small",
    ) -> None:
        """
        Initialize the hard negative miner.

        Parameters:
            language_code: Target language ("en", "zh", "fr", or "other").
            model_size: Embedding model size ("small", "base", "large").
        """

    def build_index(
        self,
        collection: list,
        batch_size: int = 128,
        save_index: bool = False,
        save_path: Union[str, Path] = None,
        force_fp32: bool = True,
    ) -> None:
        """Build the ANN index over a document collection."""

    def mine_hard_negatives(
        self,
        queries: Union[list[str], str],
        collection: Optional[list[str]] = None,
        save_index: bool = False,
        save_path: Union[str, Path] = None,
        force_fp32: bool = True,
    ) -> Union[list[str], list[list[str]]]:
        """
        Mine hard negatives for queries.

        Returns:
            Hard negative document texts ranked between min_rank and max_rank.
        """

Import

from ragatouille.negative_miners import SimpleMiner

I/O Contract

Inputs (__init__)

Name | Type | Required | Description
language_code | str | Yes | Target language: "en", "zh", "fr", or "other"
model_size | Literal["small", "base", "large"] | No | Embedding model size (default "small")

Inputs (build_index)

Name | Type | Required | Description
collection | list | Yes | Documents to index for negative mining
batch_size | int | No | Encoding batch size (default 128)
save_index | bool | No | Whether to save the index to disk (default False)
save_path | Union[str, Path] | No | Path to save the index
force_fp32 | bool | No | Force Float32 storage (default True)
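
The size thresholds mentioned in the Description can be summarized in a short decision sketch. The helper plan_indexing is hypothetical and does not exist in RAGatouille. It assumes that force_fp32=True always overrides the size-based switch to E4M3, which is inferred from the parameter description rather than confirmed here.

```python
def plan_indexing(num_docs: int, force_fp32: bool = True) -> dict:
    """Pick an encoding mode and storage precision from the collection size.

    Hypothetical helper; thresholds are taken from the Description section.
    Assumption: force_fp32=True always wins over the size-based E4M3 choice.
    """
    use_multi_process = num_docs > 1000
    if num_docs > 500_000 and not force_fp32:
        storage = "E4M3"
    else:
        storage = "Float32"
    return {"multi_process": use_multi_process, "storage": storage}


small = plan_indexing(200)
large = plan_indexing(600_000, force_fp32=False)
print(small, large)
```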

Inputs (mine_hard_negatives)

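Name | Type | Required | Description
queries | Union[list[str], str] | Yes | Query string(s) to find hard negatives for
collection | Optional[list[str]] | No | Corpus (used if the index has not been built yet)
save_index | bool | No | Same meaning as in build_index (default False)
save_path | Union[str, Path] | No | Same meaning as in build_index
force_fp32 | bool | No | Same meaning as in build_index (default True)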

Outputs

Name | Type | Description
build_index returns | None | Side effect: self.voyager_index is built and self.corpus_map is populated
mine_hard_negatives returns | Union[list[str], list[list[str]]] | Hard negative document texts (a single list for a single query, a list of lists for a batch)
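
The two return shapes can be illustrated with a small dispatch sketch. It only mirrors the documented contract (a str in gives one list, a list in gives a list of lists). The names mine_with_shape and fake_mine_one are hypothetical, and the stub stands in for a real index lookup.

```python
from typing import Callable, List, Union


def mine_with_shape(
    queries: Union[str, List[str]],
    mine_one: Callable[[str], List[str]],
) -> Union[List[str], List[List[str]]]:
    """Return one list for a single query, or one list per query for a batch."""
    if isinstance(queries, str):
        return mine_one(queries)
    return [mine_one(q) for q in queries]


def fake_mine_one(query: str) -> List[str]:
    """Stub miner: a real one would query the ANN index instead."""
    return [f"negative for: {query}"]


single = mine_with_shape("What is Python?", fake_mine_one)
batch = mine_with_shape(["What is Python?", "How does Java work?"], fake_mine_one)
print(single, batch)
```

Callers that accept either input form may want to normalize the result (for example, wrapping the single-query list in an outer list) before further processing.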

Usage Examples

Basic Hard Negative Mining

from ragatouille.negative_miners import SimpleMiner

# Initialize miner with English small model
miner = SimpleMiner(language_code="en", model_size="small")

# Build index over document collection
documents = [
    "Python is a programming language.",
    "Java is compiled to bytecode.",
    "The weather is sunny today.",
    "Machine learning uses neural networks.",
]
miner.build_index(documents)

# Mine hard negatives for a query
negatives = miner.mine_hard_negatives("What programming languages exist?")

Batch Mining

queries = ["What is Python?", "How does Java work?"]
batch_negatives = miner.mine_hard_negatives(queries)
# Returns list of lists, one per query
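
In a standalone pipeline, the per-query lists are often combined with known positives to form training triplets. The sketch below is one possible way to do that. The helper build_triplets and the sample positives and negatives are hypothetical and are not part of RAGatouille.

```python
def build_triplets(queries, positives, batch_negatives):
    """Pair each query with its positive and each of its mined negatives.

    Negatives identical to the query's positive are skipped, since a mined
    "negative" can occasionally be the relevant document itself.
    """
    triplets = []
    for query, positive, negatives in zip(queries, positives, batch_negatives):
        for negative in negatives:
            if negative == positive:
                continue
            triplets.append((query, positive, negative))
    return triplets


# Hypothetical inputs; batch_negatives has the list-of-lists shape shown above.
example_queries = ["What is Python?", "How does Java work?"]
example_positives = ["Python is a programming language.", "Java is compiled to bytecode."]
example_negatives = [
    ["Java is compiled to bytecode.", "Python is a programming language."],
    ["Machine learning uses neural networks."],
]
triplets = build_triplets(example_queries, example_positives, example_negatives)
print(len(triplets))
```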
