Implementation:AnswerDotAI RAGatouille RAGTrainer Prepare Training Data

Knowledge Sources	RAGatouille, RAGatouille Docs
Domains	NLP, Information_Retrieval, Training, Data_Processing
Last Updated	2026-02-12 12:00 GMT

Overview

Concrete tool, provided by the RAGatouille library, for converting raw training data into ColBERT-format training files, with optional hard negative mining.

Description

The RAGTrainer.prepare_training_data() method orchestrates the full training data pipeline. It auto-detects the input data format (pairs, triplets, or labeled_pairs) from the first sample, optionally initializes a SimpleMiner for hard negative mining, creates a TrainingDataProcessor, and delegates the conversion. The processor generates training triplets and exports them as ColBERT-format files. If hard negative mining produces no triplets, it automatically falls back to random negative sampling.
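The format auto-detection described above can be sketched as follows. This is a minimal illustration, not the library's code: the helper name detect_data_format is hypothetical, and it assumes the format is inferred from the length of the first sample together with the pairs_with_labels flag.

```python
from typing import Sequence


def detect_data_format(raw_data: Sequence[Sequence], pairs_with_labels: bool = False) -> str:
    """Hypothetical helper: infer the input format from the first sample.

    Assumption: a 2-item sample is a (query, positive) pair, and a 3-item
    sample is either a triplet or a labeled pair depending on the flag.
    """
    if not raw_data:
        raise ValueError("raw_data must contain at least one sample")

    sample_length = len(raw_data[0])
    if sample_length == 2:
        return "pairs"
    if sample_length == 3:
        # The two 3-item formats cannot be told apart by length alone,
        # so the caller's flag decides.
        return "labeled_pairs" if pairs_with_labels else "triplets"
    raise ValueError("Each sample must have 2 or 3 items")
```

Because only the first sample is inspected, mixing formats within one raw_data list would go unnoticed in a scheme like this, so it is safest to keep every sample in the same shape.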

The delegation chain:

RAGTrainer.prepare_training_data() --> detects format, initializes miner, creates processor
SimpleMiner.__init__() + build_index() --> loads dense model, builds Voyager ANN index
TrainingDataProcessor.process_raw_data() --> converts to triplets, mines negatives, exports files
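
The last step of the chain, including the fallback to random negatives, can be pictured with the sketch below. It is an illustration under stated assumptions rather than the processor's actual code: build_triplets and mine_fn are hypothetical names, and the miner is modeled as any callable that returns candidate negatives for a (query, positive) pair.

```python
import random
from typing import Callable, Optional


def build_triplets(
    pairs: list[tuple[str, str]],
    corpus: list[str],
    mine_fn: Optional[Callable[[str, str], list[str]]] = None,
    num_negatives: int = 2,
    seed: int = 0,
) -> list[tuple[str, str, str]]:
    """Hypothetical sketch: mine hard negatives, else sample random ones."""
    triplets = []

    # First attempt: use the (hypothetical) hard negative miner if one is given.
    if mine_fn is not None:
        for query, positive in pairs:
            for negative in mine_fn(query, positive):
                triplets.append((query, positive, negative))

    # Fallback: mining was disabled or produced nothing, so draw random
    # negatives from the corpus, excluding the positive passage itself.
    if not triplets:
        rng = random.Random(seed)
        for query, positive in pairs:
            candidates = [doc for doc in corpus if doc != positive]
            k = min(num_negatives, len(candidates))
            for negative in rng.sample(candidates, k):
                triplets.append((query, positive, negative))

    return triplets
```

Random negatives are usually easier for the model to reject than mined ones, so a fallback like this keeps training possible but may yield a weaker training signal.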

Usage

Call this method after initializing a RAGTrainer and before calling train(). The raw_data argument accepts three input formats: pairs, triplets, or labeled pairs.

Code Reference

Source Location

Repository: RAGatouille
File: ragatouille/RAGTrainer.py
Lines: L68-179

Signature

def prepare_training_data(
    self,
    raw_data: Union[list[tuple], list[list]],
    all_documents: Optional[list[str]] = None,
    data_out_path: Union[str, Path] = "./data/",
    num_new_negatives: int = 10,
    hard_negative_minimum_rank: int = 10,
    mine_hard_negatives: bool = True,
    hard_negative_model_size: str = "small",
    pairs_with_labels: bool = False,
    positive_label: Union[int, str] = 1,
    negative_label: Union[int, str] = 0,
) -> str:
    """
    Pre-process raw training data into ColBERT-ready files and triplets.

    Parameters:
        raw_data: List of pairs, annotated pairs, or triplets.
        all_documents: Optional corpus for negative sampling.
        data_out_path: Export directory path.
        num_new_negatives: Negatives per query (default 10).
        hard_negative_minimum_rank: Min rank for hard negatives (default 10).
        mine_hard_negatives: Use dense model for hard negatives (default True).
        hard_negative_model_size: "small", "base", or "large".
        pairs_with_labels: Whether data has labels.
        positive_label: Label for positive pairs (default 1).
        negative_label: Label for negative pairs (default 0).

    Returns:
        str: Path to exported training data directory.
    """

Import

from ragatouille import RAGTrainer

I/O Contract

Inputs

Name	Type	Required	Description
raw_data	Union[list[tuple], list[list]]	Yes	Training data as pairs (query, positive), triplets (query, positive, negative), or labeled pairs (query, passage, label)
all_documents	Optional[list[str]]	No	Full document corpus for negative sampling
data_out_path	Union[str, Path]	No	Export directory (default "./data/")
num_new_negatives	int	No	Number of negatives per query (default 10)
hard_negative_minimum_rank	int	No	Minimum rank for hard negatives (default 10)
mine_hard_negatives	bool	No	Use dense model for hard negatives (default True)
hard_negative_model_size	str	No	Embedding model size: "small", "base", "large" (default "small")
pairs_with_labels	bool	No	Whether raw_data contains labels (default False)
positive_label	Union[int, str]	No	Label value for positive pairs (default 1)
negative_label	Union[int, str]	No	Label value for negative pairs (default 0)
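
To make the interaction between hard_negative_minimum_rank and num_new_negatives concrete, here is a small sketch. It assumes the miner retrieves a ranked list of passages for a query, skips everything ranked above the minimum rank (those are the likeliest unlabeled positives), and samples negatives from the remainder. The function name pick_hard_negatives and the sampling strategy are assumptions, not the documented behavior of SimpleMiner.

```python
import random


def pick_hard_negatives(
    ranked_docs: list[str],
    positives: set[str],
    min_rank: int = 10,
    num_negatives: int = 10,
    seed: int = 0,
) -> list[str]:
    """Hypothetical sketch of rank-thresholded hard negative selection."""
    # Drop the top `min_rank` results and any known positive passage.
    candidates = [doc for doc in ranked_docs[min_rank:] if doc not in positives]

    # Sample without replacement, never asking for more than is available.
    rng = random.Random(seed)
    k = min(num_negatives, len(candidates))
    return rng.sample(candidates, k)
```

Under this reading, a higher minimum rank lowers the risk of treating a relevant passage as a negative, at the cost of somewhat easier negatives.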

Outputs

Name	Type	Description
return	str	Path to the export directory containing: triples.train.colbert.jsonl, queries.train.colbert.tsv, corpus.train.colbert.tsv
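
For inspecting the exported directory, a loader along the following lines may be useful. The file layouts are assumptions, not something this page confirms: the two .tsv files are taken to hold one "id<TAB>text" record per line, and the .jsonl file one JSON list of IDs per line. The function name load_colbert_training_files is hypothetical.

```python
import json
from pathlib import Path


def load_colbert_training_files(data_dir):
    """Hypothetical loader for the three exported files (layouts assumed)."""
    data_dir = Path(data_dir)

    def read_tsv(path):
        # Assumed layout: "<id>\t<text>" on every non-empty line.
        mapping = {}
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                key, text = line.split("\t", 1)
                mapping[key] = text
        return mapping

    queries = read_tsv(data_dir / "queries.train.colbert.tsv")
    corpus = read_tsv(data_dir / "corpus.train.colbert.tsv")

    # Assumed layout: one JSON list of IDs per line.
    with open(data_dir / "triples.train.colbert.jsonl", encoding="utf-8") as handle:
        triples = [json.loads(line) for line in handle if line.strip()]

    return queries, corpus, triples
```

Passing the path returned by prepare_training_data() to such a loader gives a quick sanity check on how many queries, passages, and triples were written.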

Usage Examples

Prepare from Query-Document Pairs

from ragatouille import RAGTrainer

trainer = RAGTrainer(
    model_name="my_model",
    pretrained_model_name="colbert-ir/colbertv2.0",
)

# Pairs format: (query, relevant_passage)
pairs = [
    ("What is Python?", "Python is a programming language."),
    ("What is Java?", "Java is an object-oriented language."),
    ("What is Rust?", "Rust is a systems programming language."),
]

trainer.prepare_training_data(
    raw_data=pairs,
    num_new_negatives=10,
    mine_hard_negatives=True,
)

Prepare from Triplets

# Reuses the trainer created in the pairs example above.
# Triplets format: (query, positive, negative)
triplets = [
    ("What is Python?", "Python is a programming language.", "Java is compiled."),
    ("What is NLP?", "NLP processes human language.", "CPU is a processor."),
]

trainer.prepare_training_data(
    raw_data=triplets,
    num_new_negatives=5,
    mine_hard_negatives=False,  # Already have negatives
)

Prepare from Labeled Pairs

# Reuses the trainer created in the pairs example above.
# Labeled pairs: (query, passage, label)
labeled = [
    ("What is ML?", "Machine learning is a branch of AI.", 1),
    ("What is ML?", "The weather is sunny today.", 0),
]

trainer.prepare_training_data(
    raw_data=labeled,
    pairs_with_labels=True,
    positive_label=1,
    negative_label=0,
)
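
As a rough picture of how labeled pairs relate to triplets, the sketch below groups passages by query and crosses each positive with each negative for the same query. It is a simplified illustration: labeled_pairs_to_triplets is a hypothetical name, and the library's own conversion may differ, for example by also adding mined negatives.

```python
from collections import defaultdict


def labeled_pairs_to_triplets(labeled, positive_label=1, negative_label=0):
    """Hypothetical sketch: turn (query, passage, label) rows into triplets."""
    positives = defaultdict(list)
    negatives = defaultdict(list)

    # Bucket every passage under its query according to its label.
    for query, passage, label in labeled:
        if label == positive_label:
            positives[query].append(passage)
        elif label == negative_label:
            negatives[query].append(passage)

    # Pair each positive with each negative that shares the same query.
    triplets = []
    for query, positive_passages in positives.items():
        for positive in positive_passages:
            for negative in negatives.get(query, []):
                triplets.append((query, positive, negative))
    return triplets
```

Queries that have positives but no labeled negatives produce no triplets in this sketch, which is the kind of gap that negative mining or sampling is meant to fill.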

Related Pages

Implements Principle

Principle:AnswerDotAI_RAGatouille_Training_Data_Preparation

Requires Environment

Environment:AnswerDotAI_RAGatouille_Python_ColBERT_Dependencies
