Implementation:AnswerDotAI RAGatouille RAGTrainer Prepare Training Data
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Training, Data_Processing |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Concrete tool, provided by the RAGatouille library, for converting raw training data into ColBERT-format training files, with optional hard negative mining.
Description
The RAGTrainer.prepare_training_data() method orchestrates the full training data pipeline. It auto-detects the input data format (pairs, triplets, or labeled_pairs) from the first sample, optionally initializes a SimpleMiner for hard negative mining, creates a TrainingDataProcessor, and delegates the conversion. The processor generates training triplets and exports them as ColBERT-format files. If hard negative mining produces no triplets, it automatically falls back to random negative sampling.
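The format detection step can be pictured with a small standalone function. This is a minimal sketch, not the library's code: it assumes detection relies on the length of the first sample, and that a three-element sample is treated as a labeled pair only when pairs_with_labels is set. The name detect_data_type is hypothetical.

```python
from typing import Sequence, Union


def detect_data_type(
    raw_data: Sequence[Sequence],
    pairs_with_labels: bool = False,
    positive_label: Union[int, str] = 1,
    negative_label: Union[int, str] = 0,
) -> str:
    """Guess the input format from the first sample only (hypothetical helper)."""
    if not raw_data:
        raise ValueError("raw_data is empty")
    sample = raw_data[0]
    if len(sample) == 2:
        # (query, positive)
        return "pairs"
    if len(sample) == 3:
        if not pairs_with_labels:
            # (query, positive, negative)
            return "triplets"
        # (query, passage, label): the label must be one of the two known values.
        if sample[-1] not in (positive_label, negative_label):
            raise ValueError(f"Unexpected label in first sample: {sample[-1]!r}")
        return "labeled_pairs"
    raise ValueError("Each sample must have 2 or 3 elements.")
```

Because only the first sample is inspected, a dataset that mixes formats would not be caught by a check like this, so it is worth keeping raw_data homogeneous.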
The delegation chain:
- RAGTrainer.prepare_training_data() --> detects format, initializes miner, creates processor
- SimpleMiner.__init__() + build_index() --> loads dense model, builds Voyager ANN index
- TrainingDataProcessor.process_raw_data() --> converts to triplets, mines negatives, exports files
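The fallback from hard negative mining to random negative sampling can be sketched in plain Python. This is an illustration under stated assumptions, not the TrainingDataProcessor implementation: build_triplets and its mine argument (any callable that returns candidate negatives for a query) are hypothetical names introduced here.

```python
import random
from typing import Callable, Optional


def build_triplets(
    pairs: list[tuple[str, str]],
    corpus: list[str],
    num_new_negatives: int = 10,
    mine: Optional[Callable[[str], list[str]]] = None,
    seed: int = 0,
) -> list[tuple[str, str, str]]:
    """Expand (query, positive) pairs into (query, positive, negative) triplets."""
    rng = random.Random(seed)

    # Collect every known positive per query so it is never used as a negative.
    positives: dict[str, set[str]] = {}
    for query, positive in pairs:
        positives.setdefault(query, set()).add(positive)

    def random_negatives(query: str) -> list[str]:
        candidates = [doc for doc in corpus if doc not in positives[query]]
        return rng.sample(candidates, min(num_new_negatives, len(candidates)))

    def expand(pick_negatives: Callable[[str], list[str]]) -> list[tuple[str, str, str]]:
        return [
            (query, positive, negative)
            for query, positive in pairs
            for negative in pick_negatives(query)
            if negative not in positives[query]
        ]

    triplets = expand(mine) if mine is not None else []
    if not triplets:
        # Mining was disabled or yielded nothing: fall back to random negatives.
        triplets = expand(random_negatives)
    return triplets
```

With a tiny corpus, the random path can return fewer than num_new_negatives negatives per query, since it can only sample from documents that are not positives for that query.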
Usage
Call this method after initializing a RAGTrainer and before calling train(). It accepts three input formats: pairs, triplets, and labeled pairs.
Code Reference
Source Location
- Repository: RAGatouille
- File: ragatouille/RAGTrainer.py
- Lines: L68-179
Signature
def prepare_training_data(
    self,
    raw_data: Union[list[tuple], list[list]],
    all_documents: Optional[list[str]] = None,
    data_out_path: Union[str, Path] = "./data/",
    num_new_negatives: int = 10,
    hard_negative_minimum_rank: int = 10,
    mine_hard_negatives: bool = True,
    hard_negative_model_size: str = "small",
    pairs_with_labels: bool = False,
    positive_label: Union[int, str] = 1,
    negative_label: Union[int, str] = 0,
) -> str:
    """
    Pre-process raw training data into ColBERT-ready files and triplets.

    Parameters:
        raw_data: List of pairs, annotated pairs, or triplets.
        all_documents: Optional corpus for negative sampling.
        data_out_path: Export directory path.
        num_new_negatives: Negatives per query (default 10).
        hard_negative_minimum_rank: Min rank for hard negatives (default 10).
        mine_hard_negatives: Use dense model for hard negatives (default True).
        hard_negative_model_size: "small", "base", or "large".
        pairs_with_labels: Whether data has labels.
        positive_label: Label for positive pairs (default 1).
        negative_label: Label for negative pairs (default 0).

    Returns:
        str: Path to exported training data directory.
    """
Import
from ragatouille import RAGTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| raw_data | Union[list[tuple], list[list]] | Yes | Training data as pairs (query, positive), triplets (query, positive, negative), or labeled pairs (query, passage, label) |
| all_documents | Optional[list[str]] | No | Full document corpus for negative sampling |
| data_out_path | Union[str, Path] | No | Export directory (default "./data/") |
| num_new_negatives | int | No | Number of negatives per query (default 10) |
| hard_negative_minimum_rank | int | No | Minimum rank for hard negatives (default 10) |
| mine_hard_negatives | bool | No | Use dense model for hard negatives (default True) |
| hard_negative_model_size | str | No | Embedding model size: "small", "base", "large" (default "small") |
| pairs_with_labels | bool | No | Whether raw_data contains labels (default False) |
| positive_label | Union[int, str] | No | Label value for positive pairs (default 1) |
| negative_label | Union[int, str] | No | Label value for negative pairs (default 0) |
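To make the interaction between num_new_negatives and hard_negative_minimum_rank more concrete: a common approach is to skip the top-ranked neighbours of a query, which are the most likely to be unlabeled positives, and to sample negatives from lower ranks. The sketch below assumes that behaviour. The function pick_hard_negatives and the ranked_docs list are hypothetical, and the actual SimpleMiner logic may differ.

```python
import random


def pick_hard_negatives(
    ranked_docs: list[str],
    positives: set[str],
    num_new_negatives: int = 10,
    minimum_rank: int = 10,
    seed: int = 0,
) -> list[str]:
    """Sample negatives from retrieval results, skipping the top `minimum_rank` hits."""
    # ranked_docs is ordered from most to least similar to the query.
    candidates = [doc for doc in ranked_docs[minimum_rank:] if doc not in positives]
    rng = random.Random(seed)
    # Never ask for more samples than there are candidates.
    return rng.sample(candidates, min(num_new_negatives, len(candidates)))
```

Under this reading, a higher minimum rank trades difficulty for safety: the negatives are less similar to the query but also less likely to be false negatives.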
Outputs
| Name | Type | Description |
|---|---|---|
| return | str | Path to the export directory containing: triples.train.colbert.jsonl, queries.train.colbert.tsv, corpus.train.colbert.tsv |
Usage Examples
Prepare from Query-Document Pairs
from ragatouille import RAGTrainer

trainer = RAGTrainer(
    model_name="my_model",
    pretrained_model_name="colbert-ir/colbertv2.0",
)

# Pairs format: (query, relevant_passage)
pairs = [
    ("What is Python?", "Python is a programming language."),
    ("What is Java?", "Java is an object-oriented language."),
    ("What is Rust?", "Rust is a systems programming language."),
]

trainer.prepare_training_data(
    raw_data=pairs,
    num_new_negatives=10,
    mine_hard_negatives=True,
)
Prepare from Triplets
# Triplets format: (query, positive, negative)
triplets = [
    ("What is Python?", "Python is a programming language.", "Java is compiled."),
    ("What is NLP?", "NLP processes human language.", "CPU is a processor."),
]

trainer.prepare_training_data(
    raw_data=triplets,
    num_new_negatives=5,
    mine_hard_negatives=False,  # Already have negatives
)
Prepare from Labeled Pairs
# Labeled pairs: (query, passage, label)
labeled = [
    ("What is ML?", "Machine learning is a branch of AI.", 1),
    ("What is ML?", "The weather is sunny today.", 0),
]

trainer.prepare_training_data(
    raw_data=labeled,
    pairs_with_labels=True,
    positive_label=1,
    negative_label=0,
)
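One plausible way labeled pairs become triplets is to group passages by query and pair each positive passage with each negative passage for the same query. The sketch below assumes that approach. The name labeled_pairs_to_triplets is hypothetical, and this is not necessarily how TrainingDataProcessor performs the conversion.

```python
from collections import defaultdict
from typing import Union


def labeled_pairs_to_triplets(
    labeled: list[tuple[str, str, Union[int, str]]],
    positive_label: Union[int, str] = 1,
    negative_label: Union[int, str] = 0,
) -> list[tuple[str, str, str]]:
    """Turn (query, passage, label) rows into (query, positive, negative) triplets."""
    positives: dict[str, list[str]] = defaultdict(list)
    negatives: dict[str, list[str]] = defaultdict(list)
    for query, passage, label in labeled:
        if label == positive_label:
            positives[query].append(passage)
        elif label == negative_label:
            negatives[query].append(passage)
        else:
            raise ValueError(f"Unexpected label: {label!r}")
    # Cross every positive with every negative that shares its query.
    return [
        (query, pos, neg)
        for query, pos_list in positives.items()
        for pos in pos_list
        for neg in negatives.get(query, [])
    ]
```

In this sketch, a query with positives but no labeled negatives yields no triplets, which is the kind of case where mined or randomly sampled negatives would be needed.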