Principle: AnswerDotAI RAGatouille Training Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Training, Data_Processing |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
A data processing pipeline that converts raw query-document pairs, triplets, or labeled pairs into ColBERT-compatible training files with optional hard negative mining.
Description
Training Data Preparation transforms user-provided training data into the format required by the ColBERT trainer. It accepts three input formats: unlabeled pairs (query, positive_passage), triplets (query, positive, negative), and labeled pairs (query, passage, label). The pipeline normalizes all formats into training triplets (query_id, positive_id, negative_id), optionally augments negatives via hard negative mining using dense embeddings, and exports the data as ColBERT-format files (triples JSONL, queries TSV, corpus TSV).
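The normalization step can be sketched as follows. This is a minimal illustration, not RAGatouille's actual implementation; the sample key names (`query`, `positive`, `negative`, `passage`, `label`, `positive_passage`) and the helper names are assumptions chosen to mirror the three formats described above:

```python
def detect_format(sample: dict) -> str:
    # Hypothetical key-based detection of the three supported input formats.
    if "label" in sample:
        return "labeled_pairs"
    if "negative" in sample:
        return "triplets"
    return "pairs"

def normalize(samples):
    """Normalize any input format into (query, positive, negative-or-None) rows.

    Rows with a None negative are later filled in by hard negative mining
    or random negative sampling.
    """
    rows = []
    for s in samples:
        fmt = detect_format(s)
        if fmt == "triplets":
            rows.append((s["query"], s["positive"], s["negative"]))
        elif fmt == "labeled_pairs":
            # Assumed convention: label 1 marks a relevant passage.
            if s["label"] == 1:
                rows.append((s["query"], s["passage"], None))
        else:  # unlabeled pairs
            rows.append((s["query"], s["positive_passage"], None))
    return rows
```

In practice the format would be detected once from a small sample rather than per-row, but the per-row version keeps the sketch self-contained.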
Key processing steps:
- Format detection from sample data (pairs vs triplets vs labeled_pairs)
- Query and document extraction and deduplication
- Optional hard negative mining via SimpleMiner using dense embedding models
- Fallback to random negative sampling if hard negatives fail
- Triplet generation with configurable negatives-per-query
- Export to ColBERT-format files: triples.train.colbert.jsonl, queries.train.colbert.tsv, corpus.train.colbert.tsv
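The final export step can be sketched like this. The file names come from the list above; the function name and the id-to-text dictionary shapes are assumptions for illustration, and the JSONL rows are assumed to be `[query_id, positive_id, negative_id]` triples:

```python
import json
import os

def export_colbert_files(triplets, queries, corpus, out_dir="."):
    """Write ColBERT-format training files.

    triplets: list of (query_id, positive_id, negative_id) tuples
    queries:  dict mapping query_id -> query text
    corpus:   dict mapping passage_id -> passage text
    """
    # One JSON array per line: [query_id, positive_id, negative_id]
    with open(os.path.join(out_dir, "triples.train.colbert.jsonl"), "w") as f:
        for qid, pid, nid in triplets:
            f.write(json.dumps([qid, pid, nid]) + "\n")
    # Tab-separated id/text files for queries and the deduplicated corpus
    with open(os.path.join(out_dir, "queries.train.colbert.tsv"), "w") as f:
        for qid, text in queries.items():
            f.write(f"{qid}\t{text}\n")
    with open(os.path.join(out_dir, "corpus.train.colbert.tsv"), "w") as f:
        for pid, text in corpus.items():
            f.write(f"{pid}\t{text}\n")
```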
Usage
Use this principle after initializing a trainer and before launching training. This is required when:
- You have query-document relevance pairs and need to generate training triplets
- You want to augment training data with hard negatives for better model performance
- You need to convert various data formats into ColBERT's expected training format
Theoretical Basis
Effective contrastive learning for retrieval requires informative negative examples. Hard negatives — documents that are superficially similar to the query but not relevant — provide stronger training signal than random negatives:
Hard Negative Mining:
- Encode all documents with a dense embedding model (e.g., BGE, E5)
- Build an approximate nearest-neighbor index (Voyager)
- For each query, retrieve top-k similar documents
- Select documents ranked between min_rank (10) and max_rank (~110) as hard negatives
- These are similar enough to be challenging but not so similar that they risk being false negatives (i.e., relevant documents that simply lack a positive label)
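The rank-window selection can be sketched without the ANN index. Here a brute-force sort of `(doc_id, similarity)` pairs stands in for Voyager retrieval; the function name is hypothetical, and the `min_rank`/`max_rank` defaults mirror the values above:

```python
def mine_hard_negatives(sims, min_rank=10, max_rank=110, k=10):
    """Select hard negatives from a similarity ranking.

    sims: list of (doc_id, similarity) pairs for one query, as produced by a
          dense retriever (an ANN index like Voyager in the real pipeline).
    Documents ranked in [min_rank, max_rank) are kept: the very top ranks are
    skipped because they may be true positives, and everything past max_rank
    is too dissimilar to provide useful training signal.
    """
    ranked = sorted(sims, key=lambda pair: -pair[1])  # most similar first
    window = ranked[min_rank:max_rank]
    return [doc_id for doc_id, _ in window[:k]]
```

If the window yields nothing (e.g., a tiny corpus), the pipeline falls back to random negative sampling as noted above.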
Triplet Formation: For each query q with positive documents P and negative documents N:
- Generate up to 20 triplets per query
- Distribute negatives across positives evenly
- Shuffle with fixed seed for reproducibility
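The three steps above can be sketched in a few lines. This is an assumed implementation, not RAGatouille's own code; the round-robin assignment is one simple way to distribute negatives evenly across positives:

```python
import random

def make_triplets(query_id, positives, negatives, max_triplets=20, seed=42):
    """Form (query_id, positive_id, negative_id) triplets for one query.

    Negatives are assigned to positives round-robin so each positive is used
    evenly; max_triplets caps output per query (20 above); a fixed-seed
    shuffle keeps the result reproducible across runs.
    """
    triplets = [
        (query_id, positives[i % len(positives)], neg)
        for i, neg in enumerate(negatives[:max_triplets])
    ]
    random.Random(seed).shuffle(triplets)
    return triplets
```

For example, with 2 positives and 5 negatives, the first positive receives 3 negatives and the second receives 2, and the same seed always yields the same ordering.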