Principle: AnswerDotAI RAGatouille Training Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Training, Data_Processing |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
A data processing pipeline that converts raw query-document pairs, triplets, or labeled pairs into ColBERT-compatible training files with optional hard negative mining.
Description
Training Data Preparation transforms user-provided training data into the format required by the ColBERT trainer. It accepts three input formats: unlabeled pairs (query, positive_passage), triplets (query, positive, negative), and labeled pairs (query, passage, label). The pipeline normalizes all formats into training triplets (query_id, positive_id, negative_id), optionally augments negatives via hard negative mining using dense embeddings, and exports the data as ColBERT-format files (triples JSONL, queries TSV, corpus TSV).
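The normalization step can be sketched as follows. This is a minimal illustration, not RAGatouille's actual implementation; the sample key names (`query`, `positive`, `negative`, `passage`, `label`, `positive_passage`) and the helper names are assumptions chosen to mirror the three formats described above:

```python
def detect_format(sample: dict) -> str:
    # Hypothetical key-based detection of the three supported input formats.
    if "label" in sample:
        return "labeled_pairs"
    if "negative" in sample:
        return "triplets"
    return "pairs"

def normalize(samples):
    """Normalize any input format into (query, positive, negative-or-None) rows.

    Rows with a None negative are later filled in by hard negative mining
    or random negative sampling.
    """
    rows = []
    for s in samples:
        fmt = detect_format(s)
        if fmt == "triplets":
            rows.append((s["query"], s["positive"], s["negative"]))
        elif fmt == "labeled_pairs":
            # Assumed convention: label 1 marks a relevant passage.
            if s["label"] == 1:
                rows.append((s["query"], s["passage"], None))
        else:  # unlabeled pairs
            rows.append((s["query"], s["positive_passage"], None))
    return rows
```

In practice the format would be detected once from a small sample rather than per-row, but the per-row version keeps the sketch self-contained.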
Key processing steps:
- Format detection from sample data (pairs vs triplets vs labeled_pairs)
- Query and document extraction and deduplication
- Optional hard negative mining via SimpleMiner using dense embedding models
- Fallback to random negative sampling if hard negatives fail
- Triplet generation with configurable negatives-per-query
- Export to ColBERT-format files: triples.train.colbert.jsonl, queries.train.colbert.tsv, corpus.train.colbert.tsv
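The final export step can be sketched like this. The file names come from the list above; the function name and the id-to-text dictionary shapes are assumptions for illustration, and the JSONL rows are assumed to be `[query_id, positive_id, negative_id]` triples:

```python
import json
import os

def export_colbert_files(triplets, queries, corpus, out_dir="."):
    """Write ColBERT-format training files.

    triplets: list of (query_id, positive_id, negative_id) tuples
    queries:  dict mapping query_id -> query text
    corpus:   dict mapping passage_id -> passage text
    """
    # One JSON array per line: [query_id, positive_id, negative_id]
    with open(os.path.join(out_dir, "triples.train.colbert.jsonl"), "w") as f:
        for qid, pid, nid in triplets:
            f.write(json.dumps([qid, pid, nid]) + "\n")
    # Tab-separated id/text files for queries and the deduplicated corpus
    with open(os.path.join(out_dir, "queries.train.colbert.tsv"), "w") as f:
        for qid, text in queries.items():
            f.write(f"{qid}\t{text}\n")
    with open(os.path.join(out_dir, "corpus.train.colbert.tsv"), "w") as f:
        for pid, text in corpus.items():
            f.write(f"{pid}\t{text}\n")
```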
Usage
Use this principle after initializing a trainer and before launching training. This is required when:
- You have query-document relevance pairs and need to generate training triplets
- You want to augment training data with hard negatives for better model performance
- You need to convert various data formats into ColBERT's expected training format
Theoretical Basis
Effective contrastive learning for retrieval requires informative negative examples. Hard negatives — documents that are superficially similar to the query but not relevant — provide stronger training signal than random negatives:
Hard Negative Mining:
- Encode all documents with a dense embedding model (e.g., BGE, E5)
- Build an approximate nearest-neighbor index (Voyager)
- For each query, retrieve top-k similar documents
- Select documents ranked between min_rank (10) and max_rank (~110) as hard negatives
- These are similar enough to be challenging but not so similar that they risk being false negatives (i.e., relevant documents that simply lack a positive label)
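The rank-window selection can be sketched without the ANN index. Here a brute-force sort of `(doc_id, similarity)` pairs stands in for Voyager retrieval; the function name is hypothetical, and the `min_rank`/`max_rank` defaults mirror the values above:

```python
def mine_hard_negatives(sims, min_rank=10, max_rank=110, k=10):
    """Select hard negatives from a similarity ranking.

    sims: list of (doc_id, similarity) pairs for one query, as produced by a
          dense retriever (an ANN index like Voyager in the real pipeline).
    Documents ranked in [min_rank, max_rank) are kept: the very top ranks are
    skipped because they may be true positives, and everything past max_rank
    is too dissimilar to provide useful training signal.
    """
    ranked = sorted(sims, key=lambda pair: -pair[1])  # most similar first
    window = ranked[min_rank:max_rank]
    return [doc_id for doc_id, _ in window[:k]]
```

If the window yields nothing (e.g., a tiny corpus), the pipeline falls back to random negative sampling as noted above.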
Triplet Formation: For each query q with positive documents P and negative documents N:
- Generate up to 20 triplets per query
- Distribute negatives across positives evenly
- Shuffle with fixed seed for reproducibility
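The three steps above can be sketched in a few lines. This is an assumed implementation, not RAGatouille's own code; the round-robin assignment is one simple way to distribute negatives evenly across positives:

```python
import random

def make_triplets(query_id, positives, negatives, max_triplets=20, seed=42):
    """Form (query_id, positive_id, negative_id) triplets for one query.

    Negatives are assigned to positives round-robin so each positive is used
    evenly; max_triplets caps output per query (20 above); a fixed-seed
    shuffle keeps the result reproducible across runs.
    """
    triplets = [
        (query_id, positives[i % len(positives)], neg)
        for i, neg in enumerate(negatives[:max_triplets])
    ]
    random.Random(seed).shuffle(triplets)
    return triplets
```

For example, with 2 positives and 5 negatives, the first positive receives 3 negatives and the second receives 2, and the same seed always yields the same ordering.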