Principle:AnswerDotAI RAGatouille Training Data Preparation

From Leeroopedia
Domains NLP, Information_Retrieval, Training, Data_Processing
Last Updated 2026-02-12 12:00 GMT

Overview

A data processing pipeline that converts raw query-document pairs, triplets, or labeled pairs into ColBERT-compatible training files with optional hard negative mining.

Description

Training Data Preparation transforms user-provided training data into the format required by the ColBERT trainer. It accepts three input formats: unlabeled pairs (query, positive_passage), triplets (query, positive, negative), and labeled pairs (query, passage, label). The pipeline normalizes all formats into training triplets (query_id, positive_id, negative_id), optionally augments negatives via hard negative mining using dense embeddings, and exports the data as ColBERT-format files (triples JSONL, queries TSV, corpus TSV).
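The format-detection step can be sketched as follows. This is an illustrative helper, not RAGatouille's actual code: it infers the format from a sample row's arity, treating a numeric (or "0"/"1") third field as a relevance label.

```python
def detect_format(sample):
    """Guess the input format from one sample row (hypothetical sketch).

    2-tuples are (query, positive_passage) pairs; 3-tuples are either
    (query, positive, negative) triplets or (query, passage, label)
    labeled pairs, distinguished by whether the third field looks like
    a relevance label rather than passage text.
    """
    if len(sample) == 2:
        return "pairs"
    if len(sample) == 3:
        third = sample[2]
        # A numeric third field (or a "0"/"1" string) indicates a label.
        if isinstance(third, (int, float)) or str(third) in {"0", "1"}:
            return "labeled_pairs"
        return "triplets"
    raise ValueError(f"Unsupported row arity: {len(sample)}")
```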

Key processing steps:

  • Format detection from sample data (pairs vs triplets vs labeled_pairs)
  • Query and document extraction and deduplication
  • Optional hard negative mining via SimpleMiner using dense embedding models
  • Fallback to random negative sampling if hard negatives fail
  • Triplet generation with configurable negatives-per-query
  • Export to ColBERT-format files: triples.train.colbert.jsonl, queries.train.colbert.tsv, corpus.train.colbert.tsv

Usage

Use this principle after initializing a trainer and before launching training. This is required when:

  • You have query-document relevance pairs and need to generate training triplets
  • You want to augment training data with hard negatives for better model performance
  • You need to convert various data formats into ColBERT's expected training format

Theoretical Basis

Effective contrastive learning for retrieval requires informative negative examples. Hard negatives — documents that are superficially similar to the query but not relevant — provide stronger training signal than random negatives:

Hard Negative Mining:

  1. Encode all documents with a dense embedding model (e.g., BGE, E5)
  2. Build an approximate nearest-neighbor index (Voyager)
  3. For each query, retrieve top-k similar documents
  4. Select documents ranked between min_rank (10) and max_rank (~110) as hard negatives
  5. These are similar enough to be challenging but not so similar they might be false negatives
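The five steps above can be sketched in pure Python. A real pipeline would use a dense encoder and an ANN index (Voyager); here brute-force cosine similarity stands in for the retrieval step, and the parameter names are illustrative rather than RAGatouille's actual signature.

```python
import random

def mine_hard_negatives(query_vec, doc_vecs, positive_ids,
                        min_rank=10, max_rank=110, n_negatives=4, seed=42):
    """Select hard negatives from a rank window (illustrative sketch).

    doc_vecs maps doc_id -> embedding vector. Documents are ranked by
    cosine similarity to the query, then negatives are sampled from the
    [min_rank, max_rank) window: similar enough to be challenging, far
    enough down the ranking to be unlikely false negatives.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    ranked = sorted(doc_vecs,
                    key=lambda d: cosine(query_vec, doc_vecs[d]),
                    reverse=True)
    # Keep only the rank window, excluding known positives.
    window = [d for d in ranked[min_rank:max_rank] if d not in positive_ids]
    rng = random.Random(seed)
    return rng.sample(window, min(n_negatives, len(window)))
```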

Triplet Formation: For each query q with positive documents P and negative documents N:

  • Generate up to 20 triplets per query
  • Distribute negatives across positives evenly
  • Shuffle with fixed seed for reproducibility
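The triplet-formation rules above can be sketched as follows; this is an illustrative stand-in, not RAGatouille's code. Negatives are cycled across positives so each positive receives an even share, output is capped per query, and a fixed seed makes the shuffle reproducible.

```python
import random

def make_triplets(query_id, positive_ids, negative_ids,
                  max_triplets=20, seed=42):
    """Form training triplets for one query (hypothetical sketch).

    Each negative is paired with a positive in round-robin order, the
    total is capped at max_triplets, and the result is shuffled with a
    fixed seed for reproducibility.
    """
    rng = random.Random(seed)
    triplets = []
    for i, neg_id in enumerate(negative_ids):
        pos_id = positive_ids[i % len(positive_ids)]  # even distribution
        triplets.append((query_id, pos_id, neg_id))
        if len(triplets) >= max_triplets:
            break
    rng.shuffle(triplets)
    return triplets
```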

Related Pages

Implemented By
