
Workflow:AnswerDotAI RAGatouille ColBERT Training

From Leeroopedia
Knowledge Sources
Domains Information_Retrieval, Model_Training, ColBERT
Last Updated 2026-02-12 12:00 GMT

Overview

End-to-end process for fine-tuning an existing ColBERT model or training a new ColBERT model from a BERT-like checkpoint using RAGatouille's RAGTrainer.

Description

This workflow covers the training pipeline in RAGatouille: initializing a trainer with a base model, preparing training data from various input formats (pairs, labeled pairs, triplets), optionally mining hard negatives using dense retrieval, and launching the ColBERT training loop. The trainer automatically handles data format conversion, triplet generation, negative sampling, and ColBERT-specific configuration.

Key outputs:

  • A fine-tuned ColBERT model checkpoint saved to disk
  • Processed training data in ColBERT format (queries, corpus, triplets)

Scope:

  • From raw query-document pairs to a trained ColBERT model
  • Covers data preparation, hard negative mining, and training launch

Strategy:

  • If the pretrained model is already a ColBERT checkpoint, the trainer runs in fine-tuning mode
  • If it is a generic BERT/RoBERTa model, it trains a new ColBERT from scratch

Usage

Execute this workflow when you need to adapt a ColBERT model to a specific domain or language. This is the right workflow when:

  • You have domain-specific query-document relevance data (pairs, triplets, or labeled pairs)
  • The pretrained ColBERTv2 zero-shot performance is insufficient for your use case
  • You want to train a ColBERT model from a non-English BERT checkpoint
  • You have limited GPU resources and want data-efficient fine-tuning

Execution Steps

Step 1: Initialize the Trainer

Create a RAGTrainer instance by specifying a name for the output model and the pretrained base model to start from. The trainer loads the base model in training mode (no inference checkpoint is created). Optionally set the language code for hard negative mining model selection and configure GPU usage.

Key considerations:

  • `model_name` determines the checkpoint naming on disk
  • `pretrained_model_name` can be a HuggingFace ColBERT checkpoint (fine-tuning) or a BERT-like model (training from scratch)
  • `language_code` affects which dense embedding model is used for hard negative mining (supports en, zh, fr, and multilingual)

Step 2: Prepare Training Data

Pass raw training data to the trainer's data preparation pipeline. The processor accepts multiple input formats and automatically converts them to ColBERT training triplets. Optionally provide a full document corpus for negative sampling.

What the processor handles:

  • Pairs (query, relevant_passage): Assumes all pairs are positive; generates negatives automatically
  • Labeled pairs (query, passage, label): Separates positives and negatives by label value
  • Triplets (query, positive, negative): Uses provided negatives, optionally augments with more
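For illustration, the three accepted input shapes as Python literals (the queries and passages here are invented):

```python
# (query, relevant_passage): every pair is treated as a positive example
pairs = [
    ("what is late interaction?",
     "ColBERT scores queries against documents token-by-token."),
]

# (query, passage, label): positives and negatives separated by the label value
labeled_pairs = [
    ("what is late interaction?",
     "ColBERT scores queries against documents token-by-token.", 1),
    ("what is late interaction?",
     "BM25 is a purely lexical ranking function.", 0),
]

# (query, positive_passage, negative_passage): negatives supplied explicitly
triplets = [
    ("what is late interaction?",
     "ColBERT scores queries against documents token-by-token.",
     "BM25 is a purely lexical ranking function."),
]
```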

Hard negative mining:

  • When enabled, a dense embedding model (sentence-transformers) embeds the full corpus
  • A Voyager ANN index is built for fast approximate nearest-neighbor retrieval
  • For each query, documents ranked between position 10 and 110 are sampled as hard negatives
  • Multi-language support via language-specific embedding models (BGE, GTE, Solon, multilingual-E5)
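The rank-window heuristic above can be sketched in plain Python. This is a simplification for intuition only; `sample_hard_negatives` and its defaults are illustrative, not the library's internals:

```python
import random

def sample_hard_negatives(ranked_doc_ids, num_negatives=10, low=10, high=110, seed=42):
    """Skip the top-ranked documents (likely true positives) and sample
    hard negatives from the rank window [low, high)."""
    window = ranked_doc_ids[low:high]
    rng = random.Random(seed)
    return rng.sample(window, min(num_negatives, len(window)))

# Doc ids sorted by dense-retrieval similarity to one query (best first)
ranked = list(range(500))
negatives = sample_hard_negatives(ranked)
```

Sampling from the 10–110 window rather than the very top avoids treating unlabeled true positives as negatives, while still yielding negatives that are hard enough to be informative.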

Data export:

  • Processed data is written to disk in ColBERT format: `queries.train.colbert.tsv`, `corpus.train.colbert.tsv`, and `triples.train.colbert.jsonl`

Step 3: Configure Training Parameters

The training configuration is assembled from the base model's existing ColBERT config merged with user-specified overrides. Key hyperparameters include batch size, learning rate, maximum steps, vector dimensionality, and quantization bits.

Key parameters:

  • `batch_size`: Total batch size across all GPUs (default 32)
  • `learning_rate`: Typically 3e-6 to 2e-5 depending on data size (default 5e-6)
  • `maxsteps`: Maximum number of training steps before the loop stops (default 500,000)
  • `nbits`: Vector compression bits for the trained model (default 2)
  • `dim`: Dimensionality of each token-level vector representation (default 128)
  • `warmup_steps`: Defaults to 10% of total steps when set to auto
  • `use_ib_negatives`: Whether to use in-batch negatives for loss calculation (default True)
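Assembled into a training call, this might look as follows. The values are the defaults listed above; the launch itself is shown commented out because it starts a long-running training job:

```python
train_kwargs = dict(
    batch_size=32,          # total across all GPUs
    learning_rate=5e-6,     # raise toward 2e-5 for larger datasets
    maxsteps=500_000,
    nbits=2,                # compression bits for the trained model
    dim=128,                # per-token vector size
    warmup_steps="auto",    # resolves to 10% of total steps
    use_ib_negatives=True,  # include in-batch negatives in the loss
)

# With a trainer from Step 1 and data prepared in Step 2:
# trainer.train(**train_kwargs)
```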

Step 4: Launch Training

Execute the ColBERT training loop using the colbert-ai Trainer. Training reads the exported triplets, queries, and corpus files, applies the merged configuration, and runs gradient updates. The trainer automatically saves checkpoints at regular intervals and tracks the best checkpoint path.

What happens:

  • The ColBERT Trainer is initialized with paths to the exported training data files
  • Training uses 2-way contrastive loss (query, positive, negative)
  • Checkpoints are saved every 1/10th of total steps
  • The path to the best checkpoint is returned upon completion
