
Workflow:AnswerDotAI RAGatouille ColBERT Training

From Leeroopedia
Knowledge Sources
Domains Information_Retrieval, Model_Training, ColBERT
Last Updated 2026-02-12 12:00 GMT

Overview

End-to-end process for fine-tuning an existing ColBERT model or training a new ColBERT model from a BERT-like checkpoint using RAGatouille's RAGTrainer.

Description

This workflow covers the training pipeline in RAGatouille: initializing a trainer with a base model, preparing training data from various input formats (pairs, labeled pairs, triplets), optionally mining hard negatives using dense retrieval, and launching the ColBERT training loop. The trainer automatically handles data format conversion, triplet generation, negative sampling, and ColBERT-specific configuration.

Key outputs:

  • A fine-tuned ColBERT model checkpoint saved to disk
  • Processed training data in ColBERT format (queries, corpus, triplets)

Scope:

  • From raw query-document pairs to a trained ColBERT model
  • Covers data preparation, hard negative mining, and training launch

Strategy:

  • If the pretrained model is already a ColBERT checkpoint, the trainer runs in fine-tuning mode
  • If it is a generic BERT/RoBERTa model, it trains a new ColBERT from scratch

Usage

Execute this workflow when you need to adapt a ColBERT model to a specific domain or language. This is the right workflow when:

  • You have domain-specific query-document relevance data (pairs, triplets, or labeled pairs)
  • The pretrained ColBERTv2 zero-shot performance is insufficient for your use case
  • You want to train a ColBERT model from a non-English BERT checkpoint
  • You have limited GPU resources and want data-efficient fine-tuning

Execution Steps

Step 1: Initialize the Trainer

Create a RAGTrainer instance by specifying a name for the output model and the pretrained base model to start from. The trainer loads the base model in training mode (no inference checkpoint is created). Optionally set the language code for hard negative mining model selection and configure GPU usage.

Key considerations:

  • `model_name` determines the checkpoint naming on disk
  • `pretrained_model_name` can be a HuggingFace ColBERT checkpoint (fine-tuning) or a BERT-like model (training from scratch)
  • `language_code` affects which dense embedding model is used for hard negative mining (supports en, zh, fr, and multilingual)

Step 2: Prepare Training Data

Pass raw training data to the trainer's data preparation pipeline. The processor accepts multiple input formats and automatically converts them to ColBERT training triplets. Optionally provide a full document corpus for negative sampling.

What the processor handles:

  • Pairs (query, relevant_passage): Assumes all pairs are positive; generates negatives automatically
  • Labeled pairs (query, passage, label): Separates positives and negatives by label value
  • Triplets (query, positive, negative): Uses provided negatives, optionally augments with more
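For illustration, the three accepted input shapes as Python literals (the queries and passages here are invented):

```python
# (query, relevant_passage): every pair is treated as a positive example
pairs = [
    ("what is late interaction?",
     "ColBERT scores queries against documents token-by-token."),
]

# (query, passage, label): positives and negatives separated by the label value
labeled_pairs = [
    ("what is late interaction?",
     "ColBERT scores queries against documents token-by-token.", 1),
    ("what is late interaction?",
     "BM25 is a purely lexical ranking function.", 0),
]

# (query, positive_passage, negative_passage): negatives supplied explicitly
triplets = [
    ("what is late interaction?",
     "ColBERT scores queries against documents token-by-token.",
     "BM25 is a purely lexical ranking function."),
]
```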

Hard negative mining:

  • When enabled, a dense embedding model (sentence-transformers) embeds the full corpus
  • A Voyager ANN index is built for fast approximate nearest-neighbor retrieval
  • For each query, documents ranked between position 10 and 110 are sampled as hard negatives
  • Multi-language support via language-specific embedding models (BGE, GTE, Solon, multilingual-E5)
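The rank-window heuristic above can be sketched in plain Python. This is a simplification for intuition only; `sample_hard_negatives` and its defaults are illustrative, not the library's internals:

```python
import random

def sample_hard_negatives(ranked_doc_ids, num_negatives=10, low=10, high=110, seed=42):
    """Skip the top-ranked documents (likely true positives) and sample
    hard negatives from the rank window [low, high)."""
    window = ranked_doc_ids[low:high]
    rng = random.Random(seed)
    return rng.sample(window, min(num_negatives, len(window)))

# Doc ids sorted by dense-retrieval similarity to one query (best first)
ranked = list(range(500))
negatives = sample_hard_negatives(ranked)
```

Sampling from the 10–110 window rather than the very top avoids treating unlabeled true positives as negatives, while still yielding negatives that are hard enough to be informative.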

Data export:

  • Processed data is written to disk in ColBERT format: `queries.train.colbert.tsv`, `corpus.train.colbert.tsv`, and `triples.train.colbert.jsonl`

Step 3: Configure Training Parameters

The training configuration is assembled from the base model's existing ColBERT config merged with user-specified overrides. Key hyperparameters include batch size, learning rate, maximum steps, vector dimensionality, and quantization bits.

Key parameters:

  • `batch_size`: Total batch size across all GPUs (default 32)
  • `learning_rate`: Typically 3e-6 to 2e-5 depending on data size (default 5e-6)
  • `maxsteps`: Maximum number of training steps before the loop stops (default 500,000)
  • `nbits`: Vector compression bits for the trained model (default 2)
  • `dim`: Dimensionality of each token-level vector representation (default 128)
  • `warmup_steps`: Defaults to 10% of total steps when set to auto
  • `use_ib_negatives`: Whether to use in-batch negatives for loss calculation (default True)
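Assembled into a training call, this might look as follows. The values are the defaults listed above; the launch itself is shown commented out because it starts a long-running training job:

```python
train_kwargs = dict(
    batch_size=32,          # total across all GPUs
    learning_rate=5e-6,     # raise toward 2e-5 for larger datasets
    maxsteps=500_000,
    nbits=2,                # compression bits for the trained model
    dim=128,                # per-token vector size
    warmup_steps="auto",    # resolves to 10% of total steps
    use_ib_negatives=True,  # include in-batch negatives in the loss
)

# With a trainer from Step 1 and data prepared in Step 2:
# trainer.train(**train_kwargs)
```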

Step 4: Launch Training

Execute the ColBERT training loop using the colbert-ai Trainer. Training reads the exported triplets, queries, and corpus files, applies the merged configuration, and runs gradient updates. The trainer automatically saves checkpoints at regular intervals and tracks the best checkpoint path.

What happens:

  • The ColBERT Trainer is initialized with paths to the exported training data files
  • Training uses 2-way contrastive loss (query, positive, negative)
  • Checkpoints are saved every 1/10th of total steps
  • The path to the best checkpoint is returned upon completion
