Implementation:AnswerDotAI RAGatouille RAGTrainer Train
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Training, Fine_Tuning |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Concrete method provided by the RAGatouille library for launching ColBERT model training on triplet data prepared by prepare_training_data().
Description
The RAGTrainer.train() method launches the ColBERT training loop. It constructs a ColBERTConfig from user parameters and auto-computed values (warmup steps, save frequency), then delegates to ColBERT.train() which initializes the colbert-ai Trainer with training data files and runs the optimization loop. The method returns the path to the best model checkpoint.
The delegation chain:
- RAGTrainer.train() → constructs config, delegates to model
- ColBERT.train() → merges configs (nway=2), creates colbert-ai Trainer, runs training
- colbert.Trainer.train() → executes the training loop
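The delegation chain above can be sketched with toy stand-in classes. The `Fake`-prefixed names and method bodies are illustrative assumptions, not RAGatouille or colbert-ai source; only the shape of the chain (config construction, nway=2 merge, checkpoint path return) mirrors the description.

```python
class FakeColbertAITrainer:
    """Stand-in for colbert.Trainer: would run the optimization loop."""
    def __init__(self, config):
        self.config = config

    def train(self):
        # The real trainer runs the loop; here we just return a checkpoint path.
        return f"/checkpoints/{self.config['name']}/best"


class FakeColBERT:
    """Stand-in for RAGatouille's ColBERT model wrapper."""
    def train(self, user_config):
        # Merge user config with fixed values (the real code sets nway=2).
        merged = {**user_config, "nway": 2}
        return FakeColbertAITrainer(merged).train()


class FakeRAGTrainer:
    """Stand-in for RAGTrainer: builds the config, delegates to the model."""
    def __init__(self, name):
        self.name = name
        self.model = FakeColBERT()

    def train(self, maxsteps=500_000, warmup_steps="auto"):
        if warmup_steps == "auto":
            warmup_steps = maxsteps // 10  # "auto" = 10% of total steps
        config = {
            "name": self.name,
            "maxsteps": maxsteps,
            "warmup_steps": warmup_steps,
        }
        return self.model.train(config)


path = FakeRAGTrainer("my_colbert").train(maxsteps=1000)
```

Each layer only adds or resolves configuration before handing off, which is why the checkpoint path surfaces unchanged from the innermost trainer back to the caller.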
Usage
Call after prepare_training_data(). The returned checkpoint path can be used with RAGPretrainedModel.from_pretrained() for inference.
Code Reference
Source Location
- Repository: RAGatouille
- File: ragatouille/RAGTrainer.py
- Lines: L181-238
Signature
def train(
    self,
    batch_size: int = 32,
    nbits: int = 2,
    maxsteps: int = 500_000,
    use_ib_negatives: bool = True,
    learning_rate: float = 5e-6,
    dim: int = 128,
    doc_maxlen: int = 256,
    use_relu: bool = False,
    warmup_steps: Union[int, Literal["auto"]] = "auto",
    accumsteps: int = 1,
) -> str:
    """
    Launch training or fine-tuning of a ColBERT model.

    Parameters:
        batch_size: Total batch size (default 32).
        nbits: Compression bits (default 2).
        maxsteps: Max training steps (default 500,000).
        use_ib_negatives: Use in-batch negatives (default True).
        learning_rate: Learning rate (default 5e-6).
        dim: Embedding dimension (default 128).
        doc_maxlen: Max document length (default 256).
        use_relu: Use ReLU on embeddings (default False).
        warmup_steps: Warmup steps ("auto" = 10% of total).
        accumsteps: Gradient accumulation steps (default 1).

    Returns:
        str: Path to the best model checkpoint.
    """
Import
from ragatouille import RAGTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| batch_size | int | No | Total batch size across GPUs (default 32) |
| nbits | int | No | Vector compression bits (default 2) |
| maxsteps | int | No | Maximum training steps (default 500,000) |
| use_ib_negatives | bool | No | Enable in-batch negatives (default True) |
| learning_rate | float | No | Learning rate (default 5e-6, recommended 3e-6 to 2e-5) |
| dim | int | No | Embedding dimension (default 128) |
| doc_maxlen | int | No | Maximum document token length (default 256) |
| use_relu | bool | No | Apply ReLU to embeddings (default False) |
| warmup_steps | Union[int, Literal["auto"]] | No | Warmup steps ("auto" = 10% of total, default "auto") |
| accumsteps | int | No | Gradient accumulation steps (default 1) |
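The "auto" rule for warmup_steps in the table can be shown as plain arithmetic. This is a sketch of the documented behaviour (10% of maxsteps), not the library's exact source:

```python
def resolve_warmup_steps(warmup_steps, maxsteps: int) -> int:
    """Resolve warmup_steps as documented: "auto" means 10% of maxsteps."""
    if warmup_steps == "auto":
        return maxsteps // 10
    return int(warmup_steps)


# With the default maxsteps of 500,000, "auto" yields 50,000 warmup steps.
print(resolve_warmup_steps("auto", 500_000))  # 50000
# An explicit integer is passed through unchanged.
print(resolve_warmup_steps(1000, 500_000))    # 1000
```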
Outputs
| Name | Type | Description |
|---|---|---|
| return | str | Path to the best model checkpoint directory |
Usage Examples
Standard Fine-tuning
from ragatouille import RAGTrainer
trainer = RAGTrainer(
    model_name="my_colbert",
    pretrained_model_name="colbert-ir/colbertv2.0",
)
trainer.prepare_training_data(
    raw_data=[("query", "relevant doc"), ...],
)
model_path = trainer.train(
    batch_size=32,
    learning_rate=5e-6,
    maxsteps=100_000,
)
print(f"Best checkpoint: {model_path}")
Training with Custom Parameters
model_path = trainer.train(
    batch_size=64,
    nbits=2,
    maxsteps=200_000,
    use_ib_negatives=True,
    learning_rate=1e-5,
    dim=128,
    doc_maxlen=512,
    warmup_steps=1000,
    accumsteps=2,
)