Implementation:AnswerDotAI RAGatouille RAGTrainer Train
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Training, Fine_Tuning |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Concrete method provided by the RAGatouille library for launching ColBERT model training on triplet data prepared by prepare_training_data().
Description
The RAGTrainer.train() method launches the ColBERT training loop. It constructs a ColBERTConfig from user parameters and auto-computed values (warmup steps, save frequency), then delegates to ColBERT.train() which initializes the colbert-ai Trainer with training data files and runs the optimization loop. The method returns the path to the best model checkpoint.
The delegation chain:
- RAGTrainer.train() → constructs config, delegates to model
- ColBERT.train() → merges configs (nway=2), creates colbert-ai Trainer, runs training
- colbert.Trainer.train() → executes the training loop
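The delegation chain above can be sketched with toy stand-in classes. The `Fake`-prefixed names and method bodies are illustrative assumptions, not RAGatouille or colbert-ai source; only the shape of the chain (config construction, nway=2 merge, checkpoint path return) mirrors the description.

```python
class FakeColbertAITrainer:
    """Stand-in for colbert.Trainer: would run the optimization loop."""
    def __init__(self, config):
        self.config = config

    def train(self):
        # The real trainer runs the loop; here we just return a checkpoint path.
        return f"/checkpoints/{self.config['name']}/best"


class FakeColBERT:
    """Stand-in for RAGatouille's ColBERT model wrapper."""
    def train(self, user_config):
        # Merge user config with fixed values (the real code sets nway=2).
        merged = {**user_config, "nway": 2}
        return FakeColbertAITrainer(merged).train()


class FakeRAGTrainer:
    """Stand-in for RAGTrainer: builds the config, delegates to the model."""
    def __init__(self, name):
        self.name = name
        self.model = FakeColBERT()

    def train(self, maxsteps=500_000, warmup_steps="auto"):
        if warmup_steps == "auto":
            warmup_steps = maxsteps // 10  # "auto" = 10% of total steps
        config = {
            "name": self.name,
            "maxsteps": maxsteps,
            "warmup_steps": warmup_steps,
        }
        return self.model.train(config)


path = FakeRAGTrainer("my_colbert").train(maxsteps=1000)
```

Each layer only adds or resolves configuration before handing off, which is why the checkpoint path surfaces unchanged from the innermost trainer back to the caller.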
Usage
Call after prepare_training_data(). The returned checkpoint path can be used with RAGPretrainedModel.from_pretrained() for inference.
Code Reference
Source Location
- Repository: RAGatouille
- File: ragatouille/RAGTrainer.py
- Lines: L181-238
Signature
def train(
    self,
    batch_size: int = 32,
    nbits: int = 2,
    maxsteps: int = 500_000,
    use_ib_negatives: bool = True,
    learning_rate: float = 5e-6,
    dim: int = 128,
    doc_maxlen: int = 256,
    use_relu: bool = False,
    warmup_steps: Union[int, Literal["auto"]] = "auto",
    accumsteps: int = 1,
) -> str:
    """
    Launch training or fine-tuning of a ColBERT model.

    Parameters:
        batch_size: Total batch size (default 32).
        nbits: Compression bits (default 2).
        maxsteps: Max training steps (default 500,000).
        use_ib_negatives: Use in-batch negatives (default True).
        learning_rate: Learning rate (default 5e-6).
        dim: Embedding dimension (default 128).
        doc_maxlen: Max document length (default 256).
        use_relu: Use ReLU on embeddings (default False).
        warmup_steps: Warmup steps ("auto" = 10% of total).
        accumsteps: Gradient accumulation steps (default 1).

    Returns:
        str: Path to the best model checkpoint.
    """
Import
from ragatouille import RAGTrainer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| batch_size | int | No | Total batch size across GPUs (default 32) |
| nbits | int | No | Vector compression bits (default 2) |
| maxsteps | int | No | Maximum training steps (default 500,000) |
| use_ib_negatives | bool | No | Enable in-batch negatives (default True) |
| learning_rate | float | No | Learning rate (default 5e-6, recommended 3e-6 to 2e-5) |
| dim | int | No | Embedding dimension (default 128) |
| doc_maxlen | int | No | Maximum document token length (default 256) |
| use_relu | bool | No | Apply ReLU to embeddings (default False) |
| warmup_steps | Union[int, Literal["auto"]] | No | Warmup steps ("auto" = 10% of total, default "auto") |
| accumsteps | int | No | Gradient accumulation steps (default 1) |
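The "auto" rule for warmup_steps in the table can be shown as plain arithmetic. This is a sketch of the documented behaviour (10% of maxsteps), not the library's exact source:

```python
def resolve_warmup_steps(warmup_steps, maxsteps: int) -> int:
    """Resolve warmup_steps as documented: "auto" means 10% of maxsteps."""
    if warmup_steps == "auto":
        return maxsteps // 10
    return int(warmup_steps)


# With the default maxsteps of 500,000, "auto" yields 50,000 warmup steps.
print(resolve_warmup_steps("auto", 500_000))  # 50000
# An explicit integer is passed through unchanged.
print(resolve_warmup_steps(1000, 500_000))    # 1000
```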
Outputs
| Name | Type | Description |
|---|---|---|
| return | str | Path to the best model checkpoint directory |
Usage Examples
Standard Fine-tuning
from ragatouille import RAGTrainer
trainer = RAGTrainer(
    model_name="my_colbert",
    pretrained_model_name="colbert-ir/colbertv2.0",
)
trainer.prepare_training_data(
    raw_data=[("query", "relevant doc"), ...],
)
model_path = trainer.train(
    batch_size=32,
    learning_rate=5e-6,
    maxsteps=100_000,
)
print(f"Best checkpoint: {model_path}")
Training with Custom Parameters
model_path = trainer.train(
    batch_size=64,
    nbits=2,
    maxsteps=200_000,
    use_ib_negatives=True,
    learning_rate=1e-5,
    dim=128,
    doc_maxlen=512,
    warmup_steps=1000,
    accumsteps=2,
)