
Implementation:AnswerDotAI RAGatouille RAGTrainer Train

From Leeroopedia
Domains NLP, Information_Retrieval, Training, Fine_Tuning
Last Updated 2026-02-12 12:00 GMT

Overview

A concrete tool from the RAGatouille library for launching ColBERT model training on prepared triplet data.

Description

The RAGTrainer.train() method launches the ColBERT training loop. It constructs a ColBERTConfig from user parameters and auto-computed values (warmup steps, save frequency), then delegates to ColBERT.train(), which initializes the colbert-ai Trainer with the training data files and runs the optimization loop. The method returns the path to the best model checkpoint.

The delegation chain:

  • RAGTrainer.train() → constructs config, delegates to model
  • ColBERT.train() → merges configs (nway=2), creates colbert-ai Trainer, runs training
  • colbert.Trainer.train() → executes the training loop
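One of the auto-computed values in the chain above is the warmup schedule: per the docstring, warmup_steps="auto" resolves to 10% of the total steps. A minimal sketch of that resolution (the exact rounding RAGatouille uses internally is an assumption here, as is the helper name):

```python
from typing import Literal, Union


def resolve_warmup_steps(
    warmup_steps: Union[int, Literal["auto"]], maxsteps: int
) -> int:
    """Resolve "auto" to 10% of total steps, per the documented behaviour.

    Assumption: integer truncation; the library may round differently.
    """
    if warmup_steps == "auto":
        return int(0.1 * maxsteps)
    return warmup_steps


resolve_warmup_steps("auto", 500_000)  # 50000 with the default maxsteps
resolve_warmup_steps(1000, 200_000)    # explicit values pass through unchanged
```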

Usage

Call after prepare_training_data(). The returned checkpoint path can be used with RAGPretrainedModel.from_pretrained() for inference.

Code Reference

Source Location

  • Repository: RAGatouille
  • File: ragatouille/RAGTrainer.py
  • Lines: L181-238

Signature

def train(
    self,
    batch_size: int = 32,
    nbits: int = 2,
    maxsteps: int = 500_000,
    use_ib_negatives: bool = True,
    learning_rate: float = 5e-6,
    dim: int = 128,
    doc_maxlen: int = 256,
    use_relu: bool = False,
    warmup_steps: Union[int, Literal["auto"]] = "auto",
    accumsteps: int = 1,
) -> str:
    """
    Launch training or fine-tuning of a ColBERT model.

    Parameters:
        batch_size: Total batch size (default 32).
        nbits: Compression bits (default 2).
        maxsteps: Max training steps (default 500,000).
        use_ib_negatives: Use in-batch negatives (default True).
        learning_rate: Learning rate (default 5e-6).
        dim: Embedding dimension (default 128).
        doc_maxlen: Max document length (default 256).
        use_relu: Use ReLU on embeddings (default False).
        warmup_steps: Warmup steps ("auto" = 10% of total).
        accumsteps: Gradient accumulation steps (default 1).

    Returns:
        str: Path to the best model checkpoint.
    """

Import

from ragatouille import RAGTrainer

I/O Contract

Inputs

Name Type Required Description
batch_size int No Total batch size across GPUs (default 32)
nbits int No Vector compression bits (default 2)
maxsteps int No Maximum training steps (default 500,000)
use_ib_negatives bool No Enable in-batch negatives (default True)
learning_rate float No Learning rate (default 5e-6, recommended 3e-6 to 2e-5)
dim int No Embedding dimension (default 128)
doc_maxlen int No Maximum document token length (default 256)
use_relu bool No Apply ReLU to embeddings (default False)
warmup_steps Union[int, Literal["auto"]] No Warmup steps ("auto" = 10% of total, default "auto")
accumsteps int No Gradient accumulation steps (default 1)
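One way to read the batch_size and accumsteps fields together: gradient accumulation splits one optimizer-step batch into several smaller forward passes. Whether colbert-ai divides batch_size by accumsteps (as sketched below) or multiplies it is an assumption on our part; the helper name is hypothetical:

```python
def micro_batch_size(batch_size: int, accumsteps: int) -> int:
    """Per-forward-pass batch size under gradient accumulation.

    Assumption: the total batch_size is split across accumsteps micro-batches,
    so gradients from accumsteps forward passes accumulate before one
    optimizer update.
    """
    if batch_size % accumsteps != 0:
        raise ValueError("batch_size should be divisible by accumsteps")
    return batch_size // accumsteps


micro_batch_size(32, 1)  # the defaults: one forward pass per step
micro_batch_size(64, 2)  # the custom example below: two passes of 32
```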

Outputs

Name Type Description
return str Path to the best model checkpoint directory

Usage Examples

Standard Fine-tuning

from ragatouille import RAGTrainer

trainer = RAGTrainer(
    model_name="my_colbert",
    pretrained_model_name="colbert-ir/colbertv2.0",
)

trainer.prepare_training_data(
    raw_data=[("query", "relevant doc"), ...],
)

model_path = trainer.train(
    batch_size=32,
    learning_rate=5e-6,
    maxsteps=100_000,
)

print(f"Best checkpoint: {model_path}")
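As the Usage section notes, the returned checkpoint path plugs into RAGPretrainedModel.from_pretrained() for inference. A minimal sketch (the wrapper function name is ours; the import is deferred so the sketch stays importable without ragatouille installed):

```python
def load_for_inference(checkpoint_path: str):
    """Load the best checkpoint returned by train() for downstream search.

    Sketch only: assumes the ragatouille package is installed.
    """
    from ragatouille import RAGPretrainedModel

    return RAGPretrainedModel.from_pretrained(checkpoint_path)
```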

Training with Custom Parameters

model_path = trainer.train(
    batch_size=64,
    nbits=2,
    maxsteps=200_000,
    use_ib_negatives=True,
    learning_rate=1e-5,
    dim=128,
    doc_maxlen=512,
    warmup_steps=1000,
    accumsteps=2,
)

Related Pages

Implements Principle

Requires Environment
