Implementation:AnswerDotAI RAGatouille RAGPretrainedModel Index
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Indexing |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Concrete tool, provided by the RAGatouille library, for building a PLAID document index from a text collection.
Description
The RAGPretrainedModel.index() method is the primary API for building a searchable document index. It orchestrates the full indexing pipeline: corpus processing (splitting documents into passages via CorpusProcessor), passage-to-document ID mapping, and delegation to ColBERT.index(), which constructs the PLAID index via ModelIndexFactory. The method supports configurable document splitting, custom preprocessing functions, batch size control, and a choice between PyTorch KMeans (the default for collections under 75k documents) and FAISS for centroid computation.
The delegation chain is:
- RAGPretrainedModel.index() → processes corpus, creates ID mappings
- ColBERT.index() → configures index path, delegates to ModelIndexFactory
- PLAIDModelIndex.build() → runs the colbert-ai Indexer with monkey-patched KMeans
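The chain above can be sketched with illustrative stubs. The function bodies below are simplified placeholders, not the library's actual code: real corpus processing uses sentence-aware splitting, and the index path shown is only an illustration of the on-disk layout.

```python
# Illustrative sketch of the index() delegation chain.
# These stubs mirror the shape of the pipeline only; the real classes
# (CorpusProcessor, ColBERT, PLAIDModelIndex) do far more work.
import uuid


def process_corpus(collection, document_ids=None, max_document_length=256):
    """Split each document into passages and map each passage ID back to its document ID."""
    doc_ids = document_ids or [str(uuid.uuid4()) for _ in collection]
    passages, pid_to_docid = [], {}
    for doc, doc_id in zip(collection, doc_ids):
        # Stand-in for CorpusProcessor / llama_index_sentence_splitter:
        # naive fixed-size chunking by characters instead of sentences.
        chunks = [doc[i:i + max_document_length]
                  for i in range(0, len(doc), max_document_length)] or [""]
        for chunk in chunks:
            pid_to_docid[len(passages)] = doc_id
            passages.append(chunk)
    return passages, pid_to_docid


def build_plaid_index(passages, index_name):
    """Stand-in for ColBERT.index() -> PLAIDModelIndex.build(); returns a mock index path."""
    return f".ragatouille/colbert/indexes/{index_name}"


def index(collection, document_ids=None, index_name="my_index", max_document_length=256):
    """Stand-in for RAGPretrainedModel.index(): process corpus, map IDs, delegate."""
    passages, pid_to_docid = process_corpus(collection, document_ids, max_document_length)
    return build_plaid_index(passages, index_name), pid_to_docid
```

The key invariant the sketch preserves is that every passage ID produced by splitting maps back to exactly one original document ID, which is what lets search results be aggregated per document later.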
Usage
Use this method after loading a pretrained model to build a persistent index over your document collection. The built index is stored on disk and can be loaded later with RAGPretrainedModel.from_index().
Code Reference
Source Location
- Repository: RAGatouille
- File: ragatouille/RAGPretrainedModel.py
- Lines: L171-220
Signature
def index(
    self,
    collection: list[str],
    document_ids: Union[TypeVar("T"), List[TypeVar("T")]] = None,
    document_metadatas: Optional[list[dict]] = None,
    index_name: str = None,
    overwrite_index: Union[bool, str] = True,
    max_document_length: int = 256,
    split_documents: bool = True,
    document_splitter_fn: Optional[Callable] = llama_index_sentence_splitter,
    preprocessing_fn: Optional[Union[Callable, list[Callable]]] = None,
    bsize: int = 32,
    use_faiss: bool = False,
) -> str:
    """Build an index from a list of documents.

    Parameters:
        collection: The collection of documents to index.
        document_ids: Optional list of document IDs.
        document_metadatas: Optional list of metadata dicts.
        index_name: Name of the index to build.
        overwrite_index: Whether to overwrite existing index.
        max_document_length: Maximum passage length (256 default).
        split_documents: Whether to split documents into chunks.
        document_splitter_fn: Splitter function (default: llama_index_sentence_splitter).
        preprocessing_fn: Optional preprocessing function(s).
        bsize: Batch size for encoding passages (32 default).
        use_faiss: Use FAISS instead of PyTorch KMeans.

    Returns:
        str: Path to the built index directory.
    """
Import
from ragatouille import RAGPretrainedModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| collection | list[str] | Yes | List of document strings to index |
| document_ids | Union[T, List[T]] | No | Optional document IDs. Auto-generated UUIDs if not provided |
| document_metadatas | Optional[list[dict]] | No | Optional metadata dicts, one per document |
| index_name | str | No | Name for the index. Auto-generated if not provided |
| overwrite_index | Union[bool, str] | No | Whether to overwrite existing index (default True) |
| max_document_length | int | No | Maximum chunk length in tokens (default 256) |
| split_documents | bool | No | Whether to split documents into chunks (default True) |
| document_splitter_fn | Optional[Callable] | No | Splitter function (default: llama_index_sentence_splitter) |
| preprocessing_fn | Optional[Union[Callable, list[Callable]]] | No | Optional preprocessing function(s) |
| bsize | int | No | Encoding batch size (default 32) |
| use_faiss | bool | No | Use FAISS for KMeans (default False) |
Outputs
| Name | Type | Description |
|---|---|---|
| return | str | Path to the built index directory on disk |
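How split_documents, document_splitter_fn, and preprocessing_fn compose can be sketched as a small pipeline. The signatures below are simplified assumptions (a splitter taking documents plus a max length, and preprocessing functions mapping str to str); the library's real default splitter is llama_index_sentence_splitter, which splits on sentence boundaries rather than characters:

```python
# Sketch of how the splitting/preprocessing inputs compose before encoding.
# Signatures are simplified assumptions, not the library's real interfaces.
from typing import Callable, Optional, Union


def naive_splitter(documents: list[str], max_length: int) -> list[str]:
    """Placeholder for llama_index_sentence_splitter: fixed-size character chunks."""
    return [doc[i:i + max_length]
            for doc in documents
            for i in range(0, len(doc), max_length)]


def prepare_passages(
    collection: list[str],
    split_documents: bool = True,
    document_splitter_fn: Callable = naive_splitter,
    preprocessing_fn: Optional[Union[Callable, list[Callable]]] = None,
    max_document_length: int = 256,
) -> list[str]:
    """Split (or pass through) documents, then apply preprocessing in order."""
    if split_documents:
        passages = document_splitter_fn(collection, max_document_length)
    else:
        passages = list(collection)
    if preprocessing_fn is not None:
        fns = preprocessing_fn if isinstance(preprocessing_fn, list) else [preprocessing_fn]
        for fn in fns:
            passages = [fn(p) for p in passages]
    return passages
```

Note that preprocessing runs on every passage after splitting, and a list of functions is applied in order, which matches the Optional[Union[Callable, list[Callable]]] type in the contract above.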
Usage Examples
Basic Indexing
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Index a collection of documents
my_documents = [
    "ColBERT is a fast and accurate retrieval model.",
    "It uses late interaction for efficient search.",
    "RAGatouille makes ColBERT easy to use.",
]
index_path = RAG.index(
    collection=my_documents,
    index_name="my_index",
)
print(f"Index built at: {index_path}")
Indexing with IDs and Metadata
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
documents = ["First document.", "Second document.", "Third document."]
doc_ids = ["doc1", "doc2", "doc3"]
metadatas = [
    {"source": "wiki", "date": "2024-01-01"},
    {"source": "blog", "date": "2024-02-01"},
    {"source": "paper", "date": "2024-03-01"},
]
index_path = RAG.index(
    collection=documents,
    document_ids=doc_ids,
    document_metadatas=metadatas,
    index_name="docs_with_metadata",
    max_document_length=512,
    bsize=64,
)
Indexing Without Document Splitting
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Pre-chunked passages — skip internal splitting
passages = ["Short passage one.", "Short passage two."]
index_path = RAG.index(
    collection=passages,
    index_name="pre_chunked",
    split_documents=False,
)
Related Pages
Implements Principle
Requires Environment
- Environment:AnswerDotAI_RAGatouille_Python_ColBERT_Dependencies
- Environment:AnswerDotAI_RAGatouille_GPU_CUDA_Runtime