Implementation:AnswerDotAI RAGatouille RAGPretrainedModel Index
| Knowledge Sources | |
|---|---|
| Domains | NLP, Information_Retrieval, Indexing |
| Last Updated | 2026-02-12 12:00 GMT |
Overview
Concrete tool, provided by the RAGatouille library, for building a PLAID document index from a text collection.
Description
The RAGPretrainedModel.index() method is the primary API for building a searchable document index. It orchestrates the full indexing pipeline: corpus processing (splitting documents into passages via CorpusProcessor), passage-to-document ID mapping, and delegation to ColBERT.index(), which constructs the PLAID index via ModelIndexFactory. The method supports configurable document splitting, custom preprocessing functions, batch size control, and a choice between PyTorch KMeans (the default for collections under 75k documents) and FAISS for centroid computation.
The delegation chain is:
- RAGPretrainedModel.index() → processes corpus, creates ID mappings
- ColBERT.index() → configures index path, delegates to ModelIndexFactory
- PLAIDModelIndex.build() → runs the colbert-ai Indexer with monkey-patched KMeans
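The chain above can be sketched with illustrative stubs. The function bodies below are simplified placeholders, not the library's actual code: real corpus processing uses sentence-aware splitting, and the index path shown is only an illustration of the on-disk layout.

```python
# Illustrative sketch of the index() delegation chain.
# These stubs mirror the shape of the pipeline only; the real classes
# (CorpusProcessor, ColBERT, PLAIDModelIndex) do far more work.
import uuid


def process_corpus(collection, document_ids=None, max_document_length=256):
    """Split each document into passages and map each passage ID back to its document ID."""
    doc_ids = document_ids or [str(uuid.uuid4()) for _ in collection]
    passages, pid_to_docid = [], {}
    for doc, doc_id in zip(collection, doc_ids):
        # Stand-in for CorpusProcessor / llama_index_sentence_splitter:
        # naive fixed-size chunking by characters instead of sentences.
        chunks = [doc[i:i + max_document_length]
                  for i in range(0, len(doc), max_document_length)] or [""]
        for chunk in chunks:
            pid_to_docid[len(passages)] = doc_id
            passages.append(chunk)
    return passages, pid_to_docid


def build_plaid_index(passages, index_name):
    """Stand-in for ColBERT.index() -> PLAIDModelIndex.build(); returns a mock index path."""
    return f".ragatouille/colbert/indexes/{index_name}"


def index(collection, document_ids=None, index_name="my_index", max_document_length=256):
    """Stand-in for RAGPretrainedModel.index(): process corpus, map IDs, delegate."""
    passages, pid_to_docid = process_corpus(collection, document_ids, max_document_length)
    return build_plaid_index(passages, index_name), pid_to_docid
```

The key invariant the sketch preserves is that every passage ID produced by splitting maps back to exactly one original document ID, which is what lets search results be aggregated per document later.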
Usage
Use this method after loading a pretrained model to build a persistent index over your document collection. The built index is stored on disk and can be loaded later with RAGPretrainedModel.from_index().
Code Reference
Source Location
- Repository: RAGatouille
- File: ragatouille/RAGPretrainedModel.py
- Lines: L171-220
Signature
def index(
    self,
    collection: list[str],
    document_ids: Union[TypeVar("T"), List[TypeVar("T")]] = None,
    document_metadatas: Optional[list[dict]] = None,
    index_name: str = None,
    overwrite_index: Union[bool, str] = True,
    max_document_length: int = 256,
    split_documents: bool = True,
    document_splitter_fn: Optional[Callable] = llama_index_sentence_splitter,
    preprocessing_fn: Optional[Union[Callable, list[Callable]]] = None,
    bsize: int = 32,
    use_faiss: bool = False,
) -> str:
    """Build an index from a list of documents.

    Parameters:
        collection: The collection of documents to index.
        document_ids: Optional list of document IDs.
        document_metadatas: Optional list of metadata dicts.
        index_name: Name of the index to build.
        overwrite_index: Whether to overwrite existing index.
        max_document_length: Maximum passage length (256 default).
        split_documents: Whether to split documents into chunks.
        document_splitter_fn: Splitter function (default: llama_index_sentence_splitter).
        preprocessing_fn: Optional preprocessing function(s).
        bsize: Batch size for encoding passages (32 default).
        use_faiss: Use FAISS instead of PyTorch KMeans.

    Returns:
        str: Path to the built index directory.
    """
Import
from ragatouille import RAGPretrainedModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| collection | list[str] | Yes | List of document strings to index |
| document_ids | Union[T, List[T]] | No | Optional document IDs. Auto-generated UUIDs if not provided |
| document_metadatas | Optional[list[dict]] | No | Optional metadata dicts, one per document |
| index_name | str | No | Name for the index. Auto-generated if not provided |
| overwrite_index | Union[bool, str] | No | Whether to overwrite existing index (default True) |
| max_document_length | int | No | Maximum chunk length in tokens (default 256) |
| split_documents | bool | No | Whether to split documents into chunks (default True) |
| document_splitter_fn | Optional[Callable] | No | Splitter function (default: llama_index_sentence_splitter) |
| preprocessing_fn | Optional[Union[Callable, list[Callable]]] | No | Optional preprocessing function(s) |
| bsize | int | No | Encoding batch size (default 32) |
| use_faiss | bool | No | Use FAISS for KMeans (default False) |
Outputs
| Name | Type | Description |
|---|---|---|
| return | str | Path to the built index directory on disk |
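How split_documents, document_splitter_fn, and preprocessing_fn compose can be sketched as a small pipeline. The signatures below are simplified assumptions (a splitter taking documents plus a max length, and preprocessing functions mapping str to str); the library's real default splitter is llama_index_sentence_splitter, which splits on sentence boundaries rather than characters:

```python
# Sketch of how the splitting/preprocessing inputs compose before encoding.
# Signatures are simplified assumptions, not the library's real interfaces.
from typing import Callable, Optional, Union


def naive_splitter(documents: list[str], max_length: int) -> list[str]:
    """Placeholder for llama_index_sentence_splitter: fixed-size character chunks."""
    return [doc[i:i + max_length]
            for doc in documents
            for i in range(0, len(doc), max_length)]


def prepare_passages(
    collection: list[str],
    split_documents: bool = True,
    document_splitter_fn: Callable = naive_splitter,
    preprocessing_fn: Optional[Union[Callable, list[Callable]]] = None,
    max_document_length: int = 256,
) -> list[str]:
    """Split (or pass through) documents, then apply preprocessing in order."""
    if split_documents:
        passages = document_splitter_fn(collection, max_document_length)
    else:
        passages = list(collection)
    if preprocessing_fn is not None:
        fns = preprocessing_fn if isinstance(preprocessing_fn, list) else [preprocessing_fn]
        for fn in fns:
            passages = [fn(p) for p in passages]
    return passages
```

Note that preprocessing runs on every passage after splitting, and a list of functions is applied in order, which matches the Optional[Union[Callable, list[Callable]]] type in the contract above.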
Usage Examples
Basic Indexing
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Index a collection of documents
my_documents = [
    "ColBERT is a fast and accurate retrieval model.",
    "It uses late interaction for efficient search.",
    "RAGatouille makes ColBERT easy to use.",
]
index_path = RAG.index(
    collection=my_documents,
    index_name="my_index",
)
print(f"Index built at: {index_path}")
Indexing with IDs and Metadata
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
documents = ["First document.", "Second document.", "Third document."]
doc_ids = ["doc1", "doc2", "doc3"]
metadatas = [
    {"source": "wiki", "date": "2024-01-01"},
    {"source": "blog", "date": "2024-02-01"},
    {"source": "paper", "date": "2024-03-01"},
]
index_path = RAG.index(
    collection=documents,
    document_ids=doc_ids,
    document_metadatas=metadatas,
    index_name="docs_with_metadata",
    max_document_length=512,
    bsize=64,
)
Indexing Without Document Splitting
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Pre-chunked passages — skip internal splitting
passages = ["Short passage one.", "Short passage two."]
index_path = RAG.index(
    collection=passages,
    index_name="pre_chunked",
    split_documents=False,
)
Related Pages
Implements Principle
Requires Environment
- Environment:AnswerDotAI_RAGatouille_Python_ColBERT_Dependencies
- Environment:AnswerDotAI_RAGatouille_GPU_CUDA_Runtime