
Heuristic:AnswerDotAI RAGatouille Collection Size Index Tuning

From Leeroopedia
Knowledge Sources
Domains Indexing, Optimization, Information_Retrieval
Last Updated 2026-02-12 12:00 GMT

Overview

An adaptive index configuration that adjusts `nbits` compression and the number of KMeans iterations based on collection size, balancing index quality against build speed.

Description

The PLAID index build process in RAGatouille automatically tunes two critical parameters based on collection size: the nbits quantization level (how many bits are used to compress residual vectors) and the kmeans_niters (number of KMeans clustering iterations). Smaller collections use higher precision (4-bit) and more iterations for better quality, while larger collections use lower precision (2-bit) and fewer iterations for faster builds without meaningful quality loss.

Usage

Use this heuristic to understand why RAGatouille behaves differently across collection sizes, and as a reference point when manually tuning `nbits` via RAGTrainer.train(). It is especially relevant when deciding whether to split a very large corpus into multiple indexes.
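For the corpus-splitting decision, one illustrative approach is to shard the document list so each index stays below the 10,000-document threshold and therefore keeps 4-bit precision. The helper below is a hypothetical sketch, not part of RAGatouille's API:

```python
# Hypothetical helper for the splitting decision discussed above; sharding so
# each index stays under the 10,000-document nbits=4 threshold is an
# illustration, not something RAGatouille does automatically.
def shard_collection(documents: list, max_per_index: int = 9_999) -> list:
    """Split a document list into shards small enough to keep nbits=4."""
    return [
        documents[i : i + max_per_index]
        for i in range(0, len(documents), max_per_index)
    ]

docs = [f"doc-{i}" for i in range(25_000)]
shards = shard_collection(docs)
print(len(shards), [len(s) for s in shards])
```

Note that splitting means querying each shard separately and merging results by score, so the precision gain from `nbits=4` must be weighed against the extra query-time cost.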

The Insight (Rule of Thumb)

  • nbits:
    • Collection < 10,000 documents → `nbits=4` (higher precision)
    • Collection >= 10,000 documents → `nbits=2` (standard compression)
  • kmeans_niters:
    • Collection > 100,000 documents → `kmeans_niters=4` (fast)
    • Collection > 50,000 documents → `kmeans_niters=10` (moderate)
    • Collection <= 50,000 documents → `kmeans_niters=20` (thorough)
  • Trade-off: a higher `nbits` yields better retrieval precision but a larger index; more KMeans iterations yield better centroid quality but a slower index build.
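The rule of thumb above can be written as a small standalone function. This is a sketch mirroring the heuristic; the function name `choose_index_params` is ours, not part of RAGatouille's API:

```python
# Standalone sketch of the collection-size heuristic; choose_index_params is
# a hypothetical name, not a RAGatouille function.
def choose_index_params(num_documents: int) -> dict:
    """Mirror RAGatouille's collection-size heuristic for PLAID index builds."""
    # Smaller collections get higher-precision 4-bit residual compression.
    nbits = 4 if num_documents < 10_000 else 2
    # Larger collections get fewer KMeans iterations for faster builds.
    if num_documents > 100_000:
        kmeans_niters = 4
    elif num_documents > 50_000:
        kmeans_niters = 10
    else:
        kmeans_niters = 20
    return {"nbits": nbits, "kmeans_niters": kmeans_niters}

print(choose_index_params(5_000))    # small corpus: high precision, thorough
print(choose_index_params(500_000))  # large corpus: compressed, fast build
```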

Reasoning

Small collections have fewer token embeddings, so the KMeans centroids can be computed precisely with more iterations and the quantization can afford higher precision without index size being prohibitive. For large collections, 2-bit quantization is sufficient because the centroids are well-distributed and the inverted index structure provides adequate precision. The ColBERTv2 paper demonstrates that 2-bit compression preserves retrieval quality for most use cases.
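To see the storage side of the trade-off concretely: each token's compressed residual occupies roughly dim × nbits / 8 bytes. The back-of-the-envelope calculation below assumes ColBERTv2's default 128-dimensional token embeddings and ignores centroid IDs and other index overhead:

```python
# Back-of-the-envelope residual storage cost, assuming ColBERTv2's default
# 128-dim token embeddings; centroid IDs and index overhead are ignored.
DIM = 128

def residual_bytes_per_token(nbits: int, dim: int = DIM) -> int:
    # Each dimension of the residual vector is quantized to `nbits` bits.
    return dim * nbits // 8

# 4-bit residuals take twice the space of 2-bit residuals.
print(residual_bytes_per_token(4))  # 64 bytes per token
print(residual_bytes_per_token(2))  # 32 bytes per token

# For a corpus with ~100M token embeddings, that doubling is the difference
# between roughly 3.2 GB and 6.4 GB of residual storage.
tokens = 100_000_000
print(tokens * residual_bytes_per_token(2) / 1e9, "GB at nbits=2")
```

This is why 4-bit precision is reserved for small collections: the absolute size cost is negligible there, while at scale it doubles a multi-gigabyte index.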

Code evidence from `ragatouille/models/index.py:168-183`:

nbits = 2
if len(collection) < 10000:
    nbits = 4
self.config = ColBERTConfig.from_existing(
    self.config, ColBERTConfig(nbits=nbits, index_bsize=bsize)
)

if len(collection) > 100000:
    self.config.kmeans_niters = 4
elif len(collection) > 50000:
    self.config.kmeans_niters = 10
else:
    self.config.kmeans_niters = 20
