Principle:Ucbepic Docetl Top K Document Retrieval
| Knowledge Sources | |
|---|---|
| Domains | LLM_Data_Processing, Information_Retrieval |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Multi-strategy top-K retrieval selects the K most relevant documents from a collection using configurable scoring strategies including embedding similarity, BM25 full-text search, and LLM-based comparison ranking.
Theoretical Basis
Retrieving the most relevant documents from a collection is a fundamental information retrieval task that recurs throughout data processing pipelines: finding the records most similar to a query, selecting the best candidates for further analysis, or surfacing the most relevant context for downstream LLM operations. Retrieval strategies differ in their trade-offs between speed, cost, and quality.
DocETL's TopK operation provides a unified interface over three retrieval methods. The embedding method computes cosine similarity between the query embedding and each document embedding, then selects the K highest-scoring documents. This approach captures semantic similarity and works well when the query and the documents express the same concepts in different words. The FTS (full-text search) method uses BM25 scoring (with a TF-IDF fallback) for keyword-based retrieval and excels when exact term matching matters. The LLM-compare method delegates to the rank operation, which performs LLM-powered pairwise comparison; it provides the highest-quality results at the highest cost.
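The two scoring-based methods can be illustrated with a minimal, dependency-free sketch. It is not DocETL's implementation: the function names, the whitespace tokenizer, and the BM25 parameters `k1` and `b` are assumptions chosen for illustration, and a real pipeline would obtain embeddings from a model rather than from hand-written vectors.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_by_embedding(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    # Highest score first; ties are broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n if n else 0.0
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


def top_k_by_bm25(query, docs, k):
    """Return the indices of the k documents with the highest BM25 score."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: (-scores[i], i))
    return order[:k]
```

The embedding path rewards vectors that point in a similar direction regardless of the words used, whereas the BM25 path gives a document a non-zero score only when it contains at least one query term.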
The TopK operation is implemented as a facade that delegates to either SampleOperation (for the embedding and FTS methods) or RankOperation (for LLM-compare). This delegation avoids code duplication while exposing a simpler, more intuitive API for the common retrieval use case. The embedding and FTS methods support stratified retrieval through a stratify_key parameter, which applies top-K selection within each group. The LLM-compare method runs the full sliding-window ranking pipeline and returns only the top K results, so callers get the quality of LLM-based ranking while only the top results are materialized.
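The delegation and stratification behavior described above can be sketched as follows. This is a simplified illustration, not DocETL's code: `sample_top_k`, `rank_top_k`, `top_k`, the method strings, and the `score_fn` and `compare_fn` parameters are hypothetical stand-ins, the ranking backend here is a plain comparison sort rather than a sliding-window LLM pipeline, and rejecting `stratify_key` for LLM-compare is an assumption about how the restriction might be enforced.

```python
from collections import defaultdict
from functools import cmp_to_key


def sample_top_k(docs, k, score_fn, stratify_key=None):
    """Hypothetical stand-in for the sampling backend (embedding and FTS)."""
    groups = defaultdict(list)
    for doc in docs:
        # Without a stratify_key, every document lands in one group.
        group = doc.get(stratify_key) if stratify_key else None
        groups[group].append(doc)
    selected = []
    for members in groups.values():
        ranked = sorted(members, key=score_fn, reverse=True)
        selected.extend(ranked[:k])  # top k within each group
    return selected


def rank_top_k(docs, k, compare_fn):
    """Hypothetical stand-in for the ranking backend (LLM-compare).

    compare_fn(a, b) returns a negative number when a should rank ahead of b.
    """
    return sorted(docs, key=cmp_to_key(compare_fn))[:k]


def top_k(docs, k, method, *, score_fn=None, compare_fn=None, stratify_key=None):
    """Facade: choose a backend based on the requested method."""
    if method in ("embedding", "fts"):
        if score_fn is None:
            raise ValueError("score_fn is required for embedding and fts")
        return sample_top_k(docs, k, score_fn, stratify_key)
    if method == "llm_compare":
        if stratify_key is not None:
            # Assumption: stratification applies to embedding and FTS only.
            raise ValueError("stratify_key is not supported for llm_compare")
        if compare_fn is None:
            raise ValueError("compare_fn is required for llm_compare")
        return rank_top_k(docs, k, compare_fn)
    raise ValueError(f"unknown method: {method}")
```

With `k=1` and a `stratify_key`, the scoring path returns one document per group, so the result can hold more than K documents overall. Keeping the dispatch in a single function means callers depend only on the facade, not on either backend.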
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Three retrieval methods | Embedding similarity, BM25 full-text search, and LLM-based comparison | Provides a cost-quality spectrum: embedding is fast and semantic, FTS is fast and lexical, LLM-compare is expensive but highest quality |
| Facade architecture | Delegates to SampleOperation or RankOperation based on method | Reuses existing well-tested implementations; avoids code duplication while providing a simpler API for retrieval use cases |
| Stratification support | Optional stratify_key for per-group top-K retrieval (embedding and FTS only) | Ensures diverse results across groups; useful when retrieving top items from each category or partition |