Principle:Ucbepic Docetl Top K Document Retrieval

Knowledge Sources	Ucbepic_Docetl
Domains	LLM_Data_Processing, Information_Retrieval
Last Updated	2026-02-08 00:00 GMT

Overview

Multi-strategy top-K retrieval selects the K most relevant documents from a collection using one of three configurable scoring strategies: embedding similarity, BM25 full-text search, or LLM-based comparison ranking.

Theoretical Basis

Retrieving the most relevant documents from a collection is a fundamental information retrieval task that appears throughout data processing pipelines: finding the records most similar to a query, selecting the best candidates for further analysis, or surfacing the most relevant context for downstream LLM operations. Retrieval strategies differ in how they trade off speed, cost, and quality.

DocETL's TopK operation provides a unified interface over three retrieval methods. The embedding method computes cosine similarity between the query embedding and each document embedding, then selects the K highest-scoring documents. This approach captures semantic similarity and works well when the query and documents express the same concepts in different words. The FTS (full-text search) method uses BM25 scoring (with a TF-IDF fallback) for keyword-based retrieval and excels when exact term matching is important. The LLM-compare method delegates to the rank operation, which performs LLM-powered pairwise comparison; it provides the highest-quality results at the highest cost.
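The two score-based methods described above can be sketched in plain Python. This is an illustrative sketch only, not DocETL's implementation: the function names (`cosine`, `top_k_by_embedding`, `bm25_scores`, `top_k_by_bm25`), the whitespace tokenizer, and the BM25 parameters `k1=1.5` and `b=0.75` are assumptions chosen for the example, and the embeddings are assumed to be precomputed lists of floats.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_embedding(query_vec, doc_vecs, k):
    """Return indices of the k documents most similar to the query embedding."""
    scores = [cosine(query_vec, v) for v in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant.

    Tokenization is a naive lowercase whitespace split (an assumption for
    this sketch); k1 and b are conventional defaults, not DocETL settings.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def top_k_by_bm25(query, docs, k):
    """Return indices of the k documents with the highest BM25 score."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Both functions return document indices rather than the documents themselves, so the caller can look up whichever fields it needs. Note that the lexical scorer treats "cat" and "cats" as different terms, which is exactly the kind of mismatch the embedding method is meant to tolerate.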

The TopK operation is implemented as a facade that delegates to either SampleOperation (for the embedding and FTS methods) or RankOperation (for LLM-compare). This delegation avoids code duplication while giving the common retrieval use case a simpler API than configuring the underlying operations directly. The embedding and FTS methods support stratified retrieval via a stratify_key parameter, which applies top-K selection within each group. The LLM-compare method runs the full sliding-window ranking pipeline and returns only the top K results, so callers get LLM-quality ranking without receiving the full ranked list.
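The delegation described above can be sketched as follows. This is a minimal illustration of the facade pattern, not DocETL's code: `SampleBackend` and `RankBackend` are hypothetical stand-ins for SampleOperation and RankOperation, their `execute(docs, k)` signature is invented for the example, and the method strings `"embedding"`, `"fts"`, and `"llm_compare"` are assumed spellings. The pairwise `better` callable stands in for an LLM comparison.

```python
import functools
from typing import Callable, Dict, List

Doc = Dict[str, object]


class SampleBackend:
    """Hypothetical stand-in for SampleOperation: selects by a numeric score."""

    def __init__(self, score_fn: Callable[[Doc], float]):
        self.score_fn = score_fn

    def execute(self, docs: List[Doc], k: int) -> List[Doc]:
        return sorted(docs, key=self.score_fn, reverse=True)[:k]


class RankBackend:
    """Hypothetical stand-in for RankOperation: orders by pairwise comparison."""

    def __init__(self, better: Callable[[Doc, Doc], bool]):
        self.better = better

    def execute(self, docs: List[Doc], k: int) -> List[Doc]:
        def cmp(a: Doc, b: Doc) -> int:
            if self.better(a, b):
                return -1
            if self.better(b, a):
                return 1
            return 0

        # Rank everything, then keep only the top k.
        return sorted(docs, key=functools.cmp_to_key(cmp))[:k]


class TopKFacade:
    """Routes a top-K request to the backend that matches the chosen method."""

    SCORE_METHODS = ("embedding", "fts")  # assumed method names
    COMPARE_METHODS = ("llm_compare",)    # assumed method name

    def __init__(self, method: str, sample: SampleBackend, rank: RankBackend):
        if method not in self.SCORE_METHODS + self.COMPARE_METHODS:
            raise ValueError(f"unknown method: {method}")
        self.method = method
        self.sample = sample
        self.rank = rank

    def execute(self, docs: List[Doc], k: int) -> List[Doc]:
        if self.method in self.SCORE_METHODS:
            return self.sample.execute(docs, k)
        return self.rank.execute(docs, k)
```

The facade holds no retrieval logic of its own; it only validates the method name and forwards the call, which is what lets the underlying operations be reused unchanged.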

Key Design Decisions

Decision	Choice	Rationale
Three retrieval methods	Embedding similarity, BM25 full-text search, and LLM-based comparison	Provides a cost-quality spectrum: embedding is fast and semantic, FTS is fast and lexical, LLM-compare is expensive but highest quality
Facade architecture	Delegates to SampleOperation or RankOperation based on method	Reuses existing well-tested implementations; avoids code duplication while providing a simpler API for retrieval use cases
Stratification support	Optional stratify_key for per-group top-K retrieval (embedding and FTS only)	Ensures diverse results across groups; useful when retrieving top items from each category or partition
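The stratification decision in the table above amounts to running top-K once per group instead of once over the whole collection. A minimal sketch follows, assuming each document is a dict, the scores have already been computed by an embedding or FTS scorer, and the helper name `stratified_top_k` is hypothetical; only the `stratify_key` parameter name comes from the text.

```python
from collections import defaultdict


def stratified_top_k(docs, scores, k, stratify_key):
    """Return the k highest-scoring documents within each group.

    Groups are defined by the value of doc[stratify_key]. Groups appear in
    the output in the order they are first seen in the input.
    """
    groups = defaultdict(list)
    for doc, score in zip(docs, scores):
        groups[doc[stratify_key]].append((score, doc))

    result = []
    for members in groups.values():
        # Sort by score only, so documents themselves are never compared.
        members.sort(key=lambda pair: pair[0], reverse=True)
        result.extend(doc for _, doc in members[:k])
    return result
```

With this scheme the output can contain up to K documents per group rather than K in total, so a low-scoring group is still represented even when another group dominates the global ranking.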

Related Pages

Implementation:Ucbepic_Docetl_TopKOperation_Execute
