Implementation:Ucbepic Docetl TopKOperation Execute
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Information_Retrieval |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for retrieving the top K documents by relevance using embeddings, full-text search, or LLM-based comparison, provided by DocETL.
Description
The TopKOperation class extends BaseOperation to provide a single interface for top-K document retrieval. It acts as a facade that delegates to an underlying operation according to the chosen method: "embedding" and "fts" delegate to SampleOperation (using top_embedding and top_fts, respectively), while "llm_compare" delegates to RankOperation and returns the top K results. Callers get one purpose-specific API and do not have to choose between the sampling and ranking operations themselves.
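The delegation described above can be sketched as a small dispatcher. This is an illustrative outline only, not DocETL's source: `run_sample`, `run_rank`, and `dispatch_topk` are hypothetical stand-ins for the SampleOperation and RankOperation calls, and slicing the ranked list to its first K items is an assumed reading of "returns the top K results".

```python
# Hypothetical stand-ins for the underlying operations (not DocETL APIs).
# Each returns (documents, cost), mirroring the execute() return shape.
def run_sample(docs: list, sample_method: str, k: int) -> tuple:
    # Pretend the sampler already orders documents best-first.
    return docs[:k], 0.0


def run_rank(docs: list) -> tuple:
    # Pretend the ranker returns every document, best first.
    return list(docs), 0.0


# Map each top-K method to the sampling method it delegates to.
SAMPLE_METHODS = {"embedding": "top_embedding", "fts": "top_fts"}


def dispatch_topk(docs: list, method: str, k: int) -> tuple:
    """Route a top-K request to the sampling or ranking stand-in."""
    if method in SAMPLE_METHODS:
        return run_sample(docs, SAMPLE_METHODS[method], k)
    if method == "llm_compare":
        ranked, cost = run_rank(docs)
        return ranked[:k], cost  # keep only the K best-ranked documents
    raise ValueError(f"Unknown top-K method: {method!r}")
```

Keeping the method-to-delegate mapping in one place is what lets a facade like this expose a single `method` option while the underlying operations stay separate.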
Usage
Use this operation when you need to retrieve the most relevant documents from a collection based on a query. Typical scenarios include finding the most similar documents to a search query, selecting the best candidates for further processing, or implementing retrieval-augmented generation (RAG) pipelines. Choose "embedding" for semantic similarity, "fts" for keyword-based matching, or "llm_compare" for the most accurate but costlier LLM-based ranking.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/operations/topk.py
- Lines: 1-235
Signature
class TopKOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "topk"
        method: Literal["embedding", "fts", "llm_compare"]
        k: Union[int, float] = Field(..., description="Number of items to retrieve")
        keys: list[str] = Field(..., description="Keys to use for similarity matching")
        query: str = Field(..., description="Query string (supports Jinja templates)")
        stratify_key: Union[str, list[str]] | None = Field(None)
        embedding_model: str | None = Field("text-embedding-3-small")
        model: str | None = Field(None, description="LLM model for llm_compare method")
        batch_size: int | None = Field(10)

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...
    def _execute_llm_compare(self, input_data, is_build=False) -> tuple[list[dict], float]: ...
Import
from docetl.operations.topk import TopKOperation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_data | List[Dict] | Yes | Documents to retrieve from |
| method | str | Yes | Retrieval method: "embedding" (semantic), "fts" (full-text), or "llm_compare" (LLM ranking) |
| k | int or float | Yes | Number of items to retrieve (int) or fraction of total (float) |
| keys | List[str] | Yes | Document keys to use for similarity matching or comparison |
| query | str | Yes | Query string or ranking criteria (supports Jinja templates for embedding/fts) |
| stratify_key | str or List[str] | No | Key(s) for stratified retrieval (not supported with llm_compare) |
| embedding_model | str | No | Embedding model (default "text-embedding-3-small", used by embedding and llm_compare) |
| model | str | No | LLM model (required for llm_compare method) |
| batch_size | int | No | Batch size for LLM comparisons in llm_compare (default 10) |
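Because k accepts either a count or a fraction, it can help to see how such a value might be turned into a concrete number of documents. The helper below is a sketch under assumptions: `resolve_k` is a hypothetical name, and the rounding rule (round up, capped at the collection size) is one reasonable choice that the source does not specify.

```python
import math
from typing import Union


def resolve_k(k: Union[int, float], total: int) -> int:
    """Turn an int count or a float fraction into a concrete item count.

    Assumption: floats are fractions in (0, 1] and are rounded up.
    """
    if isinstance(k, bool):
        raise TypeError("k must be an int or a float, not a bool")
    if isinstance(k, float):
        if not 0 < k <= 1:
            raise ValueError("a float k must be a fraction in (0, 1]")
        count = math.ceil(k * total)
    else:
        if k < 1:
            raise ValueError("an int k must be at least 1")
        count = k
    # Never ask for more documents than the collection holds.
    return min(count, total)
```

With this rule, `resolve_k(0.5, 10)` asks for half of a ten-document collection, while `resolve_k(10, 4)` is capped at the four documents available.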
Outputs
| Name | Type | Description |
|---|---|---|
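| output | Tuple[List[Dict], float] | Top K documents and total cost. For embedding/fts, each document also carries rank and score fields (named `_{name}_rank` and `_{name}_score` after the operation name in the Python example) |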
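To make the output shape concrete, the sketch below scores documents against a query vector and attaches rank and score fields. It is a toy illustration, not DocETL's implementation: the vectors are hand-written rather than produced by an embedding model, `top_k_by_cosine` is a hypothetical helper, and the short `_rank`/`_score` field names are placeholders.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(docs: list, vectors: list, query_vec: list, k: int) -> tuple:
    """Return the k most similar documents (copied, with rank/score) and a cost."""
    scored = sorted(
        zip(docs, (cosine(v, query_vec) for v in vectors)),
        key=lambda pair: pair[1],
        reverse=True,  # highest similarity first
    )
    results = []
    for rank, (doc, score) in enumerate(scored[:k], start=1):
        out = dict(doc)        # copy so the input documents are not mutated
        out["_rank"] = rank    # 1-based position in the result list
        out["_score"] = score  # similarity to the query
        results.append(out)
    return results, 0.0  # no API calls in this toy version, so cost is zero
```

The return value mirrors the `(documents, cost)` tuple in the table: a list of enriched dictionaries followed by a float cost.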
Usage Examples
# YAML pipeline configuration for embedding-based top-K
operations:
  - name: find_similar
    type: topk
    method: embedding
    k: 10
    keys:
      - title
      - content
    query: "machine learning for drug discovery"
    embedding_model: "text-embedding-3-small"
# YAML pipeline configuration for full-text search top-K
operations:
  - name: keyword_search
    type: topk
    method: fts
    k: 20
    keys:
      - title
      - abstract
    query: "transformer attention mechanism"
# YAML pipeline configuration for LLM-based comparison top-K
operations:
  - name: best_papers
    type: topk
    method: llm_compare
    k: 5
    keys:
      - title
      - abstract
    query: "Select papers with the most novel contributions to NLP"
    model: "gpt-4o-mini"
# Python API usage
from docetl.operations.topk import TopKOperation

config = {
    "name": "retrieve_relevant",
    "type": "topk",
    "method": "embedding",
    "k": 10,
    "keys": ["title", "content"],
    "query": "recent advances in protein folding",
}

# runner, default_model, max_threads, and documents are assumed to be
# defined by the surrounding pipeline code; they are not created here.
topk_op = TopKOperation(runner, config, default_model, max_threads)
top_results, cost = topk_op.execute(documents)
# Results have _retrieve_relevant_rank and _retrieve_relevant_score fields
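Once `execute` returns, downstream code often needs the documents in rank order. The following is a minimal sketch, assuming the field names from the comment above (`_retrieve_relevant_rank`, `_retrieve_relevant_score`); `summarize` is a hypothetical helper and the sample records are invented for illustration.

```python
def summarize(results: list, op_name: str) -> list:
    """Return (rank, title, score) tuples ordered by rank.

    Assumes rank/score fields are named after the operation,
    e.g. "_retrieve_relevant_rank" for op_name="retrieve_relevant".
    """
    rank_key = f"_{op_name}_rank"
    score_key = f"_{op_name}_score"
    ordered = sorted(results, key=lambda doc: doc[rank_key])
    return [
        (doc[rank_key], doc.get("title", ""), round(doc[score_key], 3))
        for doc in ordered
    ]


# Invented sample records shaped like the results described above.
sample_results = [
    {"title": "B", "_retrieve_relevant_rank": 2, "_retrieve_relevant_score": 0.71234},
    {"title": "A", "_retrieve_relevant_rank": 1, "_retrieve_relevant_score": 0.93456},
]

for rank, title, score in summarize(sample_results, "retrieve_relevant"):
    print(rank, title, score)
```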