Implementation:Ucbepic Docetl TopKOperation Execute

Knowledge Sources	Ucbepic_Docetl DocETL Docs
Domains	Data_Processing, Information_Retrieval
Last Updated	2026-02-08 00:00 GMT

Overview

Concrete tool for retrieving the top K documents by relevance using embeddings, full-text search, or LLM-based comparison, provided by DocETL.

Description

The TopKOperation class extends BaseOperation to provide a unified interface for top-K document retrieval. It acts as a facade that delegates to an underlying operation based on the chosen method: for "embedding" and "fts", it delegates to SampleOperation (using top_embedding or top_fts respectively); for "llm_compare", it delegates to RankOperation and returns the top K results. Callers configure a single topk operation and pick a method; whether the work is done by a sampling or a ranking operation is handled internally.
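The delegation described above can be pictured as a small routing table. The following is a minimal sketch, not the code in docetl/operations/topk.py: the names `_DELEGATES` and `resolve_delegate` are hypothetical, and the delegates are represented as plain strings rather than the real classes.

```python
from typing import Optional

# Hypothetical routing table: method name -> (delegate operation, sample method).
# The pairs mirror the description above; the real class wires this up internally.
_DELEGATES = {
    "embedding": ("SampleOperation", "top_embedding"),
    "fts": ("SampleOperation", "top_fts"),
    "llm_compare": ("RankOperation", None),  # ranks everything, then keeps the top K
}


def resolve_delegate(method: str) -> tuple[str, Optional[str]]:
    """Return the (operation, sample method) pair a topk method would route to."""
    try:
        return _DELEGATES[method]
    except KeyError:
        raise ValueError(f"Unsupported topk method: {method!r}") from None


if __name__ == "__main__":
    for name in ("embedding", "fts", "llm_compare"):
        print(name, "->", resolve_delegate(name))
```

Keeping the mapping in one place is what makes the facade cheap to extend: a new retrieval method would only need a new entry and a delegate that honors the same execute contract.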

Usage

Use this operation when you need to retrieve the most relevant documents from a collection based on a query. Typical scenarios include finding the most similar documents to a search query, selecting the best candidates for further processing, or implementing retrieval-augmented generation (RAG) pipelines. Choose "embedding" for semantic similarity, "fts" for keyword-based matching, or "llm_compare" for the most accurate but costlier LLM-based ranking.

Code Reference

Source Location

Repository: Ucbepic_Docetl
File: docetl/operations/topk.py
Lines: 1-235

Signature

class TopKOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "topk"
        method: Literal["embedding", "fts", "llm_compare"]
        k: Union[int, float] = Field(..., description="Number of items to retrieve")
        keys: list[str] = Field(..., description="Keys to use for similarity matching")
        query: str = Field(..., description="Query string (supports Jinja templates)")
        stratify_key: Union[str, list[str]] | None = Field(None)
        embedding_model: str | None = Field("text-embedding-3-small")
        model: str | None = Field(None, description="LLM model for llm_compare method")
        batch_size: int | None = Field(10)

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...

    def _execute_llm_compare(self, input_data, is_build=False) -> tuple[list[dict], float]: ...

Import

from docetl.operations.topk import TopKOperation

I/O Contract

Inputs

Name	Type	Required	Description
input_data	List[Dict]	Yes	Documents to retrieve from
method	str	Yes	Retrieval method: "embedding" (semantic), "fts" (full-text), or "llm_compare" (LLM ranking)
k	int or float	Yes	Number of items to retrieve (int) or fraction of total (float)
keys	List[str]	Yes	Document keys to use for similarity matching or comparison
query	str	Yes	Query string or ranking criteria (supports Jinja templates for embedding/fts)
stratify_key	str or List[str]	No	Key(s) for stratified retrieval (not supported with llm_compare)
embedding_model	str	No	Embedding model (default "text-embedding-3-small", used by embedding and llm_compare)
model	str	No	LLM model (required for llm_compare method)
batch_size	int	No	Batch size for LLM comparisons in llm_compare (default 10)
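The constraints in the table above can be checked before a pipeline runs. The sketch below is illustrative only: `check_topk_config` and `resolve_k` are hypothetical helpers, not part of DocETL, and the rule for turning a fractional k into a count (rounding up, capped at the collection size) is an assumption, since the table does not state how the library rounds.

```python
import math

VALID_METHODS = ("embedding", "fts", "llm_compare")


def check_topk_config(config: dict) -> None:
    """Raise ValueError if the config violates the constraints listed in the table."""
    method = config.get("method")
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    for field in ("k", "keys", "query"):
        if field not in config:
            raise ValueError(f"missing required field: {field}")
    if method == "llm_compare":
        # Per the table: model is required and stratify_key is unsupported here.
        if not config.get("model"):
            raise ValueError("model is required for the llm_compare method")
        if config.get("stratify_key") is not None:
            raise ValueError("stratify_key is not supported with llm_compare")


def resolve_k(k, total: int) -> int:
    """Turn k into an item count: ints are counts, floats are fractions of total.

    Assumption: fractions are rounded up and the result is capped at total.
    """
    count = math.ceil(k * total) if isinstance(k, float) else k
    return max(0, min(count, total))


if __name__ == "__main__":
    check_topk_config(
        {"method": "fts", "k": 20, "keys": ["title", "abstract"], "query": "attention"}
    )
    print(resolve_k(10, 4), resolve_k(0.25, 20))
```

Failing fast on a missing model or an unsupported stratify_key avoids paying for embedding or LLM calls on a configuration that cannot succeed.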

Outputs

Name	Type	Description
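output	Tuple[List[Dict], float]	Top K documents and total cost. For embedding/fts, each document carries rank and score fields (shown as _<operation name>_rank and _<operation name>_score in the Python API example)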

Usage Examples

# YAML pipeline configuration for embedding-based top-K
operations:
  - name: find_similar
    type: topk
    method: embedding
    k: 10
    keys:
      - title
      - content
    query: "machine learning for drug discovery"
    embedding_model: "text-embedding-3-small"

# YAML pipeline configuration for full-text search top-K
operations:
  - name: keyword_search
    type: topk
    method: fts
    k: 20
    keys:
      - title
      - abstract
    query: "transformer attention mechanism"

# YAML pipeline configuration for LLM-based comparison top-K
operations:
  - name: best_papers
    type: topk
    method: llm_compare
    k: 5
    keys:
      - title
      - abstract
    query: "Select papers with the most novel contributions to NLP"
    model: "gpt-4o-mini"

# Python API usage
from docetl.operations.topk import TopKOperation

config = {
    "name": "retrieve_relevant",
    "type": "topk",
    "method": "embedding",
    "k": 10,
    "keys": ["title", "content"],
    "query": "recent advances in protein folding",
}
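# runner, default_model, max_threads, and documents (a list of dicts) are
# assumed to be supplied by the surrounding pipeline code.
topk_op = TopKOperation(runner, config, default_model, max_threads)
# execute() returns the top-K documents together with the total cost
top_results, cost = topk_op.execute(documents)
# Results have _retrieve_relevant_rank and _retrieve_relevant_score fields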

Related Pages

Principle:Ucbepic_Docetl_Top_K_Document_Retrieval
