Implementation:Ucbepic Docetl TopKOperation Execute

From Leeroopedia
Revision as of 17:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ucbepic_Docetl_TopKOperation_Execute.md)


Knowledge Sources
Domains Data_Processing, Information_Retrieval
Last Updated 2026-02-08 00:00 GMT

Overview

Concrete tool for retrieving the top K documents by relevance using embeddings, full-text search, or LLM-based comparison, provided by DocETL.

Description

The TopKOperation class extends BaseOperation to provide a unified interface for top-K document retrieval. It acts as a facade that delegates to the appropriate underlying operation based on the chosen method: for "embedding" and "fts" methods, it delegates to SampleOperation (using top_embedding or top_fts); for "llm_compare", it delegates to RankOperation and returns the top K results. This design provides a clean, purpose-specific API while abstracting away the complexity of choosing between sampling and ranking strategies.
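The delegation described above can be sketched as a small dispatch function. This is an illustrative sketch only, not the DocETL source: `run_sample`, `run_rank`, and `topk_dispatch` are hypothetical stand-ins for the SampleOperation and RankOperation paths, and they only record which path was taken rather than performing any retrieval.

```python
Doc = dict
Result = tuple[list[Doc], float]  # (documents, cost), mirroring execute()


def run_sample(docs: list[Doc], config: dict, sample_method: str) -> Result:
    # Hypothetical stand-in for delegating to SampleOperation with
    # top_embedding or top_fts. It only tags and truncates the documents.
    return [dict(d, _path=sample_method) for d in docs[: config["k"]]], 0.0


def run_rank(docs: list[Doc], config: dict) -> Result:
    # Hypothetical stand-in for delegating to RankOperation and keeping
    # the first k results. A real ranker would reorder the documents here.
    ranked = list(docs)
    return [dict(d, _path="rank") for d in ranked[: config["k"]]], 0.0


def topk_dispatch(docs: list[Doc], config: dict) -> Result:
    # Facade: choose the underlying strategy from the configured method.
    method = config["method"]
    if method == "embedding":
        return run_sample(docs, config, "top_embedding")
    if method == "fts":
        return run_sample(docs, config, "top_fts")
    if method == "llm_compare":
        return run_rank(docs, config)
    raise ValueError(f"Unknown topk method: {method!r}")
```

Keeping the branch in one place means callers only set `method` and never need to know which underlying operation does the work.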

Usage

Use this operation when you need to retrieve the most relevant documents from a collection based on a query. Typical scenarios include finding the documents most similar to a search query, selecting the best candidates for further processing, or implementing retrieval-augmented generation (RAG) pipelines. Choose "embedding" for semantic similarity, "fts" for keyword-based matching, or "llm_compare" for LLM-based ranking, which is the most accurate of the three but also the most expensive.

Code Reference

Source Location

Signature

class TopKOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "topk"
        method: Literal["embedding", "fts", "llm_compare"]
        k: Union[int, float] = Field(..., description="Number of items to retrieve")
        keys: list[str] = Field(..., description="Keys to use for similarity matching")
        query: str = Field(..., description="Query string (supports Jinja templates)")
        stratify_key: Union[str, list[str]] | None = Field(None)
        embedding_model: str | None = Field("text-embedding-3-small")
        model: str | None = Field(None, description="LLM model for llm_compare method")
        batch_size: int | None = Field(10)

    def execute(self, input_data: list[dict], is_build: bool = False) -> tuple[list[dict], float]: ...

    def _execute_llm_compare(self, input_data, is_build=False) -> tuple[list[dict], float]: ...

Import

from docetl.operations.topk import TopKOperation

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| input_data | List[Dict] | Yes | Documents to retrieve from |
| method | str | Yes | Retrieval method: "embedding" (semantic), "fts" (full-text), or "llm_compare" (LLM ranking) |
| k | int or float | Yes | Number of items to retrieve (int) or fraction of total (float) |
| keys | List[str] | Yes | Document keys to use for similarity matching or comparison |
| query | str | Yes | Query string or ranking criteria (supports Jinja templates for embedding/fts) |
| stratify_key | str or List[str] | No | Key(s) for stratified retrieval (not supported with llm_compare) |
| embedding_model | str | No | Embedding model (default "text-embedding-3-small", used by embedding and llm_compare) |
| model | str | No | LLM model (required for llm_compare method) |
| batch_size | int | No | Batch size for LLM comparisons in llm_compare (default 10) |
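The constraints in the inputs table can be checked before a pipeline runs. The sketch below is a hypothetical pre-flight helper, not part of DocETL: `check_topk_config` and `resolve_k` are invented names, and the way a fractional `k` is turned into a count (truncating `k * total`, then capping at the collection size) is an assumption, since this page does not state the rounding rule.

```python
VALID_METHODS = {"embedding", "fts", "llm_compare"}


def check_topk_config(config: dict) -> list[str]:
    """Return a list of problems found in a topk config (empty if none)."""
    problems = []
    # Fields marked "Yes" in the Required column of the inputs table.
    for field in ("method", "k", "keys", "query"):
        if field not in config:
            problems.append(f"missing required field: {field}")
    method = config.get("method")
    if method is not None and method not in VALID_METHODS:
        problems.append(f"unknown method: {method}")
    if method == "llm_compare":
        # The table lists model as required for llm_compare.
        if not config.get("model"):
            problems.append("llm_compare requires model")
        # The table lists stratify_key as unsupported with llm_compare.
        if config.get("stratify_key") is not None:
            problems.append("stratify_key is not supported with llm_compare")
    return problems


def resolve_k(k, total: int) -> int:
    """Turn an int count or float fraction into a document count.

    Assumption: a float is a fraction of `total` and is truncated.
    """
    count = int(k * total) if isinstance(k, float) else k
    return max(0, min(count, total))
```

For example, under these assumptions `resolve_k(0.5, 20)` yields 10 documents, while `resolve_k(10, 4)` is capped at the 4 documents available.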

Outputs

| Name | Type | Description |
|------|------|-------------|
| output | Tuple[List[Dict], float] | Top K documents (with _rank and _score fields for embedding/fts) and total cost |

Usage Examples

# YAML pipeline configuration for embedding-based top-K
operations:
  - name: find_similar
    type: topk
    method: embedding
    k: 10
    keys:
      - title
      - content
    query: "machine learning for drug discovery"
    embedding_model: "text-embedding-3-small"

# YAML pipeline configuration for full-text search top-K
operations:
  - name: keyword_search
    type: topk
    method: fts
    k: 20
    keys:
      - title
      - abstract
    query: "transformer attention mechanism"

# YAML pipeline configuration for LLM-based comparison top-K
operations:
  - name: best_papers
    type: topk
    method: llm_compare
    k: 5
    keys:
      - title
      - abstract
    query: "Select papers with the most novel contributions to NLP"
    model: "gpt-4o-mini"

# Python API usage
from docetl.operations.topk import TopKOperation

config = {
    "name": "retrieve_relevant",
    "type": "topk",
    "method": "embedding",
    "k": 10,
    "keys": ["title", "content"],
    "query": "recent advances in protein folding",
}
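# runner, default_model, max_threads, and documents are not defined in this
# snippet; they are expected to come from the surrounding pipeline setup.
topk_op = TopKOperation(runner, config, default_model, max_threads)
top_results, cost = topk_op.execute(documents)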
# Results have _retrieve_relevant_rank and _retrieve_relevant_score fields
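As a follow-up to the Python example, the sketch below shows one way the returned documents might be read back in rank order. It assumes the field naming given in the comment above (`_<operation name>_rank` and `_<operation name>_score`) and that lower rank values come first. `order_by_rank` and `mock_results` are hypothetical, and the scores are made-up placeholders rather than real output.

```python
def order_by_rank(results: list[dict], op_name: str) -> list[tuple]:
    # Field names follow the pattern from the usage example's comment;
    # treat the pattern as an assumption for other operation names.
    rank_key = f"_{op_name}_rank"
    score_key = f"_{op_name}_score"
    ordered = sorted(results, key=lambda doc: doc[rank_key])
    return [(doc[rank_key], doc[score_key], doc.get("title")) for doc in ordered]


# Placeholder documents standing in for the output of execute().
mock_results = [
    {"title": "B", "_retrieve_relevant_rank": 2, "_retrieve_relevant_score": 0.84},
    {"title": "A", "_retrieve_relevant_rank": 1, "_retrieve_relevant_score": 0.91},
]

for rank, score, title in order_by_rank(mock_results, "retrieve_relevant"):
    print(rank, score, title)
```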
