Implementation:Ucbepic Docetl RankOperation Execute
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Document_Ranking |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for ranking documents by quality or relevance using multiple LLM-based and embedding-based evaluation strategies, provided by DocETL.
Description
The RankOperation class extends BaseOperation to rank documents according to user-defined criteria and a direction (ascending or descending). It works in two phases. An initial ordering phase uses one of three methods: embedding similarity, Likert-scale LLM ratings, or calibrated embeddings. A refinement phase then applies "picky" sliding windows, in which the LLM selects the top items from each of a series of progressively positioned windows. Each document receives a _rank field indicating its position in the final ordering.
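The refinement phase described above can be illustrated with a toy sketch. This is not DocETL's implementation: the function names, the tail-to-head window movement, and the `score`-based picker that stands in for the LLM call are all assumptions made for illustration.

```python
from typing import Callable


def picky_window_refine(
    docs: list[dict],
    pick_top: Callable[[list[dict], int], list[int]],
    window_size: int = 4,
    num_top: int = 2,
    step: int = 2,
) -> list[dict]:
    """Slide a window from the tail of the list toward the head (an assumed
    direction). In each window, the items chosen by `pick_top` move to the
    front of that window, so strong items bubble up across windows."""
    ordered = list(docs)
    start = max(len(ordered) - window_size, 0)
    while True:
        window = ordered[start:start + window_size]
        chosen = pick_top(window, num_top)  # indices into `window`, best first
        chosen_set = set(chosen)
        rest = [i for i in range(len(window)) if i not in chosen_set]
        ordered[start:start + window_size] = [window[i] for i in chosen + rest]
        if start == 0:
            break
        start = max(start - step, 0)
    # Attach a 1-based position, mirroring the _rank field described above.
    return [{**doc, "_rank": pos} for pos, doc in enumerate(ordered, start=1)]


def toy_picker(window: list[dict], n: int) -> list[int]:
    """Hypothetical stand-in for the LLM: prefer documents with a higher 'score'."""
    idx = sorted(range(len(window)), key=lambda i: window[i]["score"], reverse=True)
    return idx[:n]
```

With overlapping windows, an item picked in one window is reconsidered in the next, which is what lets a strong document that starts near the tail climb to the head in a handful of calls.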
Usage
Use this operation when you need to order documents by subjective quality, relevance, or any criterion that requires semantic understanding. Typical scenarios include ranking search results by relevance, prioritizing support tickets by urgency, ordering research papers by novelty, or selecting the best candidates from a pool based on complex criteria.
Code Reference
Source Location
- Repository: Ucbepic_Docetl
- File: docetl/operations/rank.py
- Lines: 1-1084
Signature
class RankOperation(BaseOperation):
    class schema(BaseOperation.schema):
        type: str = "order"
        prompt: str
        input_keys: list[str] = Field(default_factory=list)
        direction: Literal["asc", "desc"]
        model: str | None = None
        embedding_model: str | None = None
        batch_size: int = Field(10, gt=0)
        initial_ordering_method: Literal["embedding", "likert", "calibrated_embedding"] = "embedding"
        k: int | None = Field(None, gt=0)
        rerank_call_budget: int = Field(100, gt=0)
        num_top_items_per_window: int = Field(3, gt=0)
        overlap_fraction: float = Field(0.5, ge=0, le=1)
        timeout: int | None = Field(None, gt=0)
        num_calibration_docs: int = Field(10, gt=0)
        verbose: bool = False
        litellm_completion_kwargs: dict[str, Any] = Field(default_factory=dict)

    def _batch_rank_documents(self, batch, criteria, direction, model, ...) -> tuple[list[int], float]: ...
    def _execute_comparison_qurk(self, input_data, sample=False) -> tuple[list[dict], float]: ...
    def _execute_rating_embedding_qurk(self, input_data) -> tuple[list[dict], float]: ...
    def _execute_sliding_window_qurk(self, input_data, ...) -> tuple[list[dict], float]: ...
    def _execute_likert_rating_qurk(self, input_data) -> tuple[list[dict], float]: ...
    def _execute_picky_window(self, window_docs, num_top_items) -> list[int]: ...
    def _execute_calibrated_embedding_sort(self, input_data) -> tuple[list[dict], float]: ...
    def execute(self, input_data: list[dict]) -> tuple[list[dict], float]: ...
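The field constraints in the schema above (required fields, allowed literals, `gt=0`, and the `[0, 1]` range) can be mirrored in a standard-library check, which may be handy for linting a config before it reaches a pipeline. The helper name `check_rank_config` is hypothetical and the function is only a sketch; the real class presumably relies on its own schema validation.

```python
from typing import Any

# Defaults copied from the schema above; "prompt" and "direction" have none.
_DEFAULTS: dict[str, Any] = {
    "input_keys": [],
    "model": None,
    "embedding_model": None,
    "batch_size": 10,
    "initial_ordering_method": "embedding",
    "k": None,
    "rerank_call_budget": 100,
    "num_top_items_per_window": 3,
    "overlap_fraction": 0.5,
    "timeout": None,
    "num_calibration_docs": 10,
    "verbose": False,
}
# Fields declared with gt=0 in the schema (None is allowed where the type permits it).
_POSITIVE = (
    "batch_size", "k", "rerank_call_budget",
    "num_top_items_per_window", "timeout", "num_calibration_docs",
)
_METHODS = ("embedding", "likert", "calibrated_embedding")


def check_rank_config(config: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: merge defaults and raise ValueError on violations."""
    for required in ("prompt", "direction"):
        if required not in config:
            raise ValueError(f"missing required field: {required}")
    merged = {**_DEFAULTS, **config}
    merged["input_keys"] = list(merged["input_keys"])  # do not share the default list
    if merged["direction"] not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    if merged["initial_ordering_method"] not in _METHODS:
        raise ValueError(f"initial_ordering_method must be one of {_METHODS}")
    for name in _POSITIVE:
        value = merged[name]
        if value is not None and value <= 0:
            raise ValueError(f"{name} must be greater than 0")
    if not 0 <= merged["overlap_fraction"] <= 1:
        raise ValueError("overlap_fraction must be between 0 and 1")
    return merged
```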
Import
from docetl.operations.rank import RankOperation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_data | List[Dict] | Yes | Documents to rank |
| prompt | str | Yes | Ranking criteria description used by the LLM |
| direction | str | Yes | Ranking direction: "asc" (ascending) or "desc" (descending) |
| input_keys | List[str] | No | Keys to extract from documents for ranking (defaults to all keys) |
| initial_ordering_method | str | No | Method for initial ordering: "embedding", "likert", or "calibrated_embedding" (default "embedding") |
| k | int | No | Number of top elements to focus on (default: all documents) |
| rerank_call_budget | int | No | Number of LLM calls for sliding window refinement (default 100) |
| batch_size | int | No | Size of each comparison window (default 10) |
| num_top_items_per_window | int | No | Number of items the LLM picks per window (default 3) |
| overlap_fraction | float | No | Overlap fraction between windows (default 0.5) |
| model | str | No | LLM model for comparisons (defaults to pipeline default) |
| embedding_model | str | No | Model for embedding-based initial ordering |
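To build intuition for how batch_size and overlap_fraction interact, the sketch below computes window start offsets under one plausible interpretation: each window advances by `batch_size * (1 - overlap_fraction)` items. That step formula and the helper name `window_starts` are assumptions, not taken from the DocETL source.

```python
def window_starts(
    num_docs: int, batch_size: int = 10, overlap_fraction: float = 0.5
) -> list[int]:
    """Return start offsets of windows covering `num_docs` items.

    Assumed layout: consecutive windows share roughly
    `overlap_fraction * batch_size` items, and a final window is added
    if needed so the tail of the list is always covered.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be greater than 0")
    if not 0 <= overlap_fraction <= 1:
        raise ValueError("overlap_fraction must be between 0 and 1")
    # Always advance by at least one item so the loop terminates.
    step = max(1, round(batch_size * (1 - overlap_fraction)))
    last_start = max(num_docs - batch_size, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return starts
```

Under this reading, a higher overlap_fraction means more windows for the same list, so it trades LLM calls (bounded by rerank_call_budget) for more chances to compare neighboring items.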
Outputs
| Name | Type | Description |
|---|---|---|
| output | Tuple[List[Dict], float] | Ranked documents (each with a _rank field) and total cost |
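The output contract above, a list of documents carrying _rank plus a cost, can be shown with a toy embedding-similarity ordering. The vectors here are hypothetical precomputed inputs and the cost is fixed at 0.0 because no model is called; the real operation obtains embeddings from embedding_model and reports the actual cost.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def embedding_order(
    docs: list[dict],
    vectors: list[list[float]],
    criteria_vector: list[float],
) -> tuple[list[dict], float]:
    """Order docs by similarity to the criteria vector, most similar first.

    `vectors[i]` is assumed to be the embedding of `docs[i]`.
    """
    scored = sorted(
        zip(docs, vectors),
        key=lambda pair: cosine(pair[1], criteria_vector),
        reverse=True,
    )
    ranked = [{**doc, "_rank": pos} for pos, (doc, _) in enumerate(scored, start=1)]
    return ranked, 0.0  # no model calls in this sketch, so the cost is zero
```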
Usage Examples
# YAML pipeline configuration for ranking
operations:
  - name: rank_papers
    type: order
    prompt: "Rank by novelty and significance of the research contribution"
    input_keys:
      - title
      - abstract
    direction: desc
    initial_ordering_method: likert
    k: 50
    rerank_call_budget: 20
    batch_size: 10
    model: "gpt-4o-mini"
# Python API usage
from docetl.operations.rank import RankOperation

config = {
    "name": "rank_tickets",
    "type": "order",
    "prompt": "Rank by urgency and customer impact",
    "input_keys": ["subject", "description"],
    "direction": "desc",
    "initial_ordering_method": "embedding",
    "k": 20,
    "rerank_call_budget": 50,
}

# runner, default_model, max_threads, and input_data are not defined here;
# they come from the surrounding pipeline setup.
rank_op = RankOperation(runner, config, default_model, max_threads)
ranked_results, cost = rank_op.execute(input_data)
# Each result now has a "_rank" field (1 = highest ranked)
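Once execute returns, downstream code typically only needs the _rank field. A minimal sketch of selecting the best few results, assuming the 1-based convention noted above; the helper name `top_n` is hypothetical:

```python
def top_n(ranked_results: list[dict], n: int) -> list[dict]:
    """Return the n best-ranked documents, best first.

    Sorting by _rank makes the helper safe even if the list was
    reordered after the operation ran.
    """
    kept = (doc for doc in ranked_results if doc["_rank"] <= n)
    return sorted(kept, key=lambda doc: doc["_rank"])
```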