Principle:Ucbepic Docetl LLM Powered Document Ranking

Knowledge Sources	Ucbepic_Docetl
Domains	LLM_Data_Processing, Information_Retrieval
Last Updated	2026-02-08 00:00 GMT

Overview

Prompt-guided document ranking orders documents by criteria written as a natural language prompt. It works in two phases: an initial ordering (via embeddings or LLM Likert ratings), followed by sliding-window LLM refinement.

Theoretical Basis

Ordering a collection of documents by subjective or complex criteria -- such as "most relevant to climate policy" or "most actionable for a product manager" -- cannot be solved by simple sorting on numeric fields. It requires the kind of semantic understanding that an LLM provides. However, having an LLM perform all O(n^2) pairwise comparisons is prohibitively expensive for large collections. DocETL's rank operation draws on ideas from the human-powered sort literature to achieve high-quality rankings with a bounded LLM call budget.
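To make the cost argument concrete, the sketch below counts how many comparisons an exhaustive pairwise approach would need for a few collection sizes. The function name is illustrative only and is not part of DocETL.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct unordered pairs among n documents: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Comparison counts for a few collection sizes.
counts = {n: pairwise_comparisons(n) for n in (10, 100, 1000)}
print(counts)  # {10: 45, 100: 4950, 1000: 499500}
```

At 1,000 documents an exhaustive comparison already needs close to half a million LLM calls, which is why a fixed call budget is attractive.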

The operation proceeds in two phases. The initial ordering phase produces a coarse ranking using one of three methods: (1) embedding similarity to the ranking criteria, which is fast and cheap but imprecise; (2) Likert-scale LLM ratings, where each document is rated 1-7 against the criteria in parallel batches, providing a more nuanced initial ordering; or (3) calibrated embedding sort, which uses a small LLM-ranked sample to calibrate the embedding-based ordering.

The refinement phase then applies a sliding-window approach: windows of configurable size move across the ranking, and within each window the LLM selects the top-K items ("picky windows"). Selected items are promoted to the front of the window, progressively refining the ranking. The total number of LLM calls in the refinement phase is bounded by a configurable budget parameter.
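The refinement phase can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DocETL's implementation: the function and parameter names (`refine_with_picky_windows`, `window_size`, `step`, `top_k`, `call_budget`) are hypothetical, the windows are assumed to slide from the bottom of the ranking toward the top, and a deterministic stand-in replaces the LLM judge.

```python
from typing import Callable, List, Sequence


def refine_with_picky_windows(
    ranking: Sequence[str],
    pick_top_k: Callable[[List[str], int], List[str]],
    window_size: int = 4,
    step: int = 2,
    top_k: int = 2,
    call_budget: int = 10,
) -> List[str]:
    """Refine a coarse ranking with overlapping windows (illustrative sketch)."""
    items = list(ranking)
    calls = 0
    # Assumption: start at the bottom so strong items can move up window by window.
    start = max(len(items) - window_size, 0)
    while calls < call_budget:
        window = items[start:start + window_size]
        picked = pick_top_k(window, top_k)  # one (stand-in) LLM call per window
        calls += 1
        rest = [doc for doc in window if doc not in picked]
        # Promote the picked items to the front of the window.
        items[start:start + window_size] = picked + rest
        if start == 0:
            break
        start = max(start - step, 0)
    return items


# Stand-in for the LLM judge: picks by a hidden "true" score.
TRUE_SCORES = {"d1": 9, "d2": 7, "d3": 8, "d4": 3, "d5": 6, "d6": 5}
judge_calls: List[List[str]] = []


def judge(window: List[str], k: int) -> List[str]:
    judge_calls.append(list(window))
    return sorted(window, key=lambda doc: TRUE_SCORES[doc], reverse=True)[:k]


initial = ["d2", "d1", "d4", "d3", "d6", "d5"]  # coarse initial ordering
refined = refine_with_picky_windows(initial, judge)
print(refined)  # ['d1', 'd3', 'd2', 'd5', 'd4', 'd6']
```

In this toy run two judge calls are enough to put the top four documents in the right order, while the last two remain unresolved. That reflects the trade-off: the call budget is spent near the top of the ranking rather than on a full sort.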

This two-phase design achieves a favorable trade-off: the initial ordering places most documents approximately correctly at low cost, while the sliding window refinement uses expensive LLM calls only where they have the most impact -- disambiguating items that are close in quality. The approach is particularly effective when only the top-K items matter, as the refinement can terminate early once the top positions are stable.

Key Design Decisions

Decision	Choice	Rationale
Initial ordering	Three strategies: embedding similarity, Likert LLM ratings, or calibrated embedding	Provides a cost-quality spectrum; embedding is cheapest, Likert is most accurate, calibrated embedding balances both
Refinement approach	Sliding picky windows with bounded LLM call budget	Concentrates expensive LLM calls where they have the most impact; budget parameter gives direct cost control
Direction support	Configurable ascending or descending ordering	Supports both "best first" and "worst first" use cases with the same underlying algorithm
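As an illustration of the Likert option and of direction support, the sketch below rates documents in parallel on a 1-7 scale and sorts by rating in either direction. The names (`likert_initial_ordering`, `rate`, `direction`, `max_workers`) are hypothetical and do not reflect DocETL's API; the rater is a keyword-counting stand-in for an LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def likert_initial_ordering(
    docs: Sequence[str],
    rate: Callable[[str], int],
    direction: str = "desc",
    max_workers: int = 4,
) -> List[str]:
    """Order documents by a 1-7 rating; ties keep their input order."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    # Rate documents concurrently; pool.map returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        ratings = list(pool.map(rate, docs))
    for rating in ratings:
        if not 1 <= rating <= 7:
            raise ValueError(f"rating {rating} is outside the 1-7 Likert scale")
    order = sorted(
        range(len(docs)),
        key=lambda i: ratings[i],
        reverse=(direction == "desc"),
    )
    return [docs[i] for i in order]


def keyword_rater(doc: str) -> int:
    """Stand-in for an LLM rating: more mentions of 'policy' rate higher."""
    return min(7, 1 + 2 * doc.count("policy"))


docs = ["policy policy policy", "weather", "policy note"]
best_first = likert_initial_ordering(docs, keyword_rater, direction="desc")
worst_first = likert_initial_ordering(docs, keyword_rater, direction="asc")
print(best_first)   # ['policy policy policy', 'policy note', 'weather']
print(worst_first)  # ['weather', 'policy note', 'policy policy policy']
```

Because only the sort direction changes, the same rating pass serves both "best first" and "worst first" requests.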

Related Pages

Implementation:Ucbepic_Docetl_RankOperation_Execute
