Principle:Ucbepic Docetl Blocking Threshold Optimization

Knowledge Sources	Ucbepic_Docetl
Domains	LLM_Data_Processing, Record_Linkage
Last Updated	2026-02-08 00:00 GMT

Overview

Embedding-based blocking reduces pairwise comparison costs in equijoin and resolve operations by using cosine similarity thresholds on embedding vectors to create candidate pairs, avoiding O(n squared) LLM calls.

Theoretical Basis

In record linkage and entity resolution tasks, naively comparing every pair of records requires O(n squared) comparisons. When each comparison involves an LLM call, this becomes prohibitively expensive both in time and cost. Blocking is a classic technique from the database community that partitions records into groups and only compares records within the same block.

DocETL implements a modern variant of blocking using dense vector embeddings. Each record is embedded into a high-dimensional vector space, and cosine similarity between embedding vectors serves as a cheap proxy for semantic relatedness. Only pairs whose cosine similarity exceeds a configurable threshold are forwarded to the expensive LLM comparison step. This dramatically reduces the number of LLM calls while preserving high recall for true matches.

The key challenge is choosing the right threshold. Too high a threshold misses true matches (low recall); too low a threshold passes too many pairs through to LLM comparison (high cost). DocETL addresses this with a RuntimeBlockingOptimizer that samples a small number of pairs using stratified and exponential-weighted sampling across the similarity distribution, performs LLM comparisons on the sample, and then finds the highest threshold that achieves a target recall rate (default 95%). This auto-tuning approach removes the need for manual threshold selection while providing statistical guarantees on recall.

Key Design Decisions

Decision	Choice	Rationale
Similarity metric	Cosine similarity on dense embeddings	Captures semantic relatedness better than lexical overlap; works across diverse text types and lengths
Threshold selection	Auto-tuned via recall-targeted optimization on sampled pairs	Eliminates manual tuning while providing recall guarantees; sampling keeps calibration cost low
Sampling strategy	Hybrid stratified plus exponential-weighted sampling	Ensures coverage across the full similarity distribution while concentrating samples where matches are most likely (high-similarity region)

Related Pages

Implementation:Ucbepic_Docetl_Blocking_Utils

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment