Principle:Ucbepic Docetl Blocking Threshold Optimization
| Knowledge Sources | |
|---|---|
| Domains | LLM_Data_Processing, Record_Linkage |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Embedding-based blocking reduces pairwise comparison costs in equijoin and resolve operations by using cosine similarity thresholds on embedding vectors to create candidate pairs, avoiding O(n squared) LLM calls.
Theoretical Basis
In record linkage and entity resolution tasks, naively comparing every pair of records requires O(n squared) comparisons. When each comparison involves an LLM call, this becomes prohibitively expensive both in time and cost. Blocking is a classic technique from the database community that partitions records into groups and only compares records within the same block.
DocETL implements a modern variant of blocking using dense vector embeddings. Each record is embedded into a high-dimensional vector space, and cosine similarity between embedding vectors serves as a cheap proxy for semantic relatedness. Only pairs whose cosine similarity exceeds a configurable threshold are forwarded to the expensive LLM comparison step. This dramatically reduces the number of LLM calls while preserving high recall for true matches.
The key challenge is choosing the right threshold. Too high a threshold misses true matches (low recall); too low a threshold passes too many pairs through to LLM comparison (high cost). DocETL addresses this with a RuntimeBlockingOptimizer that samples a small number of pairs using stratified and exponential-weighted sampling across the similarity distribution, performs LLM comparisons on the sample, and then finds the highest threshold that achieves a target recall rate (default 95%). This auto-tuning approach removes the need for manual threshold selection while providing statistical guarantees on recall.
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Similarity metric | Cosine similarity on dense embeddings | Captures semantic relatedness better than lexical overlap; works across diverse text types and lengths |
| Threshold selection | Auto-tuned via recall-targeted optimization on sampled pairs | Eliminates manual tuning while providing recall guarantees; sampling keeps calibration cost low |
| Sampling strategy | Hybrid stratified plus exponential-weighted sampling | Ensures coverage across the full similarity distribution while concentrating samples where matches are most likely (high-similarity region) |