Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Ucbepic Docetl Blocking Threshold Optimization

From Leeroopedia


Knowledge Sources
Domains LLM_Data_Processing, Record_Linkage
Last Updated 2026-02-08 00:00 GMT

Overview

Embedding-based blocking reduces pairwise comparison costs in equijoin and resolve operations by using cosine similarity thresholds on embedding vectors to create candidate pairs, avoiding O(n squared) LLM calls.

Theoretical Basis

In record linkage and entity resolution tasks, naively comparing every pair of records requires O(n squared) comparisons. When each comparison involves an LLM call, this becomes prohibitively expensive both in time and cost. Blocking is a classic technique from the database community that partitions records into groups and only compares records within the same block.

DocETL implements a modern variant of blocking using dense vector embeddings. Each record is embedded into a high-dimensional vector space, and cosine similarity between embedding vectors serves as a cheap proxy for semantic relatedness. Only pairs whose cosine similarity exceeds a configurable threshold are forwarded to the expensive LLM comparison step. This dramatically reduces the number of LLM calls while preserving high recall for true matches.

The key challenge is choosing the right threshold. Too high a threshold misses true matches (low recall); too low a threshold passes too many pairs through to LLM comparison (high cost). DocETL addresses this with a RuntimeBlockingOptimizer that samples a small number of pairs using stratified and exponential-weighted sampling across the similarity distribution, performs LLM comparisons on the sample, and then finds the highest threshold that achieves a target recall rate (default 95%). This auto-tuning approach removes the need for manual threshold selection while providing statistical guarantees on recall.

Key Design Decisions

Decision Choice Rationale
Similarity metric Cosine similarity on dense embeddings Captures semantic relatedness better than lexical overlap; works across diverse text types and lengths
Threshold selection Auto-tuned via recall-targeted optimization on sampled pairs Eliminates manual tuning while providing recall guarantees; sampling keeps calibration cost low
Sampling strategy Hybrid stratified plus exponential-weighted sampling Ensures coverage across the full similarity distribution while concentrating samples where matches are most likely (high-similarity region)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment