Implementation:Duckdb Duckdb Jaro Winkler
| Knowledge Sources | |
|---|---|
| Domains | String_Similarity, Third_Party |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Jaro-Winkler is a string similarity library from the rapidfuzz project by Max Bachmann that provides efficient implementations of the Jaro and Jaro-Winkler similarity metrics, used by DuckDB for fuzzy string matching and similarity scoring.
Description
The implementation resides in the duckdb_jaro_winkler namespace and provides both one-shot and cached (precomputed) variants of the similarity functions:
Similarity Functions:
- jaro_similarity -- Computes the Jaro similarity between two strings, a value between 0.0 (no similarity) and 1.0 (exact match). The algorithm counts the number of matching characters (within a distance window of
max(len1, len2) / 2 - 1) and the number of transpositions, then combines them as:Sim = (matches/len1 + matches/len2 + (matches - transpositions/2)/matches) / 3.
- jaro_winkler_similarity -- Extends Jaro similarity with a prefix bonus. Common prefixes (up to 4 characters) receive extra weight controlled by
prefix_weight(default: 0.1, range: 0.0 to 0.25). The formula is:JW = Jaro + prefix_length * prefix_weight * (1 - Jaro).
Cached Variants:
- CachedJaroSimilarity<CharT> -- Precomputes a
BlockPatternMatchVectorfrom one string, enabling efficient repeated comparisons against multiple other strings. Providessimilarity()andnormalized_similarity()methods.
- CachedJaroWinklerSimilarity<CharT> -- Same as above but for Jaro-Winkler similarity, also storing the
prefix_weight.
Implementation Details:
The internal implementation (jaro_impl.hpp) uses bit-parallel algorithms with FlaggedCharsWord (single 64-bit word for short strings) and FlaggedCharsMultiword (vector of 64-bit words for longer strings) to efficiently compute character matches and transpositions using popcount and bit manipulation intrinsics.
The implementation includes early-exit optimizations based on string length filtering (jaro_length_filter) and common character count filtering (jaro_common_char_filter) to skip computation when the score cannot exceed score_cutoff.
Usage
DuckDB uses Jaro-Winkler similarity for the jaro_winkler_similarity() and jaro_similarity() scalar functions, which are commonly used in fuzzy matching, deduplication, record linkage, and data cleaning workflows. The cached variants are used when comparing one string against many others (e.g., in similarity joins or "did you mean?" suggestions).
Code Reference
Source Location
- Repository: Duckdb_Duckdb
- Files:
Signature
namespace duckdb_jaro_winkler {
// Jaro-Winkler similarity (iterator-based)
template <typename InputIt1, typename InputIt2>
double jaro_winkler_similarity(InputIt1 first1, InputIt1 last1,
InputIt2 first2, InputIt2 last2,
double prefix_weight = 0.1,
double score_cutoff = 0.0);
// Jaro-Winkler similarity (container-based)
template <typename S1, typename S2>
double jaro_winkler_similarity(const S1& s1, const S2& s2,
double prefix_weight = 0.1,
double score_cutoff = 0.0);
// Jaro similarity (iterator-based)
template <typename InputIt1, typename InputIt2>
double jaro_similarity(InputIt1 first1, InputIt1 last1,
InputIt2 first2, InputIt2 last2,
double score_cutoff = 0.0);
// Jaro similarity (container-based)
template <typename S1, typename S2>
double jaro_similarity(const S1& s1, const S2& s2,
double score_cutoff = 0.0);
// Cached Jaro-Winkler (precompute pattern for repeated comparisons)
template <typename CharT1>
struct CachedJaroWinklerSimilarity {
template <typename InputIt1>
CachedJaroWinklerSimilarity(InputIt1 first1, InputIt1 last1,
double prefix_weight_ = 0.1);
template <typename S1>
CachedJaroWinklerSimilarity(const S1& s1_, double prefix_weight_ = 0.1);
template <typename InputIt2>
double similarity(InputIt2 first2, InputIt2 last2, double score_cutoff = 0) const;
template <typename S2>
double similarity(const S2& s2, double score_cutoff = 0) const;
template <typename InputIt2>
double normalized_similarity(InputIt2 first2, InputIt2 last2, double score_cutoff = 0) const;
template <typename S2>
double normalized_similarity(const S2& s2, double score_cutoff = 0) const;
};
// Cached Jaro (precompute pattern for repeated comparisons)
template <typename CharT1>
struct CachedJaroSimilarity {
template <typename InputIt1>
CachedJaroSimilarity(InputIt1 first1, InputIt1 last1);
template <typename S1>
CachedJaroSimilarity(const S1& s1_);
template <typename InputIt2>
double similarity(InputIt2 first2, InputIt2 last2, double score_cutoff = 0) const;
template <typename S2>
double similarity(const S2& s2, double score_cutoff = 0) const;
};
} // namespace duckdb_jaro_winkler
Import
#include "jaro_winkler/jaro_winkler.hpp"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s1, s2 | string-like or iterator pair | Yes | The two strings to compare |
| prefix_weight | double | No | Weight for the common prefix bonus in Jaro-Winkler (default: 0.1, range: [0.0, 0.25]) |
| score_cutoff | double | No | Minimum score threshold; returns 0.0 if the computed similarity is below this (default: 0.0) |
Outputs
| Name | Type | Description |
|---|---|---|
| return (jaro_similarity) | double | Jaro similarity score in [0.0, 1.0]; 0.0 if below score_cutoff |
| return (jaro_winkler_similarity) | double | Jaro-Winkler similarity score in [0.0, 1.0]; 0.0 if below score_cutoff |
Usage Examples
#include "jaro_winkler/jaro_winkler.hpp"
#include <string>
// Basic Jaro-Winkler similarity
std::string s1 = "martha";
std::string s2 = "marhta";
double jw = duckdb_jaro_winkler::jaro_winkler_similarity(s1, s2);
// jw ~ 0.961
// Basic Jaro similarity
double j = duckdb_jaro_winkler::jaro_similarity(s1, s2);
// j ~ 0.944
// With a score cutoff (returns 0.0 if below threshold)
double jw_cut = duckdb_jaro_winkler::jaro_winkler_similarity(
s1, s2, 0.1, 0.95);
// Cached variant for comparing one string against many
duckdb_jaro_winkler::CachedJaroWinklerSimilarity<char> cached("reference");
double sim1 = cached.similarity("referenc");
double sim2 = cached.similarity("diferent");
// Iterator-based interface
double jw_iter = duckdb_jaro_winkler::jaro_winkler_similarity(
s1.begin(), s1.end(), s2.begin(), s2.end());