Principle:Duckdb Duckdb String Similarity
| Knowledge Sources | |
|---|---|
| Domains | String_Processing, Fuzzy_Matching |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Algorithms for measuring the similarity between two strings based on character transpositions and common characters, providing a normalized score between 0 and 1 for fuzzy string comparison.
Description
Jaro and Jaro-Winkler are string similarity metrics designed for comparing short strings, particularly person names, addresses, and other record-linkage fields. Unlike edit-distance metrics (Levenshtein) that count the minimum operations to transform one string into another, Jaro similarity is based on the number and order of common characters between two strings.
The Jaro similarity between two strings is defined by three factors: the number of matching characters, the number of transpositions, and the lengths of the two strings. Two characters from the compared strings are considered matching only if they are the same and are within a certain distance threshold (defined as floor(max(|s1|, |s2|) / 2) - 1). The number of transpositions is half the number of matching characters that are in different positions in the two strings.
The Jaro-Winkler similarity extends Jaro by giving additional weight to strings that share a common prefix. The idea, proposed by William Winkler, is that typographical errors are less likely to occur at the beginning of a string. The Winkler modification adds a bonus proportional to the length of the common prefix (up to 4 characters), scaled by a prefix weight factor (typically 0.1). This boosts the similarity score for strings that start the same way, which is desirable for name matching.
Usage
String similarity functions are used in DuckDB through the `jaro_winkler_similarity()` and `jaro_similarity()` SQL functions. These are valuable for data cleaning, record linkage, deduplication, and fuzzy matching tasks. For example, joining customer records from different sources where names may have slight variations, or finding approximate matches in a lookup table.
Theoretical Basis
Jaro Similarity: Based on matching characters and transpositions:
function jaro_similarity(s1, s2):
if s1 == s2: return 1.0
if len(s1) == 0 or len(s2) == 0: return 0.0
// Match window: characters must be within this distance
match_distance = floor(max(len(s1), len(s2)) / 2) - 1
// Find matching characters
s1_matches = array[len(s1)] of false
s2_matches = array[len(s2)] of false
matches = 0
transpositions = 0
for i in 0..len(s1)-1:
start = max(0, i - match_distance)
end = min(i + match_distance + 1, len(s2))
for j in start..end-1:
if s2_matches[j] or s1[i] != s2[j]: continue
s1_matches[i] = true
s2_matches[j] = true
matches += 1
break
if matches == 0: return 0.0
// Count transpositions
k = 0
for i in 0..len(s1)-1:
if not s1_matches[i]: continue
while not s2_matches[k]: k += 1
if s1[i] != s2[k]: transpositions += 1
k += 1
jaro = (matches/len(s1) + matches/len(s2) +
(matches - transpositions/2)/matches) / 3.0
return jaro
Jaro-Winkler Extension: Prefix bonus:
function jaro_winkler_similarity(s1, s2, p = 0.1):
jaro = jaro_similarity(s1, s2)
// Common prefix length (up to 4 characters)
prefix_len = 0
for i in 0..min(3, min(len(s1), len(s2))-1):
if s1[i] == s2[i]: prefix_len += 1
else: break
// Winkler modification
return jaro + prefix_len * p * (1.0 - jaro)
// p is typically 0.1; prefix_len <= 4
// This boosts similarity for strings sharing a prefix
Example Computation:
s1 = "MARTHA", s2 = "MARHTA"
match_distance = floor(6/2) - 1 = 2
Matches: M-M, A-A, R-R, T-T, H-H, A-A = 6
Transpositions: T/H and H/T = 2 (1 pair)
Jaro = (6/6 + 6/6 + (6-1)/6) / 3 = (1 + 1 + 0.833) / 3 = 0.944
Common prefix "MAR" = 3 characters
Jaro-Winkler = 0.944 + 3 * 0.1 * (1 - 0.944) = 0.961