Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Duckdb Duckdb String Similarity

From Leeroopedia
Revision as of 18:10, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Duckdb_Duckdb_String_Similarity.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains String_Processing, Fuzzy_Matching
Last Updated 2026-02-07 12:00 GMT

Overview

Algorithms for measuring the similarity between two strings based on character transpositions and common characters, providing a normalized score between 0 and 1 for fuzzy string comparison.

Description

Jaro and Jaro-Winkler are string similarity metrics designed for comparing short strings, particularly person names, addresses, and other record-linkage fields. Unlike edit-distance metrics (Levenshtein) that count the minimum operations to transform one string into another, Jaro similarity is based on the number and order of common characters between two strings.

The Jaro similarity between two strings is defined by three factors: the number of matching characters, the number of transpositions, and the lengths of the two strings. Two characters from the compared strings are considered matching only if they are the same and are within a certain distance threshold (defined as floor(max(|s1|, |s2|) / 2) - 1). The number of transpositions is half the number of matching characters that are in different positions in the two strings.

The Jaro-Winkler similarity extends Jaro by giving additional weight to strings that share a common prefix. The idea, proposed by William Winkler, is that typographical errors are less likely to occur at the beginning of a string. The Winkler modification adds a bonus proportional to the length of the common prefix (up to 4 characters), scaled by a prefix weight factor (typically 0.1). This boosts the similarity score for strings that start the same way, which is desirable for name matching.

Usage

String similarity functions are used in DuckDB through the `jaro_winkler_similarity()` and `jaro_similarity()` SQL functions. These are valuable for data cleaning, record linkage, deduplication, and fuzzy matching tasks. For example, joining customer records from different sources where names may have slight variations, or finding approximate matches in a lookup table.

Theoretical Basis

Jaro Similarity: Based on matching characters and transpositions:

function jaro_similarity(s1, s2):
    if s1 == s2: return 1.0
    if len(s1) == 0 or len(s2) == 0: return 0.0

    // Match window: characters must be within this distance
    match_distance = floor(max(len(s1), len(s2)) / 2) - 1

    // Find matching characters
    s1_matches = array[len(s1)] of false
    s2_matches = array[len(s2)] of false
    matches = 0
    transpositions = 0

    for i in 0..len(s1)-1:
        start = max(0, i - match_distance)
        end = min(i + match_distance + 1, len(s2))
        for j in start..end-1:
            if s2_matches[j] or s1[i] != s2[j]: continue
            s1_matches[i] = true
            s2_matches[j] = true
            matches += 1
            break

    if matches == 0: return 0.0

    // Count transpositions
    k = 0
    for i in 0..len(s1)-1:
        if not s1_matches[i]: continue
        while not s2_matches[k]: k += 1
        if s1[i] != s2[k]: transpositions += 1
        k += 1

    jaro = (matches/len(s1) + matches/len(s2) +
            (matches - transpositions/2)/matches) / 3.0
    return jaro

Jaro-Winkler Extension: Prefix bonus:

function jaro_winkler_similarity(s1, s2, p = 0.1):
    jaro = jaro_similarity(s1, s2)

    // Common prefix length (up to 4 characters)
    prefix_len = 0
    for i in 0..min(3, min(len(s1), len(s2))-1):
        if s1[i] == s2[i]: prefix_len += 1
        else: break

    // Winkler modification
    return jaro + prefix_len * p * (1.0 - jaro)
    // p is typically 0.1; prefix_len <= 4
    // This boosts similarity for strings sharing a prefix

Example Computation:

s1 = "MARTHA",  s2 = "MARHTA"
match_distance = floor(6/2) - 1 = 2
Matches: M-M, A-A, R-R, T-T, H-H, A-A = 6
Transpositions: T/H and H/T = 2 (1 pair)
Jaro = (6/6 + 6/6 + (6-1)/6) / 3 = (1 + 1 + 0.833) / 3 = 0.944
Common prefix "MAR" = 3 characters
Jaro-Winkler = 0.944 + 3 * 0.1 * (1 - 0.944) = 0.961

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment