Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Duckdb Duckdb Jaro Winkler

From Leeroopedia
Revision as of 14:50, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Duckdb_Duckdb_Jaro_Winkler.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains String_Similarity, Third_Party
Last Updated 2026-02-07 12:00 GMT

Overview

Jaro-Winkler is a string similarity library from the rapidfuzz project by Max Bachmann that provides efficient implementations of the Jaro and Jaro-Winkler similarity metrics, used by DuckDB for fuzzy string matching and similarity scoring.

Description

The implementation resides in the duckdb_jaro_winkler namespace and provides both one-shot and cached (precomputed) variants of the similarity functions:

Similarity Functions:

  • jaro_similarity -- Computes the Jaro similarity between two strings, a value between 0.0 (no similarity) and 1.0 (exact match). The algorithm counts the number of matching characters (within a distance window of max(len1, len2) / 2 - 1) and the number of transpositions, then combines them as: Sim = (matches/len1 + matches/len2 + (matches - transpositions/2)/matches) / 3.
  • jaro_winkler_similarity -- Extends Jaro similarity with a prefix bonus. Common prefixes (up to 4 characters) receive extra weight controlled by prefix_weight (default: 0.1, range: 0.0 to 0.25). The formula is: JW = Jaro + prefix_length * prefix_weight * (1 - Jaro).

Cached Variants:

  • CachedJaroSimilarity<CharT> -- Precomputes a BlockPatternMatchVector from one string, enabling efficient repeated comparisons against multiple other strings. Provides similarity() and normalized_similarity() methods.
  • CachedJaroWinklerSimilarity<CharT> -- Same as above but for Jaro-Winkler similarity, also storing the prefix_weight.

Implementation Details:

The internal implementation (jaro_impl.hpp) uses bit-parallel algorithms with FlaggedCharsWord (single 64-bit word for short strings) and FlaggedCharsMultiword (vector of 64-bit words for longer strings) to efficiently compute character matches and transpositions using popcount and bit manipulation intrinsics.

The implementation includes early-exit optimizations based on string length filtering (jaro_length_filter) and common character count filtering (jaro_common_char_filter) to skip computation when the score cannot exceed score_cutoff.

Usage

DuckDB uses Jaro-Winkler similarity for the jaro_winkler_similarity() and jaro_similarity() scalar functions, which are commonly used in fuzzy matching, deduplication, record linkage, and data cleaning workflows. The cached variants are used when comparing one string against many others (e.g., in similarity joins or "did you mean?" suggestions).

Code Reference

Source Location

Signature

namespace duckdb_jaro_winkler {

// Jaro-Winkler similarity (iterator-based)
template <typename InputIt1, typename InputIt2>
double jaro_winkler_similarity(InputIt1 first1, InputIt1 last1,
                                InputIt2 first2, InputIt2 last2,
                                double prefix_weight = 0.1,
                                double score_cutoff = 0.0);

// Jaro-Winkler similarity (container-based)
template <typename S1, typename S2>
double jaro_winkler_similarity(const S1& s1, const S2& s2,
                                double prefix_weight = 0.1,
                                double score_cutoff = 0.0);

// Jaro similarity (iterator-based)
template <typename InputIt1, typename InputIt2>
double jaro_similarity(InputIt1 first1, InputIt1 last1,
                       InputIt2 first2, InputIt2 last2,
                       double score_cutoff = 0.0);

// Jaro similarity (container-based)
template <typename S1, typename S2>
double jaro_similarity(const S1& s1, const S2& s2,
                       double score_cutoff = 0.0);

// Cached Jaro-Winkler (precompute pattern for repeated comparisons)
template <typename CharT1>
struct CachedJaroWinklerSimilarity {
    template <typename InputIt1>
    CachedJaroWinklerSimilarity(InputIt1 first1, InputIt1 last1,
                                 double prefix_weight_ = 0.1);
    template <typename S1>
    CachedJaroWinklerSimilarity(const S1& s1_, double prefix_weight_ = 0.1);

    template <typename InputIt2>
    double similarity(InputIt2 first2, InputIt2 last2, double score_cutoff = 0) const;
    template <typename S2>
    double similarity(const S2& s2, double score_cutoff = 0) const;

    template <typename InputIt2>
    double normalized_similarity(InputIt2 first2, InputIt2 last2, double score_cutoff = 0) const;
    template <typename S2>
    double normalized_similarity(const S2& s2, double score_cutoff = 0) const;
};

// Cached Jaro (precompute pattern for repeated comparisons)
template <typename CharT1>
struct CachedJaroSimilarity {
    template <typename InputIt1>
    CachedJaroSimilarity(InputIt1 first1, InputIt1 last1);
    template <typename S1>
    CachedJaroSimilarity(const S1& s1_);

    template <typename InputIt2>
    double similarity(InputIt2 first2, InputIt2 last2, double score_cutoff = 0) const;
    template <typename S2>
    double similarity(const S2& s2, double score_cutoff = 0) const;
};

} // namespace duckdb_jaro_winkler

Import

#include "jaro_winkler/jaro_winkler.hpp"

I/O Contract

Inputs

Name Type Required Description
s1, s2 string-like or iterator pair Yes The two strings to compare
prefix_weight double No Weight for the common prefix bonus in Jaro-Winkler (default: 0.1, range: [0.0, 0.25])
score_cutoff double No Minimum score threshold; returns 0.0 if the computed similarity is below this (default: 0.0)

Outputs

Name Type Description
return (jaro_similarity) double Jaro similarity score in [0.0, 1.0]; 0.0 if below score_cutoff
return (jaro_winkler_similarity) double Jaro-Winkler similarity score in [0.0, 1.0]; 0.0 if below score_cutoff

Usage Examples

#include "jaro_winkler/jaro_winkler.hpp"
#include <string>

// Basic Jaro-Winkler similarity
std::string s1 = "martha";
std::string s2 = "marhta";
double jw = duckdb_jaro_winkler::jaro_winkler_similarity(s1, s2);
// jw ~ 0.961

// Basic Jaro similarity
double j = duckdb_jaro_winkler::jaro_similarity(s1, s2);
// j ~ 0.944

// With a score cutoff (returns 0.0 if below threshold)
double jw_cut = duckdb_jaro_winkler::jaro_winkler_similarity(
    s1, s2, 0.1, 0.95);

// Cached variant for comparing one string against many
duckdb_jaro_winkler::CachedJaroWinklerSimilarity<char> cached("reference");
double sim1 = cached.similarity("referenc");
double sim2 = cached.similarity("diferent");

// Iterator-based interface
double jw_iter = duckdb_jaro_winkler::jaro_winkler_similarity(
    s1.begin(), s1.end(), s2.begin(), s2.end());

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment