Implementation:Duckdb Duckdb UTF8Proc

From Leeroopedia


Knowledge Sources
Domains: Text_Processing, Third_Party
Last Updated: 2026-02-07 12:00 GMT

Overview

UTF8Proc is a Unicode processing library (version 2.9.0) embedded in DuckDB that provides UTF-8 validation, normalization, case conversion, grapheme cluster segmentation, and character property lookup.

Description

DuckDB integrates the utf8proc library (maintained upstream by the Julia project) through three layers:

  1. Core Library (utf8proc.cpp, 825 lines) -- The upstream utf8proc implementation providing low-level Unicode operations: codepoint encoding/decoding, Unicode normalization forms (NFC, NFD, NFKC, NFKD), case folding, grapheme break detection, character width computation, and Unicode category classification. All symbols are declared inside the duckdb namespace.
  2. Unicode Data Tables (utf8proc_data.cpp, 16960 lines) -- Auto-generated lookup tables containing Unicode character properties, decomposition mappings, case-folding rules, composition pairs, and grapheme break data. Included directly into utf8proc.cpp via #include.
  3. DuckDB Wrapper (utf8proc_wrapper.cpp, 411 lines) -- A higher-level C++ interface (Utf8Proc class) providing DuckDB-specific conveniences: fast 8-byte-at-a-time ASCII detection, detailed UTF-8 validation with error position/reason reporting, invalid byte replacement or removal, NFC normalization, codepoint-to-UTF8 conversion, grapheme cluster iteration via a range-based for loop, and render width calculation.

The wrapper's Analyze function uses a fast path for ASCII: it reads 8 bytes at a time and checks whether any byte in the chunk has its high bit set, falling back to per-byte validation only when non-ASCII data is encountered. The GraphemeIterator class enables idiomatic C++ iteration over grapheme clusters (user-perceived characters) using utf8proc_grapheme_break_stateful.
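The 8-byte ASCII check described above can be sketched in isolation. This is a minimal illustration of the technique, not DuckDB's actual code: the function name IsAllAscii and the constant kHighBits are made up for this example, and the real Analyze goes on to validate multi-byte sequences instead of simply returning false.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of DuckDB): returns true if every byte in
// [s, s + len) is below 0x80, i.e. the buffer is pure ASCII.
bool IsAllAscii(const char *s, std::size_t len) {
    // One 0x80 bit per byte: a non-zero AND means some byte is non-ASCII.
    constexpr std::uint64_t kHighBits = 0x8080808080808080ULL;
    std::size_t i = 0;
    // Fast path: test eight bytes per iteration.
    for (; i + sizeof(std::uint64_t) <= len; i += sizeof(std::uint64_t)) {
        std::uint64_t chunk;
        std::memcpy(&chunk, s + i, sizeof(chunk)); // memcpy avoids unaligned loads
        if (chunk & kHighBits) {
            return false;
        }
    }
    // Tail: fewer than eight bytes remain, so check them one at a time.
    for (; i < len; i++) {
        if (static_cast<unsigned char>(s[i]) & 0x80) {
            return false;
        }
    }
    return true;
}
```

The mask test does not depend on byte order, because it only asks whether any of the eight bytes has its top bit set, not which one.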

Usage

DuckDB uses UTF8Proc throughout the engine:

  • String Validation -- Every string value ingested by DuckDB is validated as UTF-8 using Utf8Proc::Analyze and Utf8Proc::IsValid.
  • Unicode Normalization -- The Utf8Proc::Normalize method applies NFC normalization for consistent string comparison.
  • Case Conversion -- Functions like UPPER() and LOWER() use Utf8Proc::CodepointToUpper and Utf8Proc::CodepointToLower.
  • String Length and Substring -- Grapheme-aware string operations use Utf8Proc::NextGraphemeCluster and Utf8Proc::GraphemeCount for correct Unicode character counting.
  • Accent Stripping -- The strip_accents scalar function uses utf8proc decomposition to remove combining marks.
  • Render Width -- Terminal and display formatting uses Utf8Proc::RenderWidth to compute display width, counting East Asian wide characters as two columns.
  • Error Recovery -- Utf8Proc::MakeValid and Utf8Proc::RemoveInvalid handle malformed input gracefully.

Code Reference

Source Location

Signature

// DuckDB Wrapper API (utf8proc_wrapper.hpp)
namespace duckdb {

enum class UnicodeType { INVALID, ASCII, UTF8 };
enum class UnicodeInvalidReason { BYTE_MISMATCH, INVALID_UNICODE };

class Utf8Proc {
public:
    // Validation
    static UnicodeType Analyze(const char *s, size_t len,
                               UnicodeInvalidReason *invalid_reason = nullptr,
                               size_t *invalid_pos = nullptr);
    static bool IsValid(const char *s, size_t len);

    // Normalization
    static char* Normalize(const char *s, size_t len);

    // Error recovery
    static void MakeValid(char *s, size_t len, char special_flag = '?');
    static std::string RemoveInvalid(const char *s, size_t len);

    // Grapheme cluster operations
    static size_t NextGraphemeCluster(const char *s, size_t len, size_t pos);
    static size_t PreviousGraphemeCluster(const char *s, size_t len, size_t pos);
    static size_t GraphemeCount(const char *s, size_t len);
    static GraphemeIterator GraphemeClusters(const char *s, size_t len);

    // Codepoint conversion
    static bool CodepointToUtf8(int cp, int &sz, char *c);
    static int CodepointLength(int cp);
    static int32_t UTF8ToCodepoint(const char *c, int &sz);

    // Case conversion
    static int32_t CodepointToUpper(int32_t codepoint);
    static int32_t CodepointToLower(int32_t codepoint);

    // Render width
    static size_t RenderWidth(const char *s, size_t len, size_t pos);
    static size_t RenderWidth(const std::string &str);
};

// Upstream utf8proc C API (utf8proc.hpp)
// Version: 2.9.0
const char *utf8proc_version(void);
utf8proc_ssize_t utf8proc_iterate(const utf8proc_uint8_t *str,
                                   utf8proc_ssize_t strlen,
                                   utf8proc_int32_t *codepoint_ref);
utf8proc_ssize_t utf8proc_encode_char(utf8proc_int32_t codepoint,
                                       utf8proc_uint8_t *dst);
const utf8proc_property_struct *utf8proc_get_property(utf8proc_int32_t codepoint);
utf8proc_bool utf8proc_grapheme_break_stateful(utf8proc_int32_t cp1,
                                                utf8proc_int32_t cp2,
                                                utf8proc_int32_t *state);
utf8proc_int32_t utf8proc_toupper(utf8proc_int32_t cp);
utf8proc_int32_t utf8proc_tolower(utf8proc_int32_t cp);
utf8proc_uint8_t *utf8proc_NFC(const utf8proc_uint8_t *str, utf8proc_ssize_t len);
utf8proc_uint8_t *utf8proc_NFD(const utf8proc_uint8_t *str, utf8proc_ssize_t len);
utf8proc_uint8_t *utf8proc_NFKC(const utf8proc_uint8_t *str, utf8proc_ssize_t len);
utf8proc_uint8_t *utf8proc_NFKD(const utf8proc_uint8_t *str, utf8proc_ssize_t len);

} // namespace duckdb

Import

#include "utf8proc_wrapper.hpp"    // DuckDB Utf8Proc wrapper class
#include "utf8proc.hpp"            // Upstream utf8proc C API within duckdb namespace

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| s | const char * | Yes | Pointer to the input string bytes (UTF-8 encoded or potentially invalid). |
| len | size_t | Yes | Length of the input string in bytes. |
| invalid_reason | UnicodeInvalidReason * | No | Output parameter set to BYTE_MISMATCH or INVALID_UNICODE on validation failure. |
| invalid_pos | size_t * | No | Output parameter set to the byte offset of the first invalid sequence. |
| cp / codepoint | int / int32_t | Yes (for codepoint functions) | A Unicode codepoint value (0 to 0x10FFFF). |
| pos | size_t | Yes (for grapheme functions) | Current byte position within the string for grapheme cluster traversal. |
| special_flag | char | No | Replacement character for invalid bytes in MakeValid. Defaults to '?'. Must be <= 127 (ASCII). |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| Analyze return | UnicodeType | ASCII if all bytes are 0x00-0x7F, UTF8 if valid multi-byte sequences are present, INVALID if malformed. |
| Normalize return | char * | Heap-allocated NFC-normalized string. Caller must free() the result. |
| IsValid return | bool | true if the string contains only valid UTF-8 sequences. |
| MakeValid | void (in-place) | Modifies the input buffer, replacing invalid bytes with special_flag. |
| RemoveInvalid return | std::string | New string with all invalid UTF-8 byte sequences removed. |
| NextGraphemeCluster return | size_t | Byte position of the start of the next grapheme cluster after pos. |
| GraphemeCount return | size_t | Total number of grapheme clusters (user-perceived characters) in the string. |
| UTF8ToCodepoint return | int32_t | Unicode codepoint decoded from the UTF-8 sequence at the given position. Sets sz to the byte length consumed. |
| CodepointToUtf8 return | bool | true on success; writes the UTF-8 bytes to c and their length to sz. false for surrogates or out-of-range codepoints. |
| RenderWidth return | size_t | Display width of a character or string, accounting for East Asian wide characters. |
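To make the CodepointToUtf8 contract above concrete, the following is a minimal, standalone sketch of UTF-8 encoding with the same argument order. It is not DuckDB's implementation: the names Utf8Length and EncodeUtf8 are hypothetical, and returning -1 for an unencodable codepoint and resetting sz to 0 on failure are choices made for this example only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of UTF-8 bytes needed for cp, or -1 if cp is
// a UTF-16 surrogate or lies outside the Unicode range.
int Utf8Length(std::int32_t cp) {
    if (cp < 0 || cp > 0x10FFFF) return -1;
    if (cp >= 0xD800 && cp <= 0xDFFF) return -1;
    if (cp <= 0x7F) return 1;
    if (cp <= 0x7FF) return 2;
    if (cp <= 0xFFFF) return 3;
    return 4;
}

// Hypothetical helper: writes the encoding of cp into out (which must hold
// at least 4 bytes) and its length into sz. Returns false if cp is not
// encodable, in which case sz is set to 0 and out is left untouched.
bool EncodeUtf8(std::int32_t cp, int &sz, char *out) {
    assert(out != nullptr);
    sz = Utf8Length(cp);
    switch (sz) {
    case 1: // 0xxxxxxx
        out[0] = static_cast<char>(cp);
        return true;
    case 2: // 110xxxxx 10xxxxxx
        out[0] = static_cast<char>(0xC0 | (cp >> 6));
        out[1] = static_cast<char>(0x80 | (cp & 0x3F));
        return true;
    case 3: // 1110xxxx 10xxxxxx 10xxxxxx
        out[0] = static_cast<char>(0xE0 | (cp >> 12));
        out[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<char>(0x80 | (cp & 0x3F));
        return true;
    case 4: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = static_cast<char>(0xF0 | (cp >> 18));
        out[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[3] = static_cast<char>(0x80 | (cp & 0x3F));
        return true;
    default:
        sz = 0;
        return false;
    }
}
```

For instance, U+00E9 encodes to the two bytes C3 A9, the same precomposed e-acute that NFC normalization produces from 'e' followed by a combining acute accent.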

Usage Examples

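#include <cstdio>   // printf
#include <cstdlib>  // free
#include <cstring>  // strlen
#include <string>   // std::string

#include "utf8proc_wrapper.hpp"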

using namespace duckdb;

// Validate a UTF-8 string with detailed error reporting
const char *input = "Hello, \xc3\x28 world";
UnicodeInvalidReason reason;
size_t error_pos;
UnicodeType type = Utf8Proc::Analyze(input, strlen(input), &reason, &error_pos);
if (type == UnicodeType::INVALID) {
    // reason == UnicodeInvalidReason::BYTE_MISMATCH, error_pos == 8
    printf("Invalid UTF-8 at byte %zu\n", error_pos);
}

// Quick validity check
if (Utf8Proc::IsValid("valid UTF-8 string", 18)) {
    printf("String is valid UTF-8\n");
}

// NFC normalization (caller must free result)
const char *denormalized = "e\xcc\x81";  // 'e' + combining acute accent
char *normalized = Utf8Proc::Normalize(denormalized, 3);
// normalized now contains "\xc3\xa9" (precomposed e-acute)
free(normalized);

// Fix invalid UTF-8 in place
char buffer[] = "Hello\x80World";
Utf8Proc::MakeValid(buffer, sizeof(buffer) - 1, '?');
// buffer is now "Hello?World"

// Count grapheme clusters (user-perceived characters)
const char *emoji = "\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9";  // man + ZWJ + woman (emoji ZWJ sequence)
size_t count = Utf8Proc::GraphemeCount(emoji, strlen(emoji));

// Iterate over grapheme clusters
for (auto cluster : Utf8Proc::GraphemeClusters("cafe\xcc\x81", 6)) {
    printf("Cluster: bytes %zu-%zu\n", cluster.start, cluster.end);
}

// Codepoint conversion
int32_t upper = Utf8Proc::CodepointToUpper(0x00E9);  // e-acute -> E-acute (0x00C9)
int32_t lower = Utf8Proc::CodepointToLower(0x0041);  // 'A' -> 'a' (0x0061)

// Render width (East Asian characters are width 2)
size_t width = Utf8Proc::RenderWidth(std::string("\xe4\xb8\xad"));  // CJK character, width = 2
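
Because Normalize returns a buffer that the caller must free(), one option is to hold the result in a std::unique_ptr with a free-based deleter so it is released on every exit path. The sketch below is self-contained and only illustrates that ownership pattern: FreeDeleter, MallocedString, and DuplicateWithMalloc are hypothetical names, and DuplicateWithMalloc is merely a stand-in for a call that returns malloc-allocated memory, such as Utf8Proc::Normalize.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

// Deleter for memory obtained from malloc-family allocators.
struct FreeDeleter {
    void operator()(char *p) const noexcept { std::free(p); }
};

// Owning handle for a malloc'ed character buffer (hypothetical alias).
using MallocedString = std::unique_ptr<char, FreeDeleter>;

// Stand-in for an API that returns a malloc'ed, NUL-terminated copy.
// In real code this would be the call that hands back the heap buffer.
char *DuplicateWithMalloc(const char *s, std::size_t len) {
    char *out = static_cast<char *>(std::malloc(len + 1));
    if (out == nullptr) {
        return nullptr;
    }
    std::memcpy(out, s, len);
    out[len] = '\0';
    return out;
}
```

With the wrapper class, the same pattern would read MallocedString normalized(Utf8Proc::Normalize(s, len)); and the explicit free(normalized) call is no longer needed.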
