Implementation:Duckdb Duckdb UTF8Proc

Knowledge Sources	Duckdb_Duckdb UTF8Proc
Domains	Text_Processing, Third_Party
Last Updated	2026-02-07 12:00 GMT

Overview

UTF8Proc is a Unicode processing library (version 2.9.0) embedded in DuckDB that provides UTF-8 validation, normalization, case conversion, grapheme cluster segmentation, and character property lookup.

Description

DuckDB integrates the utf8proc library (maintained upstream by the Julia project) through three layers:

Core Library (utf8proc.cpp, 825 lines) -- The upstream utf8proc implementation providing low-level Unicode operations: codepoint encoding/decoding, Unicode normalization forms (NFC, NFD, NFKC, NFKD), case folding, grapheme break detection, character width computation, and Unicode category classification. All symbols are declared inside the duckdb namespace.
Unicode Data Tables (utf8proc_data.cpp, 16960 lines) -- Auto-generated lookup tables containing Unicode character properties, decomposition mappings, case-folding rules, composition pairs, and grapheme break data. Included directly into utf8proc.cpp via #include.
DuckDB Wrapper (utf8proc_wrapper.cpp, 411 lines) -- A higher-level C++ interface (Utf8Proc class) providing DuckDB-specific conveniences: fast 8-byte-at-a-time ASCII detection, detailed UTF-8 validation with error position/reason reporting, invalid byte replacement or removal, NFC normalization, codepoint-to-UTF8 conversion, grapheme cluster iteration via a range-based for loop, and render width calculation.

The wrapper's Analyze function uses a fast path for ASCII: it reads 8 bytes at a time and checks whether any byte has its high bit set, falling back to per-byte validation only when non-ASCII data is encountered. The GraphemeIterator class enables idiomatic C++ iteration over grapheme clusters (user-perceived characters) and is built on utf8proc_grapheme_break_stateful.
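
The 8-byte scan described above can be sketched in isolation. This is a minimal illustration, not DuckDB's code: the helper name `FirstNonAsciiOffset` and its return convention are assumptions made for this example, and it only locates the first non-ASCII byte rather than validating what follows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of DuckDB): returns the offset of the first
// byte with its high bit set, or len if the whole buffer is plain ASCII.
inline std::size_t FirstNonAsciiOffset(const char *s, std::size_t len) {
    assert(s != nullptr || len == 0);
    constexpr std::uint64_t kHighBits = 0x8080808080808080ULL;
    std::size_t i = 0;
    // Fast path: test eight bytes per iteration with a single mask.
    for (; i + sizeof(std::uint64_t) <= len; i += sizeof(std::uint64_t)) {
        std::uint64_t word;
        std::memcpy(&word, s + i, sizeof(word)); // safe for unaligned input
        if (word & kHighBits) {
            break; // some byte in this word is non-ASCII
        }
    }
    // Slow path: find the exact byte; also covers a tail shorter than 8 bytes.
    for (; i < len; i++) {
        if (static_cast<unsigned char>(s[i]) & 0x80) {
            return i;
        }
    }
    return len;
}
```

A caller could treat a return value equal to `len` as "pure ASCII" and hand anything else to a full per-byte validator starting at the returned offset.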

Usage

DuckDB uses UTF8Proc pervasively throughout the engine:

String Validation -- Every string value ingested by DuckDB is validated as UTF-8 using Utf8Proc::Analyze and Utf8Proc::IsValid.
Unicode Normalization -- The Utf8Proc::Normalize method applies NFC normalization for consistent string comparison.
Case Conversion -- Functions like UPPER() and LOWER() use Utf8Proc::CodepointToUpper and Utf8Proc::CodepointToLower.
String Length and Substring -- Grapheme-aware string operations use Utf8Proc::NextGraphemeCluster and Utf8Proc::GraphemeCount for correct Unicode character counting.
Accent Stripping -- The strip_accents scalar function uses utf8proc decomposition to remove combining marks.
Render Width -- Terminal and display formatting uses Utf8Proc::RenderWidth to compute East Asian character widths.
Error Recovery -- Utf8Proc::MakeValid and Utf8Proc::RemoveInvalid handle malformed input gracefully.

Code Reference

Source Location

Repository: Duckdb_Duckdb
Files:
- third_party/utf8proc/utf8proc.cpp -- UTF-8 processing core (825 lines)
- third_party/utf8proc/utf8proc_data.cpp -- Unicode data tables (16960 lines)
- third_party/utf8proc/utf8proc_wrapper.cpp -- DuckDB wrapper (411 lines)
- third_party/utf8proc/include/utf8proc.hpp -- upstream header
- third_party/utf8proc/include/utf8proc_wrapper.hpp -- wrapper header

Signature

// DuckDB Wrapper API (utf8proc_wrapper.hpp)
namespace duckdb {

enum class UnicodeType { INVALID, ASCII, UTF8 };
enum class UnicodeInvalidReason { BYTE_MISMATCH, INVALID_UNICODE };

class Utf8Proc {
public:
    // Validation
    static UnicodeType Analyze(const char *s, size_t len,
                               UnicodeInvalidReason *invalid_reason = nullptr,
                               size_t *invalid_pos = nullptr);
    static bool IsValid(const char *s, size_t len);

    // Normalization
    static char* Normalize(const char *s, size_t len);

    // Error recovery
    static void MakeValid(char *s, size_t len, char special_flag = '?');
    static std::string RemoveInvalid(const char *s, size_t len);

    // Grapheme cluster operations
    static size_t NextGraphemeCluster(const char *s, size_t len, size_t pos);
    static size_t PreviousGraphemeCluster(const char *s, size_t len, size_t pos);
    static size_t GraphemeCount(const char *s, size_t len);
    static GraphemeIterator GraphemeClusters(const char *s, size_t len);

    // Codepoint conversion
    static bool CodepointToUtf8(int cp, int &sz, char *c);
    static int CodepointLength(int cp);
    static int32_t UTF8ToCodepoint(const char *c, int &sz);

    // Case conversion
    static int32_t CodepointToUpper(int32_t codepoint);
    static int32_t CodepointToLower(int32_t codepoint);

    // Render width
    static size_t RenderWidth(const char *s, size_t len, size_t pos);
    static size_t RenderWidth(const std::string &str);
};

// Upstream utf8proc C API (utf8proc.hpp)
// Version: 2.9.0
const char *utf8proc_version(void);
utf8proc_ssize_t utf8proc_iterate(const utf8proc_uint8_t *str,
                                   utf8proc_ssize_t strlen,
                                   utf8proc_int32_t *codepoint_ref);
utf8proc_ssize_t utf8proc_encode_char(utf8proc_int32_t codepoint,
                                       utf8proc_uint8_t *dst);
const utf8proc_property_struct *utf8proc_get_property(utf8proc_int32_t codepoint);
utf8proc_bool utf8proc_grapheme_break_stateful(utf8proc_int32_t cp1,
                                                utf8proc_int32_t cp2,
                                                utf8proc_int32_t *state);
utf8proc_int32_t utf8proc_toupper(utf8proc_int32_t cp);
utf8proc_int32_t utf8proc_tolower(utf8proc_int32_t cp);
utf8proc_uint8_t *utf8proc_NFC(const utf8proc_uint8_t *str, utf8proc_ssize_t len);
utf8proc_uint8_t *utf8proc_NFD(const utf8proc_uint8_t *str, utf8proc_ssize_t len);
utf8proc_uint8_t *utf8proc_NFKC(const utf8proc_uint8_t *str, utf8proc_ssize_t len);
utf8proc_uint8_t *utf8proc_NFKD(const utf8proc_uint8_t *str, utf8proc_ssize_t len);

} // namespace duckdb

Import

#include "utf8proc_wrapper.hpp"    // DuckDB Utf8Proc wrapper class
#include "utf8proc.hpp"            // Upstream utf8proc C API within duckdb namespace

I/O Contract

Inputs

Name	Type	Required	Description
`s`	`const char *`	Yes	Pointer to the input string bytes (UTF-8 encoded or potentially invalid).
`len`	`size_t`	Yes	Length of the input string in bytes.
`invalid_reason`	`UnicodeInvalidReason *`	No	Output parameter set to `BYTE_MISMATCH` or `INVALID_UNICODE` on validation failure.
`invalid_pos`	`size_t *`	No	Output parameter set to the byte offset of the first invalid sequence.
`cp` / `codepoint`	`int` / `int32_t`	Yes (for codepoint functions)	A Unicode codepoint value (0 to 0x10FFFF).
`pos`	`size_t`	Yes (for grapheme functions)	Current byte position within the string for grapheme cluster traversal.
`special_flag`	`char`	No	Replacement character for invalid bytes in `MakeValid`. Defaults to `'?'`. Must be <= 127 (ASCII).
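
To illustrate how the optional `invalid_reason` and `invalid_pos` out-parameters might be filled, here is a simplified standalone validator. It is a sketch under stated assumptions, not the wrapper's implementation: the names `SketchReason` and `ValidateUtf8Sketch` are hypothetical, and the choice of which offset to report for truncated or out-of-range sequences is this example's own.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the two documented failure reasons.
enum class SketchReason { BYTE_MISMATCH, INVALID_UNICODE };

// Simplified validator; both out-parameters are optional, as in the table above.
inline bool ValidateUtf8Sketch(const char *s, std::size_t len,
                               SketchReason *reason = nullptr,
                               std::size_t *pos = nullptr) {
    assert(s != nullptr || len == 0);
    auto fail = [&](SketchReason r, std::size_t at) {
        if (reason) { *reason = r; }
        if (pos) { *pos = at; }
        return false;
    };
    static const unsigned kMinForLength[] = {0, 0x80, 0x800, 0x10000};
    for (std::size_t i = 0; i < len; i++) {
        unsigned c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {
            continue; // ASCII byte
        }
        const std::size_t start = i;
        int extra = 0;
        unsigned cp = 0;
        if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else { return fail(SketchReason::BYTE_MISMATCH, i); } // bad lead byte
        for (int j = 0; j < extra; j++) {
            ++i;
            if (i >= len) {
                return fail(SketchReason::BYTE_MISMATCH, start); // truncated
            }
            unsigned n = static_cast<unsigned char>(s[i]);
            if ((n & 0xC0) != 0x80) {
                return fail(SketchReason::BYTE_MISMATCH, i); // not 10xxxxxx
            }
            cp = (cp << 6) | (n & 0x3F);
        }
        // Reject overlong encodings, surrogates, and values above U+10FFFF.
        if (cp < kMinForLength[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            return fail(SketchReason::INVALID_UNICODE, start);
        }
    }
    return true;
}
```

Passing `nullptr` for either out-parameter skips that piece of diagnostics, which keeps the cheap "is it valid?" call and the detailed error-reporting call on one code path.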

Outputs

Name	Type	Description
`Analyze` return	`UnicodeType`	`ASCII` if all bytes are 0x00-0x7F, `UTF8` if the input is valid and contains at least one multi-byte sequence, `INVALID` if it is malformed.
`Normalize` return	`char *`	Heap-allocated NFC-normalized string. Caller must `free()` the result.
`IsValid` return	`bool`	`true` if the string contains only valid UTF-8 sequences.
`MakeValid`	`void` (in-place)	Modifies the input buffer, replacing invalid bytes with `special_flag`.
`RemoveInvalid` return	`std::string`	New string with all invalid UTF-8 byte sequences removed.
`NextGraphemeCluster` return	`size_t`	Byte position of the start of the next grapheme cluster after `pos`.
`GraphemeCount` return	`size_t`	Total number of grapheme clusters (user-perceived characters) in the string.
`UTF8ToCodepoint` return	`int32_t`	Unicode codepoint decoded from the UTF-8 sequence at the given position. Sets `sz` to the byte length consumed.
`CodepointToUtf8` return	`bool`	`true` on success, writes UTF-8 bytes to `c` and length to `sz`. `false` for surrogates or out-of-range codepoints.
`RenderWidth` return	`size_t`	Display width of a character or string, accounting for East Asian wide characters.
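
The `CodepointToUtf8` contract in the table (write the bytes, report the length, return `false` for surrogates and out-of-range values) can be sketched with standard UTF-8 bit packing. The functions `EncodeCodepointSketch` and `EncodeToString` are hypothetical stand-ins written for this page; setting `sz` to 0 on failure is an assumption of the sketch, not documented wrapper behavior.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in that follows the documented contract only.
// out must have room for at least 4 bytes.
inline bool EncodeCodepointSketch(int cp, int &sz, char *out) {
    assert(out != nullptr);
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        sz = 0; // assumption of this sketch
        return false;
    }
    if (cp < 0x80) {
        out[0] = static_cast<char>(cp);
        sz = 1;
    } else if (cp < 0x800) {
        out[0] = static_cast<char>(0xC0 | (cp >> 6));
        out[1] = static_cast<char>(0x80 | (cp & 0x3F));
        sz = 2;
    } else if (cp < 0x10000) {
        out[0] = static_cast<char>(0xE0 | (cp >> 12));
        out[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<char>(0x80 | (cp & 0x3F));
        sz = 3;
    } else {
        out[0] = static_cast<char>(0xF0 | (cp >> 18));
        out[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[3] = static_cast<char>(0x80 | (cp & 0x3F));
        sz = 4;
    }
    return true;
}

// Convenience wrapper: returns an empty string when encoding fails.
inline std::string EncodeToString(int cp) {
    char buf[4];
    int sz = 0;
    return EncodeCodepointSketch(cp, sz, buf) ? std::string(buf, sz) : std::string();
}
```

The four branches correspond to the 1-, 2-, 3-, and 4-byte UTF-8 forms, so the value written to `sz` doubles as the result a `CodepointLength`-style query would give for the same codepoint.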

Usage Examples

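#include <cstdio>   // printf
#include <cstdlib>  // free
#include <cstring>  // strlen
#include <string>   // std::string
#include "utf8proc_wrapper.hpp"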

using namespace duckdb;

// Validate a UTF-8 string with detailed error reporting
const char *input = "Hello, \xc3\x28 world";
UnicodeInvalidReason reason;
size_t error_pos;
UnicodeType type = Utf8Proc::Analyze(input, strlen(input), &reason, &error_pos);
if (type == UnicodeType::INVALID) {
    // reason == UnicodeInvalidReason::BYTE_MISMATCH, error_pos == 8
    printf("Invalid UTF-8 at byte %zu\n", error_pos);
}

// Quick validity check
if (Utf8Proc::IsValid("valid UTF-8 string", 18)) {
    printf("String is valid UTF-8\n");
}

// NFC normalization (caller must free result)
const char *denormalized = "e\xcc\x81";  // 'e' + combining acute accent
char *normalized = Utf8Proc::Normalize(denormalized, 3);
// normalized now contains "\xc3\xa9" (precomposed e-acute)
free(normalized);

// Fix invalid UTF-8 in place
char buffer[] = "Hello\x80World";
Utf8Proc::MakeValid(buffer, sizeof(buffer) - 1, '?');
// buffer is now "Hello?World"

// Count grapheme clusters (user-perceived characters)
const char *emoji = "\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9";  // man + ZWJ + woman emoji sequence
size_t count = Utf8Proc::GraphemeCount(emoji, strlen(emoji));

// Iterate over grapheme clusters
for (auto cluster : Utf8Proc::GraphemeClusters("cafe\xcc\x81", 6)) {
    printf("Cluster: bytes %zu-%zu\n", cluster.start, cluster.end);
}

// Case conversion on single codepoints
int32_t upper = Utf8Proc::CodepointToUpper(0x00E9);  // e-acute -> E-acute (0x00C9)
int32_t lower = Utf8Proc::CodepointToLower(0x0041);  // 'A' -> 'a' (0x0061)

// Render width (East Asian characters are width 2)
size_t width = Utf8Proc::RenderWidth(std::string("\xe4\xb8\xad"));  // CJK character, width = 2
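
Because `Normalize` is documented as returning a heap buffer that the caller must `free()`, one option is to wrap such results in a smart pointer with a `free`-based deleter. The sketch below is self-contained and does not call DuckDB: `MallocCopy` is a hypothetical stand-in for any function with the same ownership rule, and `MallocedString` and `OwnedCopy` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>

// Deleter that releases malloc-allocated memory with free().
struct FreeDeleter {
    void operator()(char *p) const noexcept { std::free(p); }
};
using MallocedString = std::unique_ptr<char, FreeDeleter>;

// Hypothetical stand-in for a function that returns a malloc-allocated,
// caller-owned buffer (the ownership rule documented for Normalize).
inline char *MallocCopy(const char *s, std::size_t len) {
    assert(s != nullptr);
    char *out = static_cast<char *>(std::malloc(len + 1));
    if (out == nullptr) {
        return nullptr;
    }
    std::memcpy(out, s, len);
    out[len] = '\0';
    return out;
}

// Takes ownership immediately so the buffer is freed on every exit path.
inline MallocedString OwnedCopy(const char *s, std::size_t len) {
    return MallocedString(MallocCopy(s, len));
}
```

With the real wrapper, the same pattern would be `MallocedString normalized(Utf8Proc::Normalize(s, len));`, which removes the need for an explicit `free(normalized)` call.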

Related Pages

Principle:Duckdb_Duckdb_Unicode_Processing
