Implementation:Duckdb Duckdb UTF8Proc
| Knowledge Sources | |
|---|---|
| Domains | Text_Processing, Third_Party |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
UTF8Proc is a Unicode processing library (version 2.9.0) embedded in DuckDB that provides UTF-8 validation, normalization, case conversion, grapheme cluster segmentation, and character property lookup.
Description
DuckDB integrates the utf8proc library (originally from the Julia project) through three layers:
- Core Library (utf8proc.cpp, 825 lines) -- The upstream utf8proc implementation providing low-level Unicode operations: codepoint encoding/decoding, Unicode normalization forms (NFC, NFD, NFKC, NFKD), case folding, grapheme break detection, character width computation, and Unicode category classification. All operations work within the duckdb namespace.
- Unicode Data Tables (utf8proc_data.cpp, 16960 lines) -- Auto-generated lookup tables containing Unicode character properties, decomposition mappings, case-folding rules, composition pairs, and grapheme break data. Included directly into utf8proc.cpp via #include.
- DuckDB Wrapper (utf8proc_wrapper.cpp, 411 lines) -- A higher-level C++ interface (Utf8Proc class) providing DuckDB-specific conveniences: fast 8-byte-at-a-time ASCII detection, detailed UTF-8 validation with error position/reason reporting, invalid byte replacement or removal, NFC normalization, codepoint-to-UTF8 conversion, grapheme cluster iteration via a range-based for loop, and render width calculation.
The wrapper's Analyze function uses an optimized strategy: it scans the input 8 bytes at a time, checking whether any byte has its high bit set, and falls back to per-byte validation only when non-ASCII data is encountered. The GraphemeIterator class enables idiomatic C++ iteration over grapheme clusters (user-perceived characters) using utf8proc_grapheme_break_stateful.
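The 8-byte scan described above can be sketched in isolation. This is a minimal illustration of the masking technique and not DuckDB's actual code: the function name IsAsciiFast is hypothetical, and the real Analyze continues into full multi-byte validation rather than simply returning false.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of DuckDB): returns true when every byte
// in [s, s + len) is below 0x80, i.e. the buffer is pure ASCII.
bool IsAsciiFast(const char *s, size_t len) {
    // One bit per byte: set if that byte's high bit is set.
    constexpr uint64_t kHighBits = 0x8080808080808080ULL;
    size_t i = 0;
    // Word-at-a-time loop: memcpy avoids unaligned-load undefined behaviour.
    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t word;
        std::memcpy(&word, s + i, sizeof(word));
        if (word & kHighBits) {
            return false; // at least one non-ASCII byte in this word
        }
    }
    // Tail loop: check the remaining (< 8) bytes one at a time.
    for (; i < len; i++) {
        if (static_cast<unsigned char>(s[i]) & 0x80) {
            return false;
        }
    }
    return true;
}
```

Because the mask test does not depend on byte order, the same check works on little-endian and big-endian machines.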
Usage
DuckDB uses UTF8Proc pervasively throughout the engine:
- String Validation -- Every string value ingested by DuckDB is validated as UTF-8 using Utf8Proc::Analyze and Utf8Proc::IsValid.
- Unicode Normalization -- The Utf8Proc::Normalize method applies NFC normalization for consistent string comparison.
- Case Conversion -- Functions like UPPER() and LOWER() use Utf8Proc::CodepointToUpper and Utf8Proc::CodepointToLower.
- String Length and Substring -- Grapheme-aware string operations use Utf8Proc::NextGraphemeCluster and Utf8Proc::GraphemeCount for correct Unicode character counting.
- Accent Stripping -- The strip_accents scalar function uses utf8proc decomposition to remove combining marks.
- Render Width -- Terminal and display formatting uses Utf8Proc::RenderWidth to compute East Asian character widths.
- Error Recovery -- Utf8Proc::MakeValid and Utf8Proc::RemoveInvalid handle malformed input gracefully.
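To see why the length and substring functions need grapheme-aware counting, it helps to compare the simpler units. The sketch below counts code points in a buffer that is assumed to already be valid UTF-8; CountCodepoints is a hypothetical helper, not a DuckDB function. Grapheme counting itself needs the Unicode break tables and is not reproduced here.

```cpp
#include <cstddef>

// Hypothetical helper (not part of DuckDB): counts code points in a buffer
// assumed to be valid UTF-8. Every code point contributes exactly one byte
// that is not a continuation byte (continuation bytes look like 10xxxxxx).
size_t CountCodepoints(const char *s, size_t len) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the decomposed string "cafe\xcc\x81" this gives 5 code points in 6 bytes, while a reader sees 4 characters: the combining acute accent attaches to the preceding 'e' and forms a single grapheme cluster under the Unicode segmentation rules.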
Code Reference
Source Location
- Repository: Duckdb_Duckdb
- Files:
  - third_party/utf8proc/utf8proc.cpp -- UTF-8 processing core (825 lines)
  - third_party/utf8proc/utf8proc_data.cpp -- Unicode data tables (16960 lines)
  - third_party/utf8proc/utf8proc_wrapper.cpp -- DuckDB wrapper (411 lines)
  - third_party/utf8proc/include/utf8proc.hpp -- upstream header
  - third_party/utf8proc/include/utf8proc_wrapper.hpp -- wrapper header
Signature
```cpp
// DuckDB Wrapper API (utf8proc_wrapper.hpp)
namespace duckdb {

enum class UnicodeType { INVALID, ASCII, UTF8 };
enum class UnicodeInvalidReason { BYTE_MISMATCH, INVALID_UNICODE };

class Utf8Proc {
public:
    // Validation
    static UnicodeType Analyze(const char *s, size_t len,
                               UnicodeInvalidReason *invalid_reason = nullptr,
                               size_t *invalid_pos = nullptr);
    static bool IsValid(const char *s, size_t len);

    // Normalization
    static char* Normalize(const char *s, size_t len);

    // Error recovery
    static void MakeValid(char *s, size_t len, char special_flag = '?');
    static std::string RemoveInvalid(const char *s, size_t len);

    // Grapheme cluster operations
    static size_t NextGraphemeCluster(const char *s, size_t len, size_t pos);
    static size_t PreviousGraphemeCluster(const char *s, size_t len, size_t pos);
    static size_t GraphemeCount(const char *s, size_t len);
    static GraphemeIterator GraphemeClusters(const char *s, size_t len);

    // Codepoint conversion
    static bool CodepointToUtf8(int cp, int &sz, char *c);
    static int CodepointLength(int cp);
    static int32_t UTF8ToCodepoint(const char *c, int &sz);

    // Case conversion
    static int32_t CodepointToUpper(int32_t codepoint);
    static int32_t CodepointToLower(int32_t codepoint);

    // Render width
    static size_t RenderWidth(const char *s, size_t len, size_t pos);
    static size_t RenderWidth(const std::string &str);
};

// Upstream utf8proc C API (utf8proc.hpp)
// Version: 2.9.0
const char *utf8proc_version(void);
utf8proc_ssize_t utf8proc_iterate(const utf8proc_uint8_t *str,
                                  utf8proc_ssize_t strlen,
                                  utf8proc_int32_t *codepoint_ref);
utf8proc_ssize_t utf8proc_encode_char(utf8proc_int32_t codepoint,
                                      utf8proc_uint8_t *dst);
const utf8proc_property_struct *utf8proc_get_property(utf8proc_int32_t codepoint);
utf8proc_bool utf8proc_grapheme_break_stateful(utf8proc_int32_t cp1,
                                               utf8proc_int32_t cp2,
                                               utf8proc_int32_t *state);
utf8proc_int32_t utf8proc_toupper(utf8proc_int32_t cp);
utf8proc_int32_t utf8proc_tolower(utf8proc_int32_t cp);
utf8proc_uint8_t *utf8proc_NFC(const utf8proc_uint8_t *str, utf8proc_ssize_t len);
utf8proc_uint8_t *utf8proc_NFD(const utf8proc_uint8_t *str, utf8proc_ssize_t len);
utf8proc_uint8_t *utf8proc_NFKC(const utf8proc_uint8_t *str, utf8proc_ssize_t len);
utf8proc_uint8_t *utf8proc_NFKD(const utf8proc_uint8_t *str, utf8proc_ssize_t len);

} // namespace duckdb
```
Import
```cpp
#include "utf8proc_wrapper.hpp"  // DuckDB Utf8Proc wrapper class
#include "utf8proc.hpp"          // Upstream utf8proc C API within duckdb namespace
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s | const char * | Yes | Pointer to the input string bytes (UTF-8 encoded or potentially invalid). |
| len | size_t | Yes | Length of the input string in bytes. |
| invalid_reason | UnicodeInvalidReason * | No | Output parameter set to BYTE_MISMATCH or INVALID_UNICODE on validation failure. |
| invalid_pos | size_t * | No | Output parameter set to the byte offset of the first invalid sequence. |
| cp / codepoint | int / int32_t | Yes (for codepoint functions) | A Unicode codepoint value (0 to 0x10FFFF). |
| pos | size_t | Yes (for grapheme functions) | Current byte position within the string for grapheme cluster traversal. |
| special_flag | char | No | Replacement character for invalid bytes in MakeValid. Defaults to '?'. Must be <= 127 (ASCII). |
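How the invalid_reason and invalid_pos outputs might be populated can be sketched with a plain per-byte validator. Everything below is an assumption made for illustration: the names AnalyzeSketch, SketchType, SketchReason and SketchResult are hypothetical, the split between the two reasons (malformed byte structure versus a well-formed sequence encoding a disallowed value) is a guess at the intent of the enum, and the reported offsets are this sketch's own choice rather than a guarantee about DuckDB's behaviour.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for UnicodeType / UnicodeInvalidReason.
enum class SketchType { INVALID, ASCII, UTF8 };
enum class SketchReason { NONE, BYTE_MISMATCH, INVALID_UNICODE };

struct SketchResult {
    SketchType type;
    SketchReason reason;
    size_t pos; // byte offset of the problem; meaningful only when type == INVALID
};

SketchResult AnalyzeSketch(const char *s, size_t len) {
    SketchType type = SketchType::ASCII;
    for (size_t i = 0; i < len;) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) { i++; continue; }           // plain ASCII byte
        size_t extra = 0;
        uint32_t cp = 0, min_cp = 0;
        if ((c & 0xE0) == 0xC0)      { extra = 1; cp = c & 0x1F; min_cp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min_cp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min_cp = 0x10000; }
        else return {SketchType::INVALID, SketchReason::BYTE_MISMATCH, i}; // stray continuation or bad lead byte
        if (len - i <= extra) {                     // sequence truncated by end of input
            return {SketchType::INVALID, SketchReason::BYTE_MISMATCH, i};
        }
        for (size_t j = 1; j <= extra; j++) {
            unsigned char cc = static_cast<unsigned char>(s[i + j]);
            if ((cc & 0xC0) != 0x80) {              // continuation byte must be 10xxxxxx
                return {SketchType::INVALID, SketchReason::BYTE_MISMATCH, i + j};
            }
            cp = (cp << 6) | (cc & 0x3F);
        }
        // Structurally fine, but reject overlong forms, surrogates and values past U+10FFFF.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return {SketchType::INVALID, SketchReason::INVALID_UNICODE, i};
        }
        type = SketchType::UTF8;
        i += extra + 1;
    }
    return {type, SketchReason::NONE, 0};
}
```

In this sketch a bad continuation byte is reported at its own offset, while truncated, overlong, surrogate, and out-of-range sequences are reported at the offset of their lead byte.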
Outputs
| Name | Type | Description |
|---|---|---|
| Analyze return | UnicodeType | ASCII if all bytes are 0x00-0x7F, UTF8 if valid multi-byte sequences present, INVALID if malformed. |
| Normalize return | char * | Heap-allocated NFC-normalized string. Caller must free() the result. |
| IsValid return | bool | true if the string contains only valid UTF-8 sequences. |
| MakeValid | void (in-place) | Modifies the input buffer, replacing invalid bytes with special_flag. |
| RemoveInvalid return | std::string | New string with all invalid UTF-8 byte sequences removed. |
| NextGraphemeCluster return | size_t | Byte position of the start of the next grapheme cluster after pos. |
| GraphemeCount return | size_t | Total number of grapheme clusters (user-perceived characters) in the string. |
| UTF8ToCodepoint return | int32_t | Unicode codepoint decoded from the UTF-8 sequence at the given position. Sets sz to the byte length consumed. |
| CodepointToUtf8 return | bool | true on success, writes UTF-8 bytes to c and length to sz. false for surrogates or out-of-range codepoints. |
| RenderWidth return | size_t | Display width of a character or string, accounting for East Asian wide characters. |
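The CodepointToUtf8 contract in the table (success flag, bytes written to a buffer, length returned through sz, failure for surrogates and out-of-range values) can be illustrated with a standalone encoder. This is a sketch of standard UTF-8 encoding under that contract, not DuckDB's implementation; EncodeCodepointSketch is a hypothetical name, and the caller is assumed to supply a buffer of at least 4 bytes.

```cpp
#include <cstdint>

// Hypothetical stand-in for the CodepointToUtf8 contract: writes the UTF-8
// encoding of cp to out (at least 4 bytes), stores the byte count in sz, and
// returns false for surrogates or values outside 0..0x10FFFF.
bool EncodeCodepointSketch(int32_t cp, int &sz, char *out) {
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        sz = 0;
        return false;
    }
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out[0] = static_cast<char>(cp);
        sz = 1;
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out[0] = static_cast<char>(0xC0 | (cp >> 6));
        out[1] = static_cast<char>(0x80 | (cp & 0x3F));
        sz = 2;
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out[0] = static_cast<char>(0xE0 | (cp >> 12));
        out[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<char>(0x80 | (cp & 0x3F));
        sz = 3;
    } else {                                // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = static_cast<char>(0xF0 | (cp >> 18));
        out[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out[3] = static_cast<char>(0x80 | (cp & 0x3F));
        sz = 4;
    }
    return true;
}
```

Encoding U+00E9 this way yields the two bytes C3 A9, and U+4E2D yields E4 B8 AD, the same byte sequences that appear in the usage examples on this page.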
Usage Examples
```cpp
#include <cstdio>   // printf
#include <cstdlib>  // free
#include <cstring>  // strlen
#include <string>
#include "utf8proc_wrapper.hpp"
using namespace duckdb;

int main() {
    // Validate a UTF-8 string with detailed error reporting
    const char *input = "Hello, \xc3\x28 world";
    UnicodeInvalidReason reason;
    size_t error_pos;
    UnicodeType type = Utf8Proc::Analyze(input, strlen(input), &reason, &error_pos);
    if (type == UnicodeType::INVALID) {
        // reason == UnicodeInvalidReason::BYTE_MISMATCH, error_pos == 8
        printf("Invalid UTF-8 at byte %zu\n", error_pos);
    }

    // Quick validity check
    if (Utf8Proc::IsValid("valid UTF-8 string", 18)) {
        printf("String is valid UTF-8\n");
    }

    // NFC normalization (caller must free result)
    const char *denormalized = "e\xcc\x81"; // 'e' + combining acute accent
    char *normalized = Utf8Proc::Normalize(denormalized, 3);
    // normalized now contains "\xc3\xa9" (precomposed e-acute)
    free(normalized);

    // Fix invalid UTF-8 in place
    char buffer[] = "Hello\x80World";
    Utf8Proc::MakeValid(buffer, sizeof(buffer) - 1, '?');
    // buffer is now "Hello?World"

    // Count grapheme clusters (user-perceived characters)
    const char *emoji = "\xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9"; // man + ZWJ + woman (emoji ZWJ sequence)
    size_t count = Utf8Proc::GraphemeCount(emoji, strlen(emoji));

    // Iterate over grapheme clusters
    for (auto cluster : Utf8Proc::GraphemeClusters("cafe\xcc\x81", 6)) {
        printf("Cluster: bytes %zu-%zu\n", cluster.start, cluster.end);
    }

    // Codepoint conversion
    int32_t upper = Utf8Proc::CodepointToUpper(0x00E9); // e-acute -> E-acute (0x00C9)
    int32_t lower = Utf8Proc::CodepointToLower(0x0041); // 'A' -> 'a' (0x0061)

    // Render width (East Asian characters are width 2)
    size_t width = Utf8Proc::RenderWidth(std::string("\xe4\xb8\xad")); // CJK character, width = 2
    return 0;
}
```