Principle:Duckdb Duckdb Unicode Processing
| Knowledge Sources | |
|---|---|
| Domains | Text_Processing, Character_Encoding |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A comprehensive system for encoding, normalizing, and transforming text represented in the Unicode character set, including UTF-8 encoding/decoding, case folding, and normalization forms.
Description
Unicode is the universal character encoding standard that assigns a unique code point to every character across all writing systems. UTF-8 is the dominant encoding of Unicode, using 1 to 4 bytes per code point in a backward-compatible extension of ASCII. Correct Unicode processing is essential for any database system that handles international text data.
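UTF-8 encoding uses a variable-width format where ASCII characters (U+0000 to U+007F) use one byte, characters up to U+07FF use two bytes, characters up to U+FFFF use three bytes, and characters up to U+10FFFF use four bytes. The encoding is self-synchronizing: lead bytes and continuation bytes have distinct bit patterns, so the start of the enclosing character can be found from any position in the byte stream by stepping back at most three bytes. This makes it cheap to recover a character boundary after seeking to an arbitrary byte offset, although locating the n-th character still requires a scan because the encoding is variable-width.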
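The self-synchronizing property can be illustrated with a short Python sketch. It relies only on the fact that continuation bytes always have the bit pattern 10xxxxxx; the helper names `is_continuation` and `char_start` are made up for this illustration and are not DuckDB functions, and the input is assumed to be valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte of the code point covering data[pos].

    Assumes data is valid UTF-8 and 0 <= pos < len(data).
    """
    # Step backwards over continuation bytes until a lead byte is reached.
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


# One 1-byte, one 2-byte, one 3-byte and one 4-byte code point: 10 bytes total.
data = "a\u00e9\u20ac\U0001F600".encode("utf-8")

# For every byte offset, find where the enclosing code point begins.
starts = [char_start(data, i) for i in range(len(data))]
```

Because a code point occupies at most four bytes, the backward scan in `char_start` never takes more than three steps on valid input.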
Unicode normalization addresses the fact that some characters can be represented in multiple ways. For example, "é" (e with acute accent) can be a single code point (U+00E9, precomposed) or two code points (U+0065 + U+0301, decomposed). Normalization converts all equivalent representations to one canonical form so that they compare equal. The four normalization forms are NFC (composed), NFD (decomposed), NFKC (compatibility composed), and NFKD (compatibility decomposed). The two compatibility forms additionally replace compatibility characters, such as ligatures, with their plain equivalents.
Case folding is the process of mapping characters to a common form for case-insensitive comparison. Unlike simple lowercasing, full case folding handles characters whose folded form has a different length (e.g., German sharp-s folds to "ss"). The default folding is locale-independent; a few languages, notably Turkish and Azeri with their dotted and dotless I, need separate language-specific mappings.
Usage
Unicode processing is used throughout DuckDB's string functions. Functions like `upper()`, `lower()`, `length()`, `substr()`, `like`, and `similar to` must correctly handle multi-byte UTF-8 characters, counting and slicing by character rather than by byte. Language-aware string comparison for ORDER BY and JOIN operations requires Unicode collation rather than raw byte comparison. The normalization capabilities ensure consistent text matching regardless of how characters were originally encoded.
Theoretical Basis
UTF-8 Encoding Scheme:
| Code Point Range | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| U+0000 - U+007F | 0xxxxxxx | | | |
| U+0080 - U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 - U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 - U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
// Encoding example: U+00E9 (e with accent) = 0xC3 0xA9
// 0x00E9 = 0000 0000 1110 1001
// -> 110_00011 10_101001 = 0xC3 0xA9
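The encoding scheme above can be written out as a small Python function. This is a minimal sketch for illustration, not DuckDB's implementation; the name `utf8_encode` is hypothetical, and the result is checked against Python's built-in encoder.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 bytes."""
    # Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The worked example from the text: U+00E9 encodes to 0xC3 0xA9.
e_acute = utf8_encode(0x00E9)

# Cross-check one code point from each row of the table against Python's encoder.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
matches = all(utf8_encode(cp) == chr(cp).encode("utf-8") for cp in samples)
```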
Character Length Counting: Counting code points (not bytes) in valid UTF-8:
function utf8_strlen(bytes):
    count = 0
    i = 0
    while i < len(bytes):
        // The lead byte determines the sequence length (input assumed valid)
        if bytes[i] < 0x80: i += 1        // 1-byte char (0xxxxxxx)
        else if bytes[i] < 0xE0: i += 2   // 2-byte char (110xxxxx)
        else if bytes[i] < 0xF0: i += 3   // 3-byte char (1110xxxx)
        else: i += 4                      // 4-byte char (11110xxx)
        count += 1
    return count
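A common alternative to stepping by lead byte is to count every byte that is not a continuation byte (10xxxxxx). The Python sketch below uses that alternative technique rather than the loop above; `count_code_points` is a hypothetical name, and valid UTF-8 input is assumed.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting non-continuation bytes."""
    # Every code point contributes exactly one byte that is not 10xxxxxx.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "a\u00e9\u20ac\U0001F600"          # 1-, 2-, 3- and 4-byte code points
encoded = text.encode("utf-8")

byte_length = len(encoded)                # length in bytes
code_point_length = count_code_points(encoded)

# Code points are not the same as user-perceived characters:
# "e" followed by a combining acute accent is two code points.
decomposed_length = count_code_points("e\u0301".encode("utf-8"))
```

This form has no data-dependent branching on sequence length, which makes it straightforward to vectorize, but like the loop above it reports code points rather than grapheme clusters.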
Normalization: Converting to canonical form:
// NFD: Canonical Decomposition
// Decompose precomposed characters, then sort combining marks
"e\u0301" (e + combining accent) is already NFD
"\u00E9" (precomposed e-accent) -> "e\u0301"
// NFC: Canonical Decomposition + Canonical Composition
// First decompose (NFD), then recompose where possible
"e\u0301" -> "\u00E9"
// Canonical ordering: combining marks sorted by class
// "\u0327\u0301" (cedilla + accent) -> same order (class 202, 230)
// "\u0301\u0327" (accent + cedilla) -> reordered to "\u0327\u0301"
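The same transformations can be reproduced with Python's standard `unicodedata` module. This is an illustration of the normalization forms themselves, not of DuckDB's internal code; the helper `canonically_equal` is a hypothetical name introduced for the example.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"       # precomposed e with acute accent
decomposed = "e\u0301"    # e + combining acute accent

nfd = unicodedata.normalize("NFD", composed)      # decomposes
nfc = unicodedata.normalize("NFC", decomposed)    # recomposes

# Canonical combining classes drive the reordering step:
# cedilla (U+0327) sorts before acute accent (U+0301).
classes = (unicodedata.combining("\u0327"), unicodedata.combining("\u0301"))
reordered = unicodedata.normalize("NFD", "a\u0301\u0327")

# Compatibility forms also replace characters such as the "fi" ligature.
ligature_nfc = unicodedata.normalize("NFC", "\ufb01")
ligature_nfkc = unicodedata.normalize("NFKC", "\ufb01")
```

Comparing raw strings would treat `composed` and `decomposed` as different, while `canonically_equal` treats them as the same text.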
Case Folding:
// Simple case folding: 1-to-1 code point mapping
'A' -> 'a', 'B' -> 'b', ..., 'Z' -> 'z'
// Full case folding: may change string length
'\u00DF' (sharp-s) -> "ss" // 1 char -> 2 chars
'\u0130' (I-dot) -> "i\u0307" // 1 char -> 2 chars
// Locale-sensitive: Turkish/Azeri
'I' -> '\u0131' (dotless i) in Turkish
'I' -> 'i' in other languages
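Python's built-in `str.casefold()` applies locale-independent full case folding, so it can be used to check the mappings above. The standard library has no Turkish-specific folding, so the `turkic_casefold` helper below is a hypothetical sketch that applies the two Turkish/Azeri mappings by hand before the default folding.

```python
def turkic_casefold(text: str) -> str:
    """Hypothetical helper: apply Turkish/Azeri I mappings, then default folding."""
    # In Turkish and Azeri, 'I' folds to dotless i and dotted capital I folds to 'i'.
    return text.replace("I", "\u0131").replace("\u0130", "i").casefold()


# Full case folding may change the string length.
sharp_s = "\u00df".casefold()      # German sharp-s
dotted_i = "\u0130".casefold()     # capital I with dot above

# Case-insensitive comparison via folding on both sides.
same_word = "STRASSE".casefold() == "stra\u00dfe".casefold()

# Default folding versus the Turkish/Azeri variant.
default_fold = "ISTANBUL".casefold()
turkish_fold = turkic_casefold("ISTANBUL")
turkish_dotted = turkic_casefold("\u0130stanbul")
```

Folding both operands, rather than lowercasing one of them, keeps the comparison symmetric for characters like sharp-s that have no single-character uppercase counterpart in common use.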