Principle:Duckdb Duckdb Unicode Processing

From Leeroopedia


Knowledge Sources
Domains Text_Processing, Character_Encoding
Last Updated 2026-02-07 12:00 GMT

Overview

A comprehensive system for encoding, normalizing, and transforming text represented in the Unicode character set, including UTF-8 encoding/decoding, case folding, and normalization forms.

Description

Unicode is the universal character encoding standard that assigns a unique code point to every character across all writing systems. UTF-8 is the dominant encoding of Unicode, using 1 to 4 bytes per code point while remaining backward compatible with ASCII. Correct Unicode processing is essential for any database system that handles international text data.

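UTF-8 is a variable-width encoding: ASCII characters (U+0000 to U+007F) use one byte, code points up to U+07FF use two bytes, code points up to U+FFFF use three bytes, and code points up to U+10FFFF use four bytes. The encoding is self-synchronizing: lead bytes and continuation bytes (10xxxxxx) have distinct bit patterns, so the start of a character can be found from any position in the byte stream by skipping back over at most three continuation bytes. This makes it cheap to recover character boundaries at an arbitrary byte offset, although locating the n-th character still requires a scan because characters vary in width.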
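The self-synchronizing property can be sketched in a few lines of Python. This is an illustration only, not DuckDB's code; the helper name `char_start` is hypothetical and the input is assumed to be well-formed UTF-8.

```python
def char_start(data: bytes, pos: int) -> int:
    """Return the index of the lead byte of the character covering data[pos].

    Assumes `data` is well-formed UTF-8 and 0 <= pos < len(data).
    """
    # Continuation bytes always match the bit pattern 10xxxxxx.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


# 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")

start_of_euro = char_start(sample, 5)   # byte 5 is the last byte of U+20AC
start_of_emoji = char_start(sample, 9)  # byte 9 is the last byte of U+1F600
```

Because a character has at most three continuation bytes, the loop steps back at most three times on valid input.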

Unicode normalization addresses the fact that some characters can be represented in multiple ways. For example, "é" (e with acute accent) can be a single precomposed code point (U+00E9) or a decomposed sequence of two code points (U+0065 + U+0301). Normalization converts equivalent representations to one form so that they compare equal. The four normalization forms are NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition); the compatibility forms additionally replace compatibility characters such as ligatures with their plain equivalents.
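As a rough illustration of why normalization matters for matching, the sketch below uses Python's standard `unicodedata` module rather than DuckDB's own implementation. The helper `canonically_equal` is a hypothetical name introduced for this example.

```python
import unicodedata

precomposed = "\u00e9"   # single code point: e with acute accent
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw code point sequences differ, so a plain comparison fails.
raw_equal = precomposed == decomposed
norm_equal = canonically_equal(precomposed, decomposed)

# Compatibility forms also replace compatibility characters, such as the
# "fi" ligature U+FB01, which the canonical forms leave untouched.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
```

Which form to normalize to is a design choice; what matters for matching is that both sides of a comparison use the same one.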

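Case folding converts characters to a common form for case-insensitive comparison. Simple case folding maps each code point to exactly one code point, while full case folding also covers mappings that change the string length (e.g., German sharp-s folds to "ss"). Some mappings are language-specific: in Turkish and Azeri, dotted and dotless I form separate case pairs, so folding text in those languages needs locale-aware rules.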

Usage

Unicode processing is used throughout DuckDB's string functions. Functions and operators such as `upper()`, `lower()`, `length()`, `substr()`, `LIKE`, and `SIMILAR TO` must operate on whole characters rather than on the individual bytes of multi-byte UTF-8 sequences. Linguistically correct string comparison for ORDER BY and JOIN operations requires Unicode collation. Normalization ensures consistent text matching regardless of whether characters were stored in composed or decomposed form.
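To show why substring operations must work on character boundaries, here is a minimal Python sketch of a code-point-based substring over raw UTF-8 bytes. The function `utf8_substr` is hypothetical, uses 0-based indexing, and assumes well-formed input with non-negative arguments; it does not reproduce the semantics of DuckDB's `substr()`.

```python
def utf8_substr(data: bytes, start: int, length: int) -> bytes:
    """Return `length` code points beginning at code point index `start`.

    Assumes well-formed UTF-8, 0-based `start`, and non-negative arguments.
    """
    # Byte offsets at which each code point begins, plus the end offset.
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    starts.append(len(data))
    last = len(starts) - 1
    begin = starts[min(start, last)]
    end = starts[min(start + length, last)]
    return data[begin:end]


data = "na\u00efve \u20ac5".encode("utf-8")  # 8 code points, 11 bytes

middle = utf8_substr(data, 2, 3).decode("utf-8")
tail = utf8_substr(data, 6, 10).decode("utf-8")

# A byte-based slice can cut a multi-byte character in half.
try:
    data[2:3].decode("utf-8")
    byte_slice_ok = True
except UnicodeDecodeError:
    byte_slice_ok = False
```

The byte-based slice yields an incomplete sequence that cannot be decoded, while the code-point-based version always returns valid UTF-8.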

Theoretical Basis

UTF-8 Encoding Scheme:

Code Point Range         | Byte 1   | Byte 2   | Byte 3   | Byte 4
U+0000   - U+007F       | 0xxxxxxx |          |          |
U+0080   - U+07FF       | 110xxxxx | 10xxxxxx |          |
U+0800   - U+FFFF       | 1110xxxx | 10xxxxxx | 10xxxxxx |
U+10000  - U+10FFFF     | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx

// Encoding example: U+00E9 (e with accent) = 0xC3 0xA9
// 0x00E9 = 0000 0000 1110 1001
// -> 110_00011 10_101001 = 0xC3 0xA9
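The bit layout in the table can be turned into a small encoder. The Python sketch below is illustrative only: `utf8_encode` is a hypothetical helper, and in practice Python's built-in `str.encode("utf-8")` would be used instead.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the table above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogates and out-of-range values are not encodable in UTF-8.
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


encoded_e_acute = utf8_encode(0x00E9)  # the worked example: 0xC3 0xA9
```

Each branch shifts out the payload bits six at a time and adds the lead-byte or continuation-byte prefix from the table.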

Character Length Counting: counting code points rather than bytes:

function utf8_strlen(bytes):
    // Assumes well-formed UTF-8; counts code points, not grapheme clusters
    count = 0
    i = 0
    while i < len(bytes):
        if bytes[i] < 0x80:       i += 1    // 0xxxxxxx: 1-byte char
        else if bytes[i] < 0xE0:  i += 2    // 110xxxxx: 2-byte char
        else if bytes[i] < 0xF0:  i += 3    // 1110xxxx: 3-byte char
        else:                      i += 4    // 11110xxx: 4-byte char
        count += 1
    return count
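A common alternative to walking lead bytes is to count every byte that is not a continuation byte. The Python sketch below shows that variant under the same assumption of well-formed input; the function name simply mirrors the pseudocode above and is not a DuckDB API.

```python
def utf8_strlen(data: bytes) -> int:
    """Count code points by counting bytes that are not 10xxxxxx."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


sample = "na\u00efve \u20ac5 \U0001F600".encode("utf-8")
byte_count = len(sample)
char_count = utf8_strlen(sample)
```

This form never reads past the end of the buffer on truncated input, since it inspects each byte independently instead of jumping ahead by the length declared in the lead byte.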

Normalization: Converting to canonical form:

// NFD: Canonical Decomposition
// Decompose precomposed characters, then sort combining marks
"e\u0301" (e + combining accent) is already NFD
"\u00E9"  (precomposed e-accent) -> "e\u0301"

// NFC: Canonical Decomposition + Canonical Composition
// First decompose (NFD), then recompose where possible
"e\u0301" -> "\u00E9"

// Canonical ordering: combining marks sorted by ascending combining class
// "\u0327\u0301" (cedilla + accent) -> unchanged, already ordered (classes 202, 230)
// "\u0301\u0327" (accent + cedilla) -> reordered to "\u0327\u0301"
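The decomposition, composition, and canonical ordering steps above can be checked with Python's standard `unicodedata` module. This is a sketch for illustration, not DuckDB's implementation; `code_points` is a hypothetical helper that makes the results readable.

```python
import unicodedata


def code_points(s: str) -> list:
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in s]


# NFD splits the precomposed character; NFC recombines the sequence.
nfd = unicodedata.normalize("NFD", "\u00e9")
nfc = unicodedata.normalize("NFC", "e\u0301")

# Canonical combining classes drive the reordering step.
cedilla_class = unicodedata.combining("\u0327")
acute_class = unicodedata.combining("\u0301")

# Accent followed by cedilla is reordered so the lower class comes first.
reordered = unicodedata.normalize("NFD", "e\u0301\u0327")
```

Inspecting `code_points(reordered)` shows the cedilla placed before the acute accent, matching the ordering rule described above.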

Case Folding:

// Simple case folding: 1-to-1 code point mapping
'A' -> 'a', 'B' -> 'b', ..., 'Z' -> 'z'

// Full case folding: may change string length
'\u00DF' (sharp-s) -> "ss"        // 1 char -> 2 chars
'\u0130' (I-dot)   -> "i\u0307"   // 1 char -> 2 chars

// Locale-sensitive: Turkish/Azeri
'I' -> '\u0131' (dotless i) in Turkish
'I' -> 'i' in other languages
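Python's built-in `str.casefold()` applies full case folding without locale tailoring, which makes it a convenient way to try the mappings above. The Turkish variant below is a hypothetical helper written for this example; Python has no built-in locale-aware folding, and this sketch covers only the dotted and dotless I pairs.

```python
def turkish_casefold(s: str) -> str:
    """Hypothetical Turkish/Azeri folding: handle the I pairs, then fold the rest."""
    # In Turkish, 'I' pairs with dotless i (U+0131) and
    # dotted capital I (U+0130) pairs with plain 'i'.
    return s.replace("I", "\u0131").replace("\u0130", "i").casefold()


sharp_s = "\u00df".casefold()         # full folding expands to two characters
dotted_capital = "\u0130".casefold()  # 'i' followed by COMBINING DOT ABOVE

# Lowercasing alone does not make these two spellings match; folding does.
lower_match = "STRASSE".lower() == "stra\u00dfe".lower()
fold_match = "STRASSE".casefold() == "stra\u00dfe".casefold()

default_i = "I".casefold()
turkish_i = turkish_casefold("I")
```

Folding is intended for comparison only; the folded string is not necessarily suitable for display.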
