Implementation:Ggml org Llama cpp Unicode Data

From Leeroopedia
Knowledge Sources
Domains Unicode, Tokenization
Last Updated 2026-02-15 00:00 GMT

Overview

Contains auto-generated Unicode character database tables used for codepoint classification, case mapping, and normalization.

Description

This file defines large static data tables generated by `scripts/gen-unicode-data.py`: `unicode_ranges_flags` maps codepoint ranges to category bitflags (letter, number, punctuation, etc.); `unicode_set_whitespace` lists whitespace codepoints; `unicode_map_lowercase` and `unicode_map_uppercase` provide case conversion mappings; and `unicode_ranges_nfd` holds NFD (Normalization Form D, canonical decomposition) data. Together the tables cover the full Unicode codepoint space, 0x0 through 0x10FFFF (every value below 0x110000).
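To make the range encoding concrete, here is a minimal sketch of a lookup over a table shaped like `unicode_ranges_flags`. The table `sample_ranges_flags` and the function `lookup_flags` are hypothetical names introduced for illustration, the entries are a simplified ASCII-only excerpt rather than the generated data, and the flag value 0x0000 is a placeholder for categories not modeled here. The sketch assumes each entry marks the first codepoint of a range that extends up to the next entry's start.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical, simplified excerpt in the shape of unicode_ranges_flags:
// {first codepoint of the range, category flags}. Entries are sorted by start.
static const std::vector<std::pair<uint32_t, uint16_t>> sample_ranges_flags = {
    {0x000000, 0x0000}, // placeholder: categories not modeled in this sketch
    {0x000030, 0x0002}, // '0'..'9' -> number
    {0x00003A, 0x0000}, // placeholder
    {0x000041, 0x0004}, // 'A'..'Z' -> letter
    {0x00005B, 0x0000}, // placeholder
    {0x000061, 0x0004}, // 'a'..'z' -> letter
    {0x00007B, 0x0000}, // placeholder; also closes the 'a'..'z' range
};

// Return the flags of the range that contains cpt.
uint16_t lookup_flags(uint32_t cpt) {
    // Find the first entry whose start is greater than cpt ...
    auto it = std::upper_bound(
        sample_ranges_flags.begin(), sample_ranges_flags.end(), cpt,
        [](uint32_t value, const std::pair<uint32_t, uint16_t> & entry) {
            return value < entry.first;
        });
    if (it == sample_ranges_flags.begin()) {
        return 0; // cpt precedes every range start
    }
    // ... then step back one entry to reach the containing range.
    return std::prev(it)->second;
}
```

A binary search keeps the sketch short; a consumer that needs constant-time lookups could instead expand the ranges once into a flat per-codepoint array, trading memory for speed.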

Usage

This is a data-only file that provides the Unicode character properties required by the tokenization system. Because it is auto-generated, do not edit it by hand; re-run the generation script `scripts/gen-unicode-data.py` to update the data.

Code Reference

Source Location

Signature

// Codepoint range to category bitflags mapping
const std::initializer_list<std::pair<uint32_t, uint16_t>> unicode_ranges_flags;

// Whitespace codepoint set
const std::unordered_set<uint32_t> unicode_set_whitespace;

// Case conversion mappings
const std::unordered_map<uint32_t, uint32_t> unicode_map_lowercase;
const std::unordered_map<uint32_t, uint32_t> unicode_map_uppercase;

// NFD normalization data
const std::vector<std::pair<uint32_t, std::vector<uint32_t>>> unicode_ranges_nfd;

Import

#include "unicode-data.h"
#include <cstdint>
#include <vector>
#include <unordered_map>
#include <unordered_set>

I/O Contract

Inputs

Name | Type | Required | Description
(none) | - | - | This is a data-only file with no runtime inputs; data is generated by scripts/gen-unicode-data.py

Outputs

Name | Type | Description
unicode_ranges_flags | initializer_list<pair<uint32_t, uint16_t>> | Codepoint ranges with category bitflags (letter=0x0004, number=0x0002, punctuation=0x0020, etc.)
unicode_set_whitespace | unordered_set<uint32_t> | Set of Unicode whitespace codepoints
unicode_map_lowercase | unordered_map<uint32_t, uint32_t> | Uppercase-to-lowercase codepoint mappings
unicode_map_uppercase | unordered_map<uint32_t, uint32_t> | Lowercase-to-uppercase codepoint mappings
unicode_ranges_nfd | vector<pair<uint32_t, vector<uint32_t>>> | NFD canonical decomposition data
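Because the category values are bitflags, a single `uint16_t` can carry more than one category, and a consumer tests membership with a bitwise AND. The sketch below uses only the three flag values listed in the table above; the constant names and the `describe_flags` helper are hypothetical and chosen for illustration.

```cpp
#include <cstdint>
#include <string>

// Flag values as listed in the Outputs table; the constant names are assumptions.
constexpr uint16_t FLAG_NUMBER      = 0x0002;
constexpr uint16_t FLAG_LETTER      = 0x0004;
constexpr uint16_t FLAG_PUNCTUATION = 0x0020;

// Build a readable label such as "letter" or "number|punctuation".
std::string describe_flags(uint16_t flags) {
    std::string out;
    auto append = [&out](const char * name) {
        if (!out.empty()) {
            out += '|';
        }
        out += name;
    };
    if (flags & FLAG_NUMBER)      { append("number"); }
    if (flags & FLAG_LETTER)      { append("letter"); }
    if (flags & FLAG_PUNCTUATION) { append("punctuation"); }
    if (out.empty()) {
        out = "other"; // none of the three flags modeled here is set
    }
    return out;
}
```

For example, `describe_flags(0x0004)` yields "letter", while a value with none of these three bits set falls through to "other".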

Usage Examples

// These tables are consumed by the functions in unicode.cpp rather than accessed directly.
// The lines below illustrate the table contents; they are notation, not compilable code.

// Category flags: each entry gives the first codepoint of a range and the flags for that range.
// A range extends up to the first codepoint of the next entry.
// unicode_ranges_flags: {0x000041, 0x0004} -> the range starting at 0x41 ('A') has flag 0x0004 (letter)
// unicode_ranges_flags: {0x000030, 0x0002} -> the range starting at 0x30 ('0') has flag 0x0002 (number)

// Case conversion: each entry maps one codepoint to its counterpart in the other case.
// unicode_map_lowercase: 0x0041 -> 0x0061  ('A' -> 'a')
// unicode_map_uppercase: 0x0061 -> 0x0041  ('a' -> 'A')
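
The case and whitespace tables can be queried as ordinary lookups. The following is a minimal sketch, assuming the `unordered_map` and `unordered_set` types shown in the Signature section; `sample_map_lowercase`, `sample_set_whitespace`, `to_lower`, and `is_whitespace` are hypothetical names, and the sample entries are a small hand-picked excerpt rather than the generated data. It uses `find` instead of `operator[]` because the tables are `const`, and it returns the input unchanged when no mapping exists.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical excerpt in the shape of unicode_map_lowercase.
static const std::unordered_map<uint32_t, uint32_t> sample_map_lowercase = {
    {0x0041, 0x0061}, // 'A' -> 'a'
    {0x00C9, 0x00E9}, // 'É' -> 'é'
    {0x0391, 0x03B1}, // 'Α' (Greek capital alpha) -> 'α'
};

// Hypothetical excerpt in the shape of unicode_set_whitespace.
static const std::unordered_set<uint32_t> sample_set_whitespace = {
    0x0009, 0x000A, 0x0020, 0x00A0, 0x3000,
};

// Map a codepoint to lowercase; codepoints without an entry are returned as-is.
uint32_t to_lower(uint32_t cpt) {
    const auto it = sample_map_lowercase.find(cpt);
    return it == sample_map_lowercase.end() ? cpt : it->second;
}

// Check membership in the whitespace set.
bool is_whitespace(uint32_t cpt) {
    return sample_set_whitespace.count(cpt) != 0;
}
```

Falling back to the input codepoint keeps the helper total: digits, punctuation, and already-lowercase letters pass through untouched.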
