Implementation:Ggml org Llama cpp Unicode Data
| Knowledge Sources | |
|---|---|
| Domains | Unicode, Tokenization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
`src/unicode-data.cpp` contains auto-generated Unicode character database tables used for codepoint classification, case mapping, and normalization.
Description
This file defines large static data structures generated by `scripts/gen-unicode-data.py`. The tables are: `unicode_ranges_flags`, which maps codepoint ranges to category bitflags (letter, number, punctuation, etc.); `unicode_set_whitespace`, which lists whitespace codepoints; `unicode_map_lowercase` and `unicode_map_uppercase`, which provide case conversion mappings; and `unicode_ranges_nfd`, which contains NFD (Normalization Form D, canonical decomposition) data. The tables draw on the full Unicode code space, U+0000 through U+10FFFF (that is, every value below 0x110000).
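To illustrate how a range-to-flags table of this shape can be queried, here is a minimal sketch. It assumes each entry is `{first codepoint of range, flags}`, that entries are sorted by codepoint, and that the table ends with a sentinel at 0x110000. The table contents, `sample_ranges_flags`, and `sample_lookup_flags` are hypothetical and heavily simplified; they are not the generated data or a function from the project.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical miniature of a {range start, flags} table. Each entry opens a
// range that runs until the next entry's start. Only part of ASCII is
// modelled, so results above 0x5A are not meaningful in this sample.
static const std::vector<std::pair<uint32_t, uint16_t>> sample_ranges_flags = {
    {0x000000, 0x0080}, // illustrative flag for 0x00..0x2F (simplified)
    {0x000030, 0x0002}, // '0'..'9' : number
    {0x00003A, 0x0020}, // ':'..'@' : punctuation (simplified)
    {0x000041, 0x0004}, // 'A'..'Z' : letter
    {0x00005B, 0x0020}, // '[' onward : punctuation (simplified, rest omitted)
    {0x110000, 0x0000}, // sentinel: one past the last valid codepoint
};

// Return the flags of the range containing cpt, or 0 if cpt is not a valid
// Unicode codepoint.
uint16_t sample_lookup_flags(uint32_t cpt) {
    if (cpt >= 0x110000) {
        return 0;
    }
    // Find the first entry whose range start is strictly greater than cpt.
    auto it = std::upper_bound(
        sample_ranges_flags.begin(), sample_ranges_flags.end(), cpt,
        [](uint32_t value, const std::pair<uint32_t, uint16_t> & entry) {
            return value < entry.first;
        });
    // The entry just before it is the range that contains cpt. This is safe
    // because the first entry starts at 0, so it can never be begin().
    return std::prev(it)->second;
}
```

A binary search keeps the table compact at the cost of a logarithmic lookup; expanding the ranges once into a flat per-codepoint array is the usual alternative when lookups dominate.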
Usage
This is a data-only file that provides the Unicode character properties required by the tokenization system. Because it is auto-generated, do not edit it by hand; re-run `scripts/gen-unicode-data.py` to update the data.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/unicode-data.cpp
- Lines: 1-7034
Signature
// Codepoint range to category bitflags mapping
const std::initializer_list<std::pair<uint32_t, uint16_t>> unicode_ranges_flags;
// Whitespace codepoint set
const std::unordered_set<uint32_t> unicode_set_whitespace;
// Case conversion mappings
const std::unordered_map<uint32_t, uint32_t> unicode_map_lowercase;
const std::unordered_map<uint32_t, uint32_t> unicode_map_uppercase;
// NFD normalization data
const std::vector<std::pair<uint32_t, std::vector<uint32_t>>> unicode_ranges_nfd;
Import
#include "unicode-data.h"
#include <cstdint>
#include <vector>
#include <unordered_map>
#include <unordered_set>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | - | - | This is a data-only file with no runtime inputs; the data is generated by `scripts/gen-unicode-data.py` |
Outputs
| Name | Type | Description |
|---|---|---|
| unicode_ranges_flags | initializer_list<pair<uint32_t, uint16_t>> | Codepoint ranges with category bitflags (letter=0x0004, number=0x0002, punctuation=0x0020, etc.) |
| unicode_set_whitespace | unordered_set<uint32_t> | Set of Unicode whitespace codepoints |
| unicode_map_lowercase | unordered_map<uint32_t, uint32_t> | Uppercase-to-lowercase codepoint mappings |
| unicode_map_uppercase | unordered_map<uint32_t, uint32_t> | Lowercase-to-uppercase codepoint mappings |
| unicode_ranges_nfd | vector<pair<uint32_t, vector<uint32_t>>> | NFD canonical decomposition data |
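Because the category values are bitflags, one `uint16_t` can carry several categories at once. The sketch below decodes such a value using only the three flag values quoted in the table above. The constant names and `sample_describe_flags` are hypothetical, and any other bits are reported as "other" rather than guessed at.

```cpp
#include <cstdint>
#include <string>

// Flag values as quoted in the Outputs table; the names are hypothetical.
constexpr uint16_t SAMPLE_FLAG_NUMBER      = 0x0002;
constexpr uint16_t SAMPLE_FLAG_LETTER      = 0x0004;
constexpr uint16_t SAMPLE_FLAG_PUNCTUATION = 0x0020;

// Build a human-readable description of the known bits set in flags.
std::string sample_describe_flags(uint16_t flags) {
    std::string out;
    // Append name to out when the given bit is set, separated by '|'.
    auto add = [&](uint16_t bit, const char * name) {
        if (flags & bit) {
            if (!out.empty()) {
                out += '|';
            }
            out += name;
        }
    };
    add(SAMPLE_FLAG_NUMBER,      "number");
    add(SAMPLE_FLAG_LETTER,      "letter");
    add(SAMPLE_FLAG_PUNCTUATION, "punctuation");
    // No known bit set: fall back to a generic label.
    return out.empty() ? "other" : out;
}
```

For example, `sample_describe_flags(0x0004)` yields `letter`, and a combined value such as `0x0006` yields `number|letter`.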
Usage Examples
// These tables are consumed by functions in unicode.cpp rather than used directly by callers.
//
// Category flags: each unicode_ranges_flags entry is {first codepoint of range, flags}.
//   {0x000041, 0x0004}  the range starting at 0x41 ('A') has flag 0x0004 (letter)
//   {0x000030, 0x0002}  the range starting at 0x30 ('0') has flag 0x0002 (number)
//
// Case conversion: each entry maps a source codepoint to its converted form.
//   unicode_map_lowercase: 0x0041 -> 0x0061 ('A' -> 'a')
//   unicode_map_uppercase: 0x0061 -> 0x0041 ('a' -> 'A')
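A case-mapping table only needs entries for codepoints that actually change, so a consumer typically falls back to the input when no entry exists. The following is a minimal sketch of that pattern, assuming the map type shown in the Signature section. The data in `sample_map_lowercase` and the function `sample_to_lower` are hypothetical stand-ins, not the generated table or the project's own helper.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical three-entry stand-in for an uppercase-to-lowercase table.
static const std::unordered_map<uint32_t, uint32_t> sample_map_lowercase = {
    {0x0041, 0x0061}, // 'A' -> 'a'
    {0x005A, 0x007A}, // 'Z' -> 'z'
    {0x00C9, 0x00E9}, // U+00C9 (E with acute) -> U+00E9 (e with acute)
};

// Return the lowercase form of cpt, or cpt itself when the table has no entry
// (already lowercase, or a codepoint with no case distinction).
uint32_t sample_to_lower(uint32_t cpt) {
    // find() is used instead of operator[] because the map is const and
    // operator[] would try to insert a missing key.
    const auto it = sample_map_lowercase.find(cpt);
    return it == sample_map_lowercase.end() ? cpt : it->second;
}
```

The identity fallback means digits, punctuation, and already-lowercase letters pass through unchanged.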