Implementation:Ggml org Llama cpp Unicode Data

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Unicode, Tokenization
Last Updated	2026-02-15 00:00 GMT

Overview

Contains auto-generated Unicode character database tables used for codepoint classification, case mapping, and normalization.

Description

This file defines large static data structures generated by `scripts/gen-unicode-data.py`. `unicode_ranges_flags` maps codepoint ranges to category bitflags (letter, number, punctuation, etc.); each entry records the codepoint at which a range starts. `unicode_set_whitespace` lists whitespace codepoints. `unicode_map_lowercase` and `unicode_map_uppercase` provide case conversion mappings. `unicode_ranges_nfd` contains NFD (Normalization Form D, canonical decomposition) data. Together the tables cover the full Unicode codepoint space, 0x0 through 0x10FFFF, with 0x110000 as the exclusive upper bound.
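
To illustrate how a table of range starts can be turned into a per-codepoint answer, here is a minimal sketch. The table `demo_ranges_flags` and the function `demo_flags_for` are hypothetical names for this example; the entries are simplified ASCII values, not copied from the generated file. The sketch uses a binary search, which is one possible approach and not necessarily how `unicode.cpp` consumes the data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical miniature table shaped like unicode_ranges_flags:
// {first codepoint of a range, flags for that range}, sorted by codepoint.
// A range runs until the next entry's start. Values are simplified.
static const std::vector<std::pair<uint32_t, uint16_t>> demo_ranges_flags = {
    {0x000000, 0x0080},  // control characters
    {0x000020, 0x0008},  // separator (space)
    {0x000021, 0x0020},  // punctuation (simplified)
    {0x000030, 0x0002},  // number: '0'..'9'
    {0x00003A, 0x0020},  // punctuation (simplified)
    {0x000041, 0x0004},  // letter: 'A'..'Z'
    {0x00005B, 0x0020},  // punctuation (simplified)
    {0x110000, 0x0001},  // sentinel: undefined past the last codepoint
};

// Return the flags of the range that contains cpt.
uint16_t demo_flags_for(uint32_t cpt) {
    // The sketch assumes the table starts at codepoint 0, so every
    // codepoint falls inside some range.
    assert(demo_ranges_flags.front().first == 0);

    // Find the first entry that starts after cpt, then step back one entry.
    auto it = std::upper_bound(
        demo_ranges_flags.begin(), demo_ranges_flags.end(), cpt,
        [](uint32_t value, const std::pair<uint32_t, uint16_t> & entry) {
            return value < entry.first;
        });
    return std::prev(it)->second;
}
```

With this table, `demo_flags_for(0x5A)` ('Z') yields 0x0004 because 0x5A lies between the entry that starts at 0x41 and the one that starts at 0x5B. A consumer that needs constant-time lookups could instead expand the ranges into one array entry per codepoint at startup, trading memory for speed.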

Usage

This is a data-only file that provides the Unicode character properties required by the tokenization system. Because it is auto-generated, do not edit it by hand; re-run the generation script `scripts/gen-unicode-data.py` to update the data.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/unicode-data.cpp
Lines: 1-7034

Signature

// Codepoint range to category bitflags mapping
const std::initializer_list<std::pair<uint32_t, uint16_t>> unicode_ranges_flags;

// Whitespace codepoint set
const std::unordered_set<uint32_t> unicode_set_whitespace;

// Case conversion mappings
const std::unordered_map<uint32_t, uint32_t> unicode_map_lowercase;
const std::unordered_map<uint32_t, uint32_t> unicode_map_uppercase;

// NFD normalization data
const std::vector<std::pair<uint32_t, std::vector<uint32_t>>> unicode_ranges_nfd;

Import

#include "unicode-data.h"
#include <cstdint>
#include <vector>
#include <unordered_map>
#include <unordered_set>

I/O Contract

Inputs

Name	Type	Required	Description
(none)	-	-	This is a data-only file with no runtime inputs; data is generated by scripts/gen-unicode-data.py

Outputs

Name	Type	Description
unicode_ranges_flags	initializer_list<pair<uint32_t, uint16_t>>	Codepoint ranges with category bitflags (letter=0x0004, number=0x0002, punctuation=0x0020, etc.)
unicode_set_whitespace	unordered_set<uint32_t>	Set of Unicode whitespace codepoints
unicode_map_lowercase	unordered_map<uint32_t, uint32_t>	Uppercase-to-lowercase codepoint mappings
unicode_map_uppercase	unordered_map<uint32_t, uint32_t>	Lowercase-to-uppercase codepoint mappings
unicode_ranges_nfd	vector<pair<uint32_t, vector<uint32_t>>>	NFD canonical decomposition data
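
Because the flags are bit values, a single `uint16_t` can carry several categories at once, and a consumer tests them with a bitwise AND. The following sketch shows that pattern. Only the three bit values come from the table above; the enum and function names are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Category bits as listed in the Outputs table. The names are
// hypothetical labels for this sketch, not the project's identifiers.
enum demo_category : uint16_t {
    DEMO_NUMBER      = 0x0002,
    DEMO_LETTER      = 0x0004,
    DEMO_PUNCTUATION = 0x0020,
};

// True when the given category bit is set in flags.
bool demo_has(uint16_t flags, demo_category cat) {
    return (flags & cat) != 0;
}

// Build a space-separated description of the categories set in flags.
std::string demo_describe(uint16_t flags) {
    std::string out;
    if (demo_has(flags, DEMO_NUMBER))      { out += "number "; }
    if (demo_has(flags, DEMO_LETTER))      { out += "letter "; }
    if (demo_has(flags, DEMO_PUNCTUATION)) { out += "punctuation "; }
    if (out.empty()) {
        return "other";  // none of the three bits known to this sketch
    }
    out.pop_back();      // drop the trailing space
    return out;
}

// Usage example.
void demo_describe_example() {
    assert(demo_describe(0x0004) == "letter");
    assert(demo_describe(0x0002) == "number");
    assert(demo_describe(0x0000) == "other");
}
```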

Usage Examples

// These tables are consumed by functions in unicode.cpp rather than accessed directly.
// Example of how the flags table is read:

// Look up codepoint category flags
// unicode_ranges_flags: {0x000041, 0x0004} means the range starting at 0x41 ('A') has flag 0x0004 (letter)
// unicode_ranges_flags: {0x000030, 0x0002} means the range starting at 0x30 ('0') has flag 0x0002 (number)
// Each range extends up to, but not including, the start of the next entry.

// Case conversion
// unicode_map_lowercase: {0x0041, 0x0061}  ('A' -> 'a')
// unicode_map_uppercase: {0x0061, 0x0041}  ('a' -> 'A')
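
A case-mapping table only needs entries for codepoints that actually change, so a lookup usually falls back to the input when no entry exists. The sketch below shows that pattern together with a whitespace membership test. The containers `demo_map_lowercase` and `demo_set_whitespace` are hypothetical excerpts shaped like the tables on this page, and the helper functions are assumptions made for this example, not functions from `unicode.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical excerpt shaped like unicode_map_lowercase.
static const std::unordered_map<uint32_t, uint32_t> demo_map_lowercase = {
    {0x0041, 0x0061},  // 'A' -> 'a'
    {0x00C9, 0x00E9},  // U+00C9 -> U+00E9 (E with acute)
};

// Hypothetical excerpt shaped like unicode_set_whitespace.
static const std::unordered_set<uint32_t> demo_set_whitespace = {
    0x0009, 0x000A, 0x0020, 0x00A0, 0x3000,
};

// Lowercase a codepoint; codepoints without a mapping are returned unchanged.
uint32_t demo_tolower(uint32_t cpt) {
    const auto it = demo_map_lowercase.find(cpt);
    return it == demo_map_lowercase.end() ? cpt : it->second;
}

// True when the codepoint is listed in the whitespace set.
bool demo_is_whitespace(uint32_t cpt) {
    return demo_set_whitespace.count(cpt) > 0;
}

// Usage example.
void demo_case_example() {
    assert(demo_tolower(0x0041) == 0x0061);  // mapped
    assert(demo_tolower(0x0031) == 0x0031);  // '1' has no mapping
    assert(demo_is_whitespace(0x0020));
    assert(!demo_is_whitespace(0x0041));
}
```

Using `find` rather than `operator[]` matters here: the tables are `const`, and `operator[]` on an `unordered_map` would try to insert a missing key.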

Related Pages

Principle:Ggml_org_Llama_cpp_Tokenization
