Implementation:Ggml org Llama cpp Unicode Header
| Knowledge Sources | |
|---|---|
| Domains | Unicode, Text_Processing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares Unicode text processing types and functions used by the tokenization system for multilingual text handling.
Description
This header defines the `unicode_cpt_flags` bitfield struct, which packs codepoint category flags (undefined, number, letter, separator, accent mark, punctuation, symbol, control) and helper flags (whitespace, lowercase, uppercase, NFD) into a 16-bit value and supports both little-endian and big-endian byte orders. It also declares functions for UTF-8 length calculation, codepoint encoding and decoding, codepoint collection from UTF-8 strings, NFD normalization, codepoint flag lookup, byte-level BPE encoding and decoding, lowercase conversion, Han character detection, and regex-based text splitting.
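To make the flag layout concrete, here is a minimal standalone sketch of packing category and helper flags into 16 bits. It does not include `unicode.h`. The constant names, the two helper functions, and the bit positions (taken from the field order of the struct) are assumptions for illustration, not the header's own definitions.

```cpp
#include <cstdint>

// Hypothetical bit masks; positions follow the order in which the
// bitfields are declared in unicode_cpt_flags (an assumption).
constexpr uint16_t FLAG_UNDEFINED   = 0x0001;
constexpr uint16_t FLAG_NUMBER      = 0x0002;
constexpr uint16_t FLAG_LETTER      = 0x0004;
constexpr uint16_t FLAG_SEPARATOR   = 0x0008;
constexpr uint16_t FLAG_ACCENT_MARK = 0x0010;
constexpr uint16_t FLAG_PUNCTUATION = 0x0020;
constexpr uint16_t FLAG_SYMBOL      = 0x0040;
constexpr uint16_t FLAG_CONTROL     = 0x0080;
constexpr uint16_t MASK_CATEGORIES  = 0x00FF;  // the eight category bits
constexpr uint16_t FLAG_WHITESPACE  = 0x0100;
constexpr uint16_t FLAG_LOWERCASE   = 0x0200;
constexpr uint16_t FLAG_UPPERCASE   = 0x0400;
constexpr uint16_t FLAG_NFD         = 0x0800;

// Keep only the category bits and drop the helper bits.
inline uint16_t category_bits(uint16_t flags) {
    return static_cast<uint16_t>(flags & MASK_CATEGORIES);
}

// Test whether a single flag is set.
inline bool has_flag(uint16_t flags, uint16_t flag) {
    return (flags & flag) != 0;
}
```

Working on an explicit mask like this is independent of byte order, which is one way to get the same result on little-endian and big-endian targets. The `category_flag()` member presumably serves a similar purpose to `category_bits` here.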
Usage
Use this header when implementing or extending tokenizer functionality that requires Unicode-aware text processing, such as codepoint classification, NFD normalization, lowercase conversion, or regex-based pre-tokenization splitting.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/unicode.h
- Lines: 1-111
Signature
struct unicode_cpt_flags {
    uint16_t is_undefined   : 1;
    uint16_t is_number      : 1;
    uint16_t is_letter      : 1;
    uint16_t is_separator   : 1;
    uint16_t is_accent_mark : 1;
    uint16_t is_punctuation : 1;
    uint16_t is_symbol      : 1;
    uint16_t is_control     : 1;
    uint16_t is_whitespace  : 1;
    uint16_t is_lowercase   : 1;
    uint16_t is_uppercase   : 1;
    uint16_t is_nfd         : 1;
    inline unicode_cpt_flags(const uint16_t flags = 0);
    inline uint16_t as_uint() const;
    inline uint16_t category_flag() const;
};
size_t unicode_len_utf8(char src);
std::string unicode_cpt_to_utf8(uint32_t cpt);
uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset);
std::vector<uint32_t> unicode_cpts_from_utf8(const std::string & utf8);
std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t> & cpts);
unicode_cpt_flags unicode_cpt_flags_from_cpt(uint32_t cpt);
unicode_cpt_flags unicode_cpt_flags_from_utf8(const std::string & utf8);
std::string unicode_byte_to_utf8(uint8_t byte);
uint8_t unicode_utf8_to_byte(const std::string & utf8);
uint32_t unicode_tolower(uint32_t cpt);
bool unicode_cpt_is_han(uint32_t cpt);
std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regex_exprs);
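As an illustration of what a function with the shape of `unicode_len_utf8` has to compute, the sketch below derives the length of a UTF-8 sequence from its first byte using a 16-entry table indexed by the high nibble. The name `utf8_len_sketch` is hypothetical, and this page does not confirm that the real function uses this technique.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sequence length implied by a UTF-8 lead byte.
// High nibble 0x0..0x7 -> ASCII (1 byte), 0xC..0xD -> 2, 0xE -> 3, 0xF -> 4.
// Continuation bytes (0x8..0xB) are not valid lead bytes; this sketch
// reports 1 for them so that a caller always makes forward progress.
inline std::size_t utf8_len_sketch(char src) {
    static const std::size_t lookup[16] = {1, 1, 1, 1, 1, 1, 1, 1,
                                           1, 1, 1, 1, 2, 2, 3, 4};
    const uint8_t high_nibble = static_cast<uint8_t>(src) >> 4;
    return lookup[high_nibble];
}
```

The cast to `uint8_t` matters because `char` may be signed, in which case bytes above 0x7F would otherwise produce a negative index.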
Import
#include "unicode.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| src | char | Yes | First byte of a UTF-8 sequence (for unicode_len_utf8) |
| cpt | uint32_t | Yes | Unicode codepoint value |
| utf8 | const std::string & | Yes | UTF-8 encoded string |
| offset | size_t & | Yes | Current read offset into the UTF-8 string (updated in place) |
| cpts | const std::vector<uint32_t> & | Yes | Vector of codepoints for NFD normalization |
| byte | uint8_t | Yes | Single byte value for BPE byte-level encoding |
| text | const std::string & | Yes | Input text for regex splitting |
| regex_exprs | const std::vector<std::string> & | Yes | List of regex patterns for text splitting |
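The `offset` parameter is read and then advanced in place, so a caller can decode a whole string by calling the decoder in a loop. The standalone sketch below shows that pattern. The names `decode_one_sketch` and `decode_all_sketch` are hypothetical, the error handling is an assumption, and the decoder does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical decoder: reads one codepoint at `offset` and advances `offset`
// past the bytes it consumed.
inline uint32_t decode_one_sketch(const std::string & utf8, std::size_t & offset) {
    const auto byte_at = [&](std::size_t i) -> uint32_t {
        if (i >= utf8.size()) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        return static_cast<uint8_t>(utf8[i]);
    };

    const uint32_t b0 = byte_at(offset);
    std::size_t len = 0;
    uint32_t    cpt = 0;
    if (b0 < 0x80)                { len = 1; cpt = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cpt = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cpt = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cpt = b0 & 0x07; }
    else {
        throw std::invalid_argument("invalid UTF-8 lead byte");
    }

    // Each continuation byte contributes its low six bits.
    for (std::size_t i = 1; i < len; ++i) {
        const uint32_t b = byte_at(offset + i);
        if ((b & 0xC0) != 0x80) {
            throw std::invalid_argument("invalid UTF-8 continuation byte");
        }
        cpt = (cpt << 6) | (b & 0x3F);
    }

    offset += len;
    return cpt;
}

// Decode every codepoint by repeatedly advancing the shared offset.
inline std::vector<uint32_t> decode_all_sketch(const std::string & utf8) {
    std::vector<uint32_t> cpts;
    std::size_t offset = 0;
    while (offset < utf8.size()) {
        cpts.push_back(decode_one_sketch(utf8, offset));
    }
    return cpts;
}
```

Passing the offset by reference keeps the single-codepoint decoder free of allocation and lets the caller stop early or resume at any position.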
Outputs
| Name | Type | Description |
|---|---|---|
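| unicode_len_utf8 | size_t | Length in bytes of the UTF-8 sequence whose first byte is src |
| unicode_cpt_to_utf8 | std::string | UTF-8 encoded string for the given codepoint |
| unicode_cpt_from_utf8 | uint32_t | Codepoint decoded from the UTF-8 string at offset |
| unicode_cpts_from_utf8 | std::vector<uint32_t> | All codepoints decoded from a UTF-8 string |
| unicode_cpts_normalize_nfd | std::vector<uint32_t> | Codepoints after NFD normalization |
| unicode_cpt_flags_from_cpt | unicode_cpt_flags | Category and property flags for a codepoint |
| unicode_cpt_flags_from_utf8 | unicode_cpt_flags | Category and property flags for a codepoint given as UTF-8 |
| unicode_byte_to_utf8 | std::string | UTF-8 string that represents the byte in byte-level BPE |
| unicode_utf8_to_byte | uint8_t | Byte recovered from its byte-level BPE UTF-8 representation |
| unicode_tolower | uint32_t | Lowercase form of the codepoint |
| unicode_cpt_is_han | bool | Whether the codepoint is a Han character |
| unicode_regex_split | std::vector<std::string> | Text segments split according to regex patterns |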
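For the encoding direction, the following standalone sketch shows how a codepoint maps to one to four UTF-8 bytes. The name `encode_cpt_sketch` and the exception on out-of-range input are assumptions for illustration; the behavior of `unicode_cpt_to_utf8` on invalid input is not described on this page.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical encoder: standard UTF-8 bit layout for a single codepoint.
inline std::string encode_cpt_sketch(uint32_t cpt) {
    std::string out;
    if (cpt <= 0x7F) {
        out.push_back(static_cast<char>(cpt));
    } else if (cpt <= 0x7FF) {
        out.push_back(static_cast<char>(0xC0 | (cpt >> 6)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else if (cpt <= 0xFFFF) {
        out.push_back(static_cast<char>(0xE0 | (cpt >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else if (cpt <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cpt >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else {
        throw std::invalid_argument("codepoint out of Unicode range");
    }
    return out;
}
```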
Usage Examples
#include "unicode.h"

#include <string>
#include <vector>

int main() {
    // Decode a UTF-8 string into codepoints
    const std::vector<uint32_t> cpts = unicode_cpts_from_utf8("Hello");

    // Check the properties of the first codepoint ('H')
    const unicode_cpt_flags flags = unicode_cpt_flags_from_cpt(cpts[0]);
    if (flags.is_letter && flags.is_uppercase) {
        // uppercase letter
    }

    // Normalize to NFD
    const auto nfd = unicode_cpts_normalize_nfd(cpts);

    // Regex-based text splitting for tokenization
    const auto segments = unicode_regex_split("Hello world!", {"\\w+", "\\s+"});

    return 0;
}
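To illustrate the general idea of regex-based pre-tokenization splitting without depending on `unicode.h`, the sketch below splits text with a single `std::regex` pattern and keeps both the matches and the text between them as separate segments. The name `split_keep_gaps_sketch` is hypothetical. Keeping unmatched text is an assumption, and `std::regex` here works on bytes, so it is not a substitute for the Unicode-aware `unicode_regex_split`.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical splitter: every regex match becomes a segment, and any text
// between matches (or after the last match) becomes a segment as well.
inline std::vector<std::string> split_keep_gaps_sketch(const std::string & text,
                                                       const std::string & pattern) {
    const std::regex re(pattern);
    std::vector<std::string> out;
    std::size_t pos = 0;  // end of the last emitted segment

    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::size_t start = static_cast<std::size_t>(it->position());
        const std::size_t len   = static_cast<std::size_t>(it->length());
        if (len == 0) {
            continue;  // ignore empty matches
        }
        if (start > pos) {
            out.push_back(text.substr(pos, start - pos));  // gap before the match
        }
        out.push_back(it->str());
        pos = start + len;
    }
    if (pos < text.size()) {
        out.push_back(text.substr(pos));  // trailing unmatched text
    }
    return out;
}
```

Concatenating the returned segments reproduces the input, which is a useful invariant for a pre-tokenizer because no characters are lost before the tokens are looked up.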