Implementation:Ggml org Llama cpp Unicode Header

From Leeroopedia
Knowledge Sources
Domains Unicode, Text_Processing
Last Updated 2026-02-15 00:00 GMT

Overview

Declares Unicode text processing types and functions used by the tokenization system for multilingual text handling.

Description

This header defines the `unicode_cpt_flags` bitfield struct. Its category flags cover undefined, number, letter, separator, accent mark, punctuation, symbol, and control codepoints; its helper flags mark whitespace, lowercase, uppercase, and NFD. The struct supports both little-endian and big-endian byte orders. The header also declares functions for UTF-8 length calculation, codepoint encoding and decoding, codepoint collection from UTF-8 strings, NFD normalization, codepoint flag lookup, byte-level BPE encoding and decoding, case conversion, Han character detection, and regex-based text splitting.
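To make the relationship between the 16-bit flags word and the individual bitfields concrete, here is a minimal sketch. It is not the header's code: `sketch_cpt_flags` and the `SK_*` constants are hypothetical names, and the mask values are an assumption that simply follows the declaration order of the bitfields. Decoding through explicit masks, as done here, gives the same result on any byte order.

```cpp
#include <cstdint>

// Hypothetical mask constants, for illustration only. The values are ASSUMED
// to follow the declaration order of the bitfields (is_undefined first).
enum sketch_flag : uint16_t {
    SK_UNDEFINED       = 0x0001,
    SK_NUMBER          = 0x0002,
    SK_LETTER          = 0x0004,
    SK_SEPARATOR       = 0x0008,
    SK_ACCENT_MARK     = 0x0010,
    SK_PUNCTUATION     = 0x0020,
    SK_SYMBOL          = 0x0040,
    SK_CONTROL         = 0x0080,
    SK_MASK_CATEGORIES = 0x00FF,  // low byte: the eight category bits
    SK_WHITESPACE      = 0x0100,
    SK_LOWERCASE       = 0x0200,
    SK_UPPERCASE       = 0x0400,
    SK_NFD             = 0x0800,
};

// Hypothetical stand-in for unicode_cpt_flags.
struct sketch_cpt_flags {
    uint16_t is_undefined   : 1;
    uint16_t is_number      : 1;
    uint16_t is_letter      : 1;
    uint16_t is_separator   : 1;
    uint16_t is_accent_mark : 1;
    uint16_t is_punctuation : 1;
    uint16_t is_symbol      : 1;
    uint16_t is_control     : 1;
    uint16_t is_whitespace  : 1;
    uint16_t is_lowercase   : 1;
    uint16_t is_uppercase   : 1;
    uint16_t is_nfd         : 1;

    // Decode each bit with an explicit mask instead of relying on how the
    // compiler lays out bitfields in memory.
    explicit sketch_cpt_flags(uint16_t flags = 0)
        : is_undefined((flags & SK_UNDEFINED) != 0),
          is_number((flags & SK_NUMBER) != 0),
          is_letter((flags & SK_LETTER) != 0),
          is_separator((flags & SK_SEPARATOR) != 0),
          is_accent_mark((flags & SK_ACCENT_MARK) != 0),
          is_punctuation((flags & SK_PUNCTUATION) != 0),
          is_symbol((flags & SK_SYMBOL) != 0),
          is_control((flags & SK_CONTROL) != 0),
          is_whitespace((flags & SK_WHITESPACE) != 0),
          is_lowercase((flags & SK_LOWERCASE) != 0),
          is_uppercase((flags & SK_UPPERCASE) != 0),
          is_nfd((flags & SK_NFD) != 0) {}

    // Re-encode the bitfields into a single 16-bit word.
    uint16_t as_uint() const {
        return static_cast<uint16_t>(
            (is_undefined   ? SK_UNDEFINED   : 0) | (is_number      ? SK_NUMBER      : 0) |
            (is_letter      ? SK_LETTER      : 0) | (is_separator   ? SK_SEPARATOR   : 0) |
            (is_accent_mark ? SK_ACCENT_MARK : 0) | (is_punctuation ? SK_PUNCTUATION : 0) |
            (is_symbol      ? SK_SYMBOL      : 0) | (is_control     ? SK_CONTROL     : 0) |
            (is_whitespace  ? SK_WHITESPACE  : 0) | (is_lowercase   ? SK_LOWERCASE   : 0) |
            (is_uppercase   ? SK_UPPERCASE   : 0) | (is_nfd         ? SK_NFD         : 0));
    }

    // Keep only the category bits and drop the helper bits.
    uint16_t category_flag() const {
        return static_cast<uint16_t>(as_uint() & SK_MASK_CATEGORIES);
    }
};
```

With this sketch, `sketch_cpt_flags(SK_LETTER | SK_UPPERCASE).category_flag()` yields `SK_LETTER`, because the uppercase helper bit lies outside the assumed category mask.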

Usage

Use this header when implementing or extending tokenizer functionality that requires Unicode-aware text processing, such as codepoint classification, normalization, case conversion, or regex-based pre-tokenization splitting.

Code Reference

Source Location

Signature

struct unicode_cpt_flags {
    uint16_t is_undefined   : 1;
    uint16_t is_number      : 1;
    uint16_t is_letter      : 1;
    uint16_t is_separator   : 1;
    uint16_t is_accent_mark : 1;
    uint16_t is_punctuation : 1;
    uint16_t is_symbol      : 1;
    uint16_t is_control     : 1;
    uint16_t is_whitespace  : 1;
    uint16_t is_lowercase   : 1;
    uint16_t is_uppercase   : 1;
    uint16_t is_nfd         : 1;

    inline unicode_cpt_flags(const uint16_t flags = 0);
    inline uint16_t as_uint() const;
    inline uint16_t category_flag() const;
};

size_t unicode_len_utf8(char src);
std::string unicode_cpt_to_utf8(uint32_t cpt);
uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset);
std::vector<uint32_t> unicode_cpts_from_utf8(const std::string & utf8);
std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t> & cpts);
unicode_cpt_flags unicode_cpt_flags_from_cpt(uint32_t cpt);
unicode_cpt_flags unicode_cpt_flags_from_utf8(const std::string & utf8);
std::string unicode_byte_to_utf8(uint8_t byte);
uint8_t unicode_utf8_to_byte(const std::string & utf8);
uint32_t unicode_tolower(uint32_t cpt);
bool unicode_cpt_is_han(uint32_t cpt);
std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regex_exprs);

Import

#include "unicode.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| src | char | Yes | First byte of a UTF-8 sequence (for unicode_len_utf8) |
| cpt | uint32_t | Yes | Unicode codepoint value |
| utf8 | const std::string & | Yes | UTF-8 encoded string |
| offset | size_t & | Yes | Current read offset into the UTF-8 string (updated in place) |
| cpts | const std::vector<uint32_t> & | Yes | Vector of codepoints for NFD normalization |
| byte | uint8_t | Yes | Single byte value for BPE byte-level encoding |
| text | const std::string & | Yes | Input text for regex splitting |
| regex_exprs | const std::vector<std::string> & | Yes | List of regex patterns for text splitting |
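The `src` and `offset` inputs are easier to picture with a small decoder. The following is a standalone sketch of how a lead byte can determine the sequence length and how an offset can be advanced in place while decoding. The `sketch_*` functions are hypothetical stand-ins, not the library's implementation, and the error handling (throwing `std::invalid_argument`) is an assumption made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for unicode_len_utf8: derive the sequence length
// from the high four bits of the lead byte. Continuation bytes (0x80..0xBF)
// map to 1 in this table, so callers should validate the lead byte themselves.
size_t sketch_len_utf8(char src) {
    static const size_t lookup[16] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4};
    const uint8_t high = static_cast<uint8_t>(src) >> 4;
    return lookup[high];
}

// Hypothetical stand-in for unicode_cpt_from_utf8: decode one codepoint at
// `offset` and advance `offset` past it. Overlong encodings are not rejected
// in this sketch.
uint32_t sketch_cpt_from_utf8(const std::string & utf8, size_t & offset) {
    if (offset >= utf8.size()) {
        throw std::invalid_argument("offset out of range");
    }
    const uint8_t lead = static_cast<uint8_t>(utf8[offset]);
    if ((lead >= 0x80 && lead < 0xC0) || lead >= 0xF8) {
        throw std::invalid_argument("invalid lead byte");
    }
    const size_t len = sketch_len_utf8(utf8[offset]);
    if (offset + len > utf8.size()) {
        throw std::invalid_argument("truncated sequence");
    }
    // Payload bits carried by the lead byte for lengths 1 to 4.
    static const uint8_t lead_mask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
    uint32_t cpt = lead & lead_mask[len];
    for (size_t i = 1; i < len; ++i) {
        const uint8_t c = static_cast<uint8_t>(utf8[offset + i]);
        if ((c & 0xC0) != 0x80) {
            throw std::invalid_argument("invalid continuation byte");
        }
        cpt = (cpt << 6) | (c & 0x3F);  // each continuation byte adds 6 bits
    }
    offset += len;
    return cpt;
}

// Hypothetical stand-in for unicode_cpts_from_utf8: decode the whole string.
std::vector<uint32_t> sketch_cpts_from_utf8(const std::string & utf8) {
    std::vector<uint32_t> out;
    size_t offset = 0;
    while (offset < utf8.size()) {
        out.push_back(sketch_cpt_from_utf8(utf8, offset));
    }
    return out;
}
```

Passing the offset by reference lets a caller walk a string one codepoint at a time without tracking byte lengths separately, which is the pattern the `offset` parameter in the table above suggests.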

Outputs

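| Name | Type | Description |
| --- | --- | --- |
| unicode_len_utf8 | size_t | Number of bytes in the UTF-8 character starting with src |
| unicode_cpt_to_utf8 | std::string | UTF-8 encoded string for the given codepoint |
| unicode_cpt_from_utf8 | uint32_t | Decoded codepoint from the UTF-8 string at offset |
| unicode_cpts_from_utf8 | std::vector<uint32_t> | All codepoints decoded from a UTF-8 string |
| unicode_cpts_normalize_nfd | std::vector<uint32_t> | Codepoints after NFD normalization |
| unicode_cpt_flags_from_cpt | unicode_cpt_flags | Category and property flags for a codepoint |
| unicode_cpt_flags_from_utf8 | unicode_cpt_flags | Category and property flags for a codepoint given as a UTF-8 string |
| unicode_byte_to_utf8 | std::string | UTF-8 string that represents the byte in byte-level BPE encoding |
| unicode_utf8_to_byte | uint8_t | Byte decoded from its byte-level BPE UTF-8 representation |
| unicode_tolower | uint32_t | Lowercase form of the codepoint |
| unicode_cpt_is_han | bool | Whether the codepoint is a Han character |
| unicode_regex_split | std::vector<std::string> | Text segments split according to regex patterns |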
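For the encoding direction (`unicode_cpt_to_utf8`), the standard UTF-8 bit layout can be sketched as follows. `sketch_cpt_to_utf8` is a hypothetical name, and throwing on out-of-range input is an assumption of this example rather than documented behavior of the header. Surrogate codepoints are not rejected here.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for unicode_cpt_to_utf8: encode one codepoint using
// the standard 1- to 4-byte UTF-8 layout.
std::string sketch_cpt_to_utf8(uint32_t cpt) {
    std::string out;
    if (cpt <= 0x7F) {
        // 0xxxxxxx
        out.push_back(static_cast<char>(cpt));
    } else if (cpt <= 0x7FF) {
        // 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cpt >> 6)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else if (cpt <= 0xFFFF) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cpt >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else if (cpt <= 0x10FFFF) {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cpt >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else {
        throw std::invalid_argument("codepoint out of Unicode range");
    }
    return out;
}
```

For example, U+00E9 falls in the two-byte range and encodes to the bytes `C3 A9`, while U+20AC needs three bytes, `E2 82 AC`.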

Usage Examples

#include "unicode.h"

#include <cstdint>
#include <string>
#include <vector>

int main() {
    // Decode a UTF-8 string into codepoints
    const std::vector<uint32_t> cpts = unicode_cpts_from_utf8("Hello");

    // Check codepoint properties of the first codepoint ('H')
    const unicode_cpt_flags flags = unicode_cpt_flags_from_cpt(cpts[0]);
    if (flags.is_letter && flags.is_uppercase) {
        // uppercase letter
    }

    // Normalize to NFD
    const auto nfd = unicode_cpts_normalize_nfd(cpts);

    // Regex-based text splitting for tokenization
    const auto segments = unicode_regex_split("Hello world!", {"\\w+", "\\s+"});

    (void) nfd;
    (void) segments;
    return 0;
}
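The byte-level BPE helpers (`unicode_byte_to_utf8`, `unicode_utf8_to_byte`) do not appear in the example above. The sketch below illustrates the idea behind such a mapping under a stated assumption: it uses the GPT-2 style byte-to-codepoint table, in which printable bytes map to themselves and all remaining bytes are shifted to codepoints starting at 256. Whether the header uses exactly this table is not confirmed by this page, and the `sketch_*` names are hypothetical. For brevity the sketch works with codepoints rather than UTF-8 strings.

```cpp
#include <cstdint>

// ASSUMPTION: GPT-2 style table. Bytes in these ranges are kept as-is because
// they are printable and non-whitespace in Latin-1.
static bool sketch_is_kept_byte(int b) {
    return (b >= 0x21 && b <= 0x7E) ||
           (b >= 0xA1 && b <= 0xAC) ||
           (b >= 0xAE && b <= 0xFF);
}

// Map a raw byte to the codepoint that represents it in the vocabulary.
// Bytes outside the kept ranges are assigned 256, 257, ... in ascending
// byte order.
uint32_t sketch_byte_to_cpt(uint8_t byte) {
    if (sketch_is_kept_byte(byte)) {
        return byte;
    }
    uint32_t n = 0;  // number of remapped bytes smaller than `byte`
    for (int b = 0; b < byte; ++b) {
        if (!sketch_is_kept_byte(b)) {
            ++n;
        }
    }
    return 256 + n;
}

// Inverse mapping: returns the original byte, or -1 if the codepoint is not
// part of the table. A linear scan keeps the sketch short; a lookup table
// would be the usual choice in real code.
int sketch_cpt_to_byte(uint32_t cpt) {
    for (int b = 0; b < 256; ++b) {
        if (sketch_byte_to_cpt(static_cast<uint8_t>(b)) == cpt) {
            return b;
        }
    }
    return -1;
}
```

Under this assumed table the space byte `0x20` maps to U+0120 and the newline byte `0x0A` maps to U+010A, so every byte has a visible, non-whitespace representation that a BPE vocabulary can store as ordinary text.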
