Implementation:Ggml org Llama cpp Unicode

From Leeroopedia
Knowledge Sources
Domains Unicode, Tokenization
Last Updated 2026-02-15 00:00 GMT

Overview

Implements Unicode text processing utilities required by the tokenization system, including UTF-8 encoding/decoding, codepoint classification, normalization, and regex-based text splitting.

Description

This file provides UTF-8 byte-level encoding/decoding (`unicode_cpt_to_utf8`, `unicode_cpt_from_utf8`, `unicode_len_utf8`), codepoint property lookup using the unicode-data tables (flags for letter, number, whitespace, etc.), NFD normalization, case conversion via lookup tables, Han character detection for CJK ranges, and a `unicode_regex_split` function that splits text according to regex patterns. It includes optimized fast paths for common BPE tokenizer regexes (the GPT-2 pattern, and the GPT-4-style pattern also used by Llama 3), as well as custom patterns for the Kimi-K2 and AFMoE models. It also implements the byte-to-UTF-8 fallback encoding used by byte-level BPE tokenizers.
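To make the encoding side concrete, here is a standalone sketch of the standard codepoint-to-UTF-8 algorithm that a function like `unicode_cpt_to_utf8` has to implement. This is illustrative only (the helper name `cpt_to_utf8_sketch` is ours), not the llama.cpp source:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Standalone sketch of codepoint -> UTF-8 encoding (the standard algorithm,
// not the llama.cpp implementation): emits 1-4 bytes depending on range.
static std::string cpt_to_utf8_sketch(uint32_t cpt) {
    std::string out;
    if (cpt < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cpt);
    } else if (cpt < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cpt >> 6));
        out += static_cast<char>(0x80 | (cpt & 0x3F));
    } else if (cpt < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cpt >> 12));
        out += static_cast<char>(0x80 | ((cpt >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cpt & 0x3F));
    } else if (cpt < 0x110000) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cpt >> 18));
        out += static_cast<char>(0x80 | ((cpt >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cpt >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cpt & 0x3F));
    } else {
        throw std::invalid_argument("invalid Unicode codepoint");
    }
    return out;
}
```

For example, U+00E9 ("é") encodes to the two bytes 0xC3 0xA9, and U+4E2D ("中") to the three bytes 0xE4 0xB8 0xAD.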

Usage

Use this module as the Unicode processing layer for tokenization. All tokenizer implementations (BPE, SentencePiece, WordPiece) rely on these utilities for text preprocessing, normalization, and splitting.

Code Reference

Source Location

Signature

// UTF-8 encoding/decoding
size_t unicode_len_utf8(char src);
std::string unicode_cpt_to_utf8(uint32_t cpt);
uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset);
std::vector<uint32_t> unicode_cpts_from_utf8(const std::string & utf8);

// Normalization and case conversion
std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t> & cpts);
uint32_t unicode_tolower(uint32_t cpt);

// Character classification
bool unicode_cpt_is_han(uint32_t cpt);

// Byte-level BPE support
std::string unicode_byte_to_utf8(uint8_t byte);

// Regex-based text splitting
std::vector<std::string> unicode_regex_split(
    const std::string & text, const std::vector<std::string> & regex_exprs);

// Optimized custom regex split implementations
static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & text, const std::vector<size_t> & offsets);
static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string & text, const std::vector<size_t> & offsets);
static std::vector<size_t> unicode_regex_split_custom_kimi_k2(const std::string & text, const std::vector<size_t> & offsets);
static std::vector<size_t> unicode_regex_split_custom_afmoe(const std::string & text, const std::vector<size_t> & offsets);
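The custom split functions above share an offset-based interface: `offsets` describes the lengths of the current text segments, and the return value describes the finer segments produced by one splitting pass. The sketch below illustrates that shape under our assumptions (byte lengths rather than codepoint lengths, and a toy "runs of spaces vs. non-spaces" rule standing in for the real tokenizer regexes); it is not the llama.cpp logic:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hedged sketch of an offset-based splitting pass: `offsets` holds the
// lengths of the current segments of `text` (assumed here to be byte
// lengths), and the result holds the lengths of the finer segments after
// splitting each segment into runs of spaces vs. non-spaces.
static std::vector<size_t> split_custom_sketch(const std::string & text,
                                               const std::vector<size_t> & offsets) {
    std::vector<size_t> out;
    size_t start = 0;
    for (size_t len : offsets) {
        size_t run = 0;
        for (size_t i = 0; i < len; ++i) {
            const bool is_space   = text[start + i] == ' ';
            const bool prev_space = run > 0 && text[start + i - 1] == ' ';
            if (run > 0 && is_space != prev_space) { // run boundary: flush
                out.push_back(run);
                run = 0;
            }
            ++run;
        }
        if (run > 0) out.push_back(run); // flush the final run of this segment
        start += len;
    }
    return out;
}
```

Working on segment lengths instead of materialized substrings lets successive passes refine the segmentation without copying the text.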

Import

#include "unicode.h"
#include "unicode-data.h"
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <regex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

I/O Contract

Inputs

Name Type Required Description
src char Yes Single byte for determining UTF-8 sequence length
cpt uint32_t Yes Unicode codepoint for encoding to UTF-8 or classification
utf8 const std::string & Yes UTF-8 encoded string for decoding or splitting
regex_exprs const std::vector<std::string> & Yes Regex patterns for text splitting (e.g., GPT-2/GPT-4 tokenizer patterns)
offset size_t & Yes Current byte offset in UTF-8 string for incremental decoding

Outputs

Name Type Description
len size_t Number of bytes in a UTF-8 sequence starting with the given byte
utf8_string std::string UTF-8 encoded string from a codepoint
codepoint uint32_t Decoded Unicode codepoint from UTF-8 bytes
codepoints std::vector<uint32_t> All decoded codepoints from a UTF-8 string
split_result std::vector<std::string> Text segments after regex-based splitting
is_han bool Whether a codepoint is a CJK/Han character
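The `offset` input above is an in/out parameter: the decoder reads one codepoint starting at `utf8[offset]` and advances `offset` past it, so repeated calls walk the string. The following standalone sketch (not the llama.cpp implementation; it omits validation of continuation bytes) shows that contract:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch of the incremental-decode contract: decode one codepoint starting
// at `offset`, then advance `offset` past its bytes. Illustrative only.
static uint32_t cpt_from_utf8_sketch(const std::string & utf8, size_t & offset) {
    const uint8_t b0 = static_cast<uint8_t>(utf8.at(offset));
    if (b0 < 0x80) {                 // 1-byte (ASCII)
        offset += 1;
        return b0;
    }
    if ((b0 & 0xE0) == 0xC0) {       // 2-byte sequence
        const uint32_t cpt = (b0 & 0x1F) << 6
                           | (static_cast<uint8_t>(utf8.at(offset + 1)) & 0x3F);
        offset += 2;
        return cpt;
    }
    if ((b0 & 0xF0) == 0xE0) {       // 3-byte sequence
        const uint32_t cpt = (b0 & 0x0F) << 12
                           | (static_cast<uint8_t>(utf8.at(offset + 1)) & 0x3F) << 6
                           | (static_cast<uint8_t>(utf8.at(offset + 2)) & 0x3F);
        offset += 3;
        return cpt;
    }
    if ((b0 & 0xF8) == 0xF0) {       // 4-byte sequence
        const uint32_t cpt = (b0 & 0x07) << 18
                           | (static_cast<uint8_t>(utf8.at(offset + 1)) & 0x3F) << 12
                           | (static_cast<uint8_t>(utf8.at(offset + 2)) & 0x3F) << 6
                           | (static_cast<uint8_t>(utf8.at(offset + 3)) & 0x3F);
        offset += 4;
        return cpt;
    }
    throw std::invalid_argument("invalid UTF-8 lead byte");
}
```

Decoding `"A\xC3\xA9"` this way yields U+0041 with `offset` advanced to 1, then U+00E9 with `offset` advanced to 3.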

Usage Examples

// Determine UTF-8 byte length
size_t len = unicode_len_utf8('\xC3'); // returns 2 (2-byte sequence)

// Encode codepoint to UTF-8
std::string utf8 = unicode_cpt_to_utf8(0x00E9); // returns "é" (the two bytes 0xC3 0xA9)

// Decode UTF-8 to codepoints
std::vector<uint32_t> cpts = unicode_cpts_from_utf8("Hello");

// NFD normalization
auto normalized = unicode_cpts_normalize_nfd(cpts);

// Regex-based splitting for BPE tokenization
std::vector<std::string> tokens = unicode_regex_split(
    "Hello, world!", {"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+"});

// CJK character detection
bool is_cjk = unicode_cpt_is_han(0x4E2D); // true for Chinese character
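The byte-to-UTF-8 fallback (`unicode_byte_to_utf8`) gives every raw byte a printable, reversible representation. A common scheme for this, used by GPT-2-style byte-level BPE and assumed in the sketch below (which is not the llama.cpp source), maps printable bytes to their own codepoints and remaps the remaining bytes, in ascending order, to 0x100, 0x101, ...:

```cpp
#include <cstdint>

// Sketch of a GPT-2-style byte-to-unicode mapping (assumed scheme):
// printable bytes keep their own codepoint; every other byte is remapped,
// in ascending byte order, to 0x100 + its rank among non-printable bytes.
static uint32_t byte_to_cpt_sketch(uint8_t byte) {
    auto printable = [](int b) {
        return (b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xAC) ||
               (b >= 0xAE && b <= 0xFF);
    };
    if (printable(byte)) {
        return byte;
    }
    uint32_t n = 0; // rank of `byte` among the non-printable bytes below it
    for (int b = 0; b < byte; ++b) {
        if (!printable(b)) {
            ++n;
        }
    }
    return 0x100 + n;
}
```

Under this scheme the space byte 0x20 maps to U+0120 ("Ġ"), which is why GPT-2-style vocabularies spell leading spaces that way; encoding the returned codepoint with `unicode_cpt_to_utf8` yields the string form.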
