Implementation: ggml-org/llama.cpp Unicode
| Knowledge Sources | |
|---|---|
| Domains | Unicode, Tokenization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements Unicode text processing utilities required by the tokenization system, including UTF-8 encoding/decoding, codepoint classification, normalization, and regex-based text splitting.
Description
This file provides UTF-8 byte-level encoding/decoding (`unicode_cpt_to_utf8`, `unicode_cpt_from_utf8`, `unicode_len_utf8`), codepoint property lookup against the unicode-data tables (flags for letter, number, whitespace, etc.), NFD normalization, case conversion via lookup tables, Han character detection for CJK ranges, and a `unicode_regex_split` function that splits text according to regex patterns. It includes optimized fast paths for common BPE tokenizer patterns such as the GPT-2 and GPT-4 regexes, as well as custom patterns for the Kimi-K2 and AFMoE models, and it implements the byte-to-UTF-8 fallback encoding used by byte-level BPE tokenizers.
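The core encode/decode round trip can be sketched as a small self-contained re-implementation. This is illustrative only (simplified error handling, no overlong/truncation checks on decode); `cpt_to_utf8` and `cpt_from_utf8` are hypothetical names, not the library's:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode a Unicode codepoint as a 1-4 byte UTF-8 sequence.
static std::string cpt_to_utf8(uint32_t cpt) {
    std::string out;
    if (cpt <= 0x7F) {
        out += char(cpt);
    } else if (cpt <= 0x7FF) {
        out += char(0xC0 | (cpt >> 6));
        out += char(0x80 | (cpt & 0x3F));
    } else if (cpt <= 0xFFFF) {
        out += char(0xE0 |  (cpt >> 12));
        out += char(0x80 | ((cpt >>  6) & 0x3F));
        out += char(0x80 |  (cpt        & 0x3F));
    } else if (cpt <= 0x10FFFF) {
        out += char(0xF0 |  (cpt >> 18));
        out += char(0x80 | ((cpt >> 12) & 0x3F));
        out += char(0x80 | ((cpt >>  6) & 0x3F));
        out += char(0x80 |  (cpt        & 0x3F));
    } else {
        throw std::invalid_argument("codepoint out of range");
    }
    return out;
}

// Decode the codepoint starting at `offset`, advancing `offset` past it
// (mirrors the incremental-decoding shape of unicode_cpt_from_utf8).
static uint32_t cpt_from_utf8(const std::string & s, size_t & offset) {
    const uint8_t b = uint8_t(s[offset]);
    if (b < 0x80) { offset += 1; return b; }
    if ((b & 0xE0) == 0xC0) {
        const uint32_t cpt = ((b & 0x1F) << 6) | (uint8_t(s[offset + 1]) & 0x3F);
        offset += 2; return cpt;
    }
    if ((b & 0xF0) == 0xE0) {
        const uint32_t cpt = ((b & 0x0F) << 12) | ((uint8_t(s[offset + 1]) & 0x3F) << 6)
                           |  (uint8_t(s[offset + 2]) & 0x3F);
        offset += 3; return cpt;
    }
    const uint32_t cpt = ((b & 0x07) << 18) | ((uint8_t(s[offset + 1]) & 0x3F) << 12)
                       | ((uint8_t(s[offset + 2]) & 0x3F) << 6) | (uint8_t(s[offset + 3]) & 0x3F);
    offset += 4; return cpt;
}
```

The real functions additionally validate malformed input; the bit layout (lead byte marks length, continuation bytes carry 6 bits each) is the same.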
Usage
Use this module as the Unicode processing layer for tokenization. All tokenizer implementations (BPE, SentencePiece, WordPiece) rely on these utilities for text preprocessing, normalization, and splitting.
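For byte-level BPE in particular, the fallback must give every byte a visible, reversible codepoint. A sketch of the GPT-2-style mapping is below; `unicode_byte_to_utf8` builds the equivalent table once at startup, so treat this per-call recomputation (and the helper name `byte_fallback_cpt`) as illustrative:

```cpp
#include <cstdint>

// GPT-2-style byte fallback: printable bytes map to their own codepoint,
// all remaining bytes are shifted into the U+0100+ range in order.
static uint32_t byte_fallback_cpt(uint8_t byte) {
    auto printable = [](int b) {
        return (b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
    };
    if (printable(byte)) {
        return byte;
    }
    uint32_t n = 0; // index of `byte` among the non-printable bytes
    for (int b = 0; b < byte; ++b) {
        if (!printable(b)) n++;
    }
    return 0x100 + n;
}
```

Under this scheme the space byte 0x20 becomes U+0120 ('Ġ'), which is why GPT-2-family vocabularies show 'Ġ'-prefixed tokens for words preceded by a space.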
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: src/unicode.cpp
- Lines: 1-1097
Signature
// UTF-8 encoding/decoding
size_t unicode_len_utf8(char src);
std::string unicode_cpt_to_utf8(uint32_t cpt);
uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset);
std::vector<uint32_t> unicode_cpts_from_utf8(const std::string & utf8);
// Normalization and case conversion
std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t> & cpts);
uint32_t unicode_tolower(uint32_t cpt);
// Character classification
bool unicode_cpt_is_han(uint32_t cpt);
// Byte-level BPE support
std::string unicode_byte_to_utf8(uint8_t byte);
// Regex-based text splitting
std::vector<std::string> unicode_regex_split(
const std::string & text, const std::vector<std::string> & regex_exprs);
// Optimized custom regex split implementations
static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & text, const std::vector<size_t> & offsets);
static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string & text, const std::vector<size_t> & offsets);
static std::vector<size_t> unicode_regex_split_custom_kimi_k2(const std::string & text, const std::vector<size_t> & offsets);
static std::vector<size_t> unicode_regex_split_custom_afmoe(const std::string & text, const std::vector<size_t> & offsets);
Import
#include "unicode.h"
#include "unicode-data.h"
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <regex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| src | char | Yes | Single byte for determining UTF-8 sequence length |
| cpt | uint32_t | Yes | Unicode codepoint for encoding to UTF-8 or classification |
| utf8 | const std::string & | Yes | UTF-8 encoded string for decoding or splitting |
| regex_exprs | const std::vector<std::string> & | Yes | Regex patterns for text splitting (e.g., GPT-2/GPT-4 tokenizer patterns) |
| offset | size_t & | Yes | Current byte offset in UTF-8 string for incremental decoding |
Outputs
| Name | Type | Description |
|---|---|---|
| len | size_t | Number of bytes in a UTF-8 sequence starting with the given byte |
| utf8_string | std::string | UTF-8 encoded string from a codepoint |
| codepoint | uint32_t | Decoded Unicode codepoint from UTF-8 bytes |
| codepoints | std::vector<uint32_t> | All decoded codepoints from a UTF-8 string |
| split_result | std::vector<std::string> | Text segments after regex-based splitting |
| is_han | bool | Whether a codepoint is a CJK/Han character |
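The NFD normalization mentioned above can be illustrated with a toy decomposition pass. The real `unicode_cpts_normalize_nfd` walks the canonical decomposition data from `unicode-data.h`; this sketch hardcodes a single entry and a hypothetical name:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy NFD pass: replace each codepoint that has a canonical decomposition
// with its decomposed sequence, leaving everything else untouched.
static std::vector<uint32_t> normalize_nfd_sketch(const std::vector<uint32_t> & cpts) {
    static const std::unordered_map<uint32_t, std::vector<uint32_t>> decomp = {
        { 0x00E9, { 0x0065, 0x0301 } }, // 'é' -> 'e' + combining acute accent
    };
    std::vector<uint32_t> out;
    for (const uint32_t cpt : cpts) {
        const auto it = decomp.find(cpt);
        if (it != decomp.end()) {
            out.insert(out.end(), it->second.begin(), it->second.end());
        } else {
            out.push_back(cpt);
        }
    }
    return out;
}
```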
Usage Examples
// Determine UTF-8 byte length
size_t len = unicode_len_utf8('\xC3'); // returns 2: 0xC3 is the lead byte of a 2-byte sequence
// Encode codepoint to UTF-8
std::string utf8 = unicode_cpt_to_utf8(0x00E9); // returns "é" (U+0301-free precomposed form, bytes C3 A9)
// Decode UTF-8 to codepoints
std::vector<uint32_t> cpts = unicode_cpts_from_utf8("Hello");
// NFD normalization
auto normalized = unicode_cpts_normalize_nfd(cpts);
// Regex-based splitting for BPE tokenization
std::vector<std::string> tokens = unicode_regex_split(
"Hello, world!", {"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+"});
// CJK character detection
bool is_cjk = unicode_cpt_is_han(0x4E2D); // true for Chinese character
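The Han detection shown above can be approximated with the standard CJK Unicode blocks. This is a sketch with a hypothetical name; `unicode_cpt_is_han` in unicode.cpp may cover additional extension blocks:

```cpp
#include <cstdint>

// Approximate Han detection over the main CJK ideograph blocks.
static bool is_han_sketch(uint32_t cpt) {
    return (cpt >= 0x4E00  && cpt <= 0x9FFF)   // CJK Unified Ideographs
        || (cpt >= 0x3400  && cpt <= 0x4DBF)   // Extension A
        || (cpt >= 0x20000 && cpt <= 0x2A6DF)  // Extension B
        || (cpt >= 0xF900  && cpt <= 0xFAFF);  // CJK Compatibility Ideographs
}
```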