Implementation: ggml-org/llama.cpp Unicode
| Knowledge Sources | |
|---|---|
| Domains | Unicode, Tokenization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements Unicode text processing utilities required by the tokenization system, including UTF-8 encoding/decoding, codepoint classification, normalization, and regex-based text splitting.
Description
This file provides UTF-8 byte-level encoding/decoding (`unicode_cpt_to_utf8`, `unicode_cpt_from_utf8`, `unicode_len_utf8`), codepoint property lookup against the unicode-data tables (flags for letter, number, whitespace, etc.), NFD normalization, case conversion via lookup tables, Han character detection for CJK ranges, and a `unicode_regex_split` function that splits text according to regex patterns. It includes optimized fast paths for common BPE tokenizer patterns such as the GPT-2 and GPT-4 regexes, as well as custom patterns for the Kimi-K2 and AFMoE models, and it implements the byte-to-UTF-8 fallback encoding used by byte-level BPE tokenizers.
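The core encode/decode round trip can be sketched as a small self-contained re-implementation. This is illustrative only (simplified error handling, no overlong/truncation checks on decode); `cpt_to_utf8` and `cpt_from_utf8` are hypothetical names, not the library's:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode a Unicode codepoint as a 1-4 byte UTF-8 sequence.
static std::string cpt_to_utf8(uint32_t cpt) {
    std::string out;
    if (cpt <= 0x7F) {
        out += char(cpt);
    } else if (cpt <= 0x7FF) {
        out += char(0xC0 | (cpt >> 6));
        out += char(0x80 | (cpt & 0x3F));
    } else if (cpt <= 0xFFFF) {
        out += char(0xE0 |  (cpt >> 12));
        out += char(0x80 | ((cpt >>  6) & 0x3F));
        out += char(0x80 |  (cpt        & 0x3F));
    } else if (cpt <= 0x10FFFF) {
        out += char(0xF0 |  (cpt >> 18));
        out += char(0x80 | ((cpt >> 12) & 0x3F));
        out += char(0x80 | ((cpt >>  6) & 0x3F));
        out += char(0x80 |  (cpt        & 0x3F));
    } else {
        throw std::invalid_argument("codepoint out of range");
    }
    return out;
}

// Decode the codepoint starting at `offset`, advancing `offset` past it
// (mirrors the incremental-decoding shape of unicode_cpt_from_utf8).
static uint32_t cpt_from_utf8(const std::string & s, size_t & offset) {
    const uint8_t b = uint8_t(s[offset]);
    if (b < 0x80) { offset += 1; return b; }
    if ((b & 0xE0) == 0xC0) {
        const uint32_t cpt = ((b & 0x1F) << 6) | (uint8_t(s[offset + 1]) & 0x3F);
        offset += 2; return cpt;
    }
    if ((b & 0xF0) == 0xE0) {
        const uint32_t cpt = ((b & 0x0F) << 12) | ((uint8_t(s[offset + 1]) & 0x3F) << 6)
                           |  (uint8_t(s[offset + 2]) & 0x3F);
        offset += 3; return cpt;
    }
    const uint32_t cpt = ((b & 0x07) << 18) | ((uint8_t(s[offset + 1]) & 0x3F) << 12)
                       | ((uint8_t(s[offset + 2]) & 0x3F) << 6) | (uint8_t(s[offset + 3]) & 0x3F);
    offset += 4; return cpt;
}
```

The real functions additionally validate malformed input; the bit layout (lead byte marks length, continuation bytes carry 6 bits each) is the same.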
Usage
Use this module as the Unicode processing layer for tokenization. All tokenizer implementations (BPE, SentencePiece, WordPiece) rely on these utilities for text preprocessing, normalization, and splitting.
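For byte-level BPE in particular, the fallback must give every byte a visible, reversible codepoint. A sketch of the GPT-2-style mapping is below; `unicode_byte_to_utf8` builds the equivalent table once at startup, so treat this per-call recomputation (and the helper name `byte_fallback_cpt`) as illustrative:

```cpp
#include <cstdint>

// GPT-2-style byte fallback: printable bytes map to their own codepoint,
// all remaining bytes are shifted into the U+0100+ range in order.
static uint32_t byte_fallback_cpt(uint8_t byte) {
    auto printable = [](int b) {
        return (b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
    };
    if (printable(byte)) {
        return byte;
    }
    uint32_t n = 0; // index of `byte` among the non-printable bytes
    for (int b = 0; b < byte; ++b) {
        if (!printable(b)) n++;
    }
    return 0x100 + n;
}
```

Under this scheme the space byte 0x20 becomes U+0120 ('Ġ'), which is why GPT-2-family vocabularies show 'Ġ'-prefixed tokens for words preceded by a space.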
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: src/unicode.cpp
- Lines: 1-1097
Signature
// UTF-8 encoding/decoding
size_t unicode_len_utf8(char src);
std::string unicode_cpt_to_utf8(uint32_t cpt);
uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset);
std::vector<uint32_t> unicode_cpts_from_utf8(const std::string & utf8);
// Normalization and case conversion
std::vector<uint32_t> unicode_cpts_normalize_nfd(const std::vector<uint32_t> & cpts);
uint32_t unicode_tolower(uint32_t cpt);
// Character classification
bool unicode_cpt_is_han(uint32_t cpt);
// Byte-level BPE support
std::string unicode_byte_to_utf8(uint8_t byte);
// Regex-based text splitting
std::vector<std::string> unicode_regex_split(
const std::string & text, const std::vector<std::string> & regex_exprs);
// Optimized custom regex split implementations
static std::vector<size_t> unicode_regex_split_custom_gpt2(const std::string & text, const std::vector<size_t> & offsets);
static std::vector<size_t> unicode_regex_split_custom_llama3(const std::string & text, const std::vector<size_t> & offsets);
static std::vector<size_t> unicode_regex_split_custom_kimi_k2(const std::string & text, const std::vector<size_t> & offsets);
static std::vector<size_t> unicode_regex_split_custom_afmoe(const std::string & text, const std::vector<size_t> & offsets);
Import
#include "unicode.h"
#include "unicode-data.h"
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <regex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| src | char | Yes | Single byte for determining UTF-8 sequence length |
| cpt | uint32_t | Yes | Unicode codepoint for encoding to UTF-8 or classification |
| utf8 | const std::string & | Yes | UTF-8 encoded string for decoding or splitting |
| regex_exprs | const std::vector<std::string> & | Yes | Regex patterns for text splitting (e.g., GPT-2/GPT-4 tokenizer patterns) |
| offset | size_t & | Yes | Current byte offset in UTF-8 string for incremental decoding |
Outputs
| Name | Type | Description |
|---|---|---|
| len | size_t | Number of bytes in a UTF-8 sequence starting with the given byte |
| utf8_string | std::string | UTF-8 encoded string from a codepoint |
| codepoint | uint32_t | Decoded Unicode codepoint from UTF-8 bytes |
| codepoints | std::vector<uint32_t> | All decoded codepoints from a UTF-8 string |
| split_result | std::vector<std::string> | Text segments after regex-based splitting |
| is_han | bool | Whether a codepoint is a CJK/Han character |
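The NFD normalization mentioned above can be illustrated with a toy decomposition pass. The real `unicode_cpts_normalize_nfd` walks the canonical decomposition data from `unicode-data.h`; this sketch hardcodes a single entry and a hypothetical name:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy NFD pass: replace each codepoint that has a canonical decomposition
// with its decomposed sequence, leaving everything else untouched.
static std::vector<uint32_t> normalize_nfd_sketch(const std::vector<uint32_t> & cpts) {
    static const std::unordered_map<uint32_t, std::vector<uint32_t>> decomp = {
        { 0x00E9, { 0x0065, 0x0301 } }, // 'é' -> 'e' + combining acute accent
    };
    std::vector<uint32_t> out;
    for (const uint32_t cpt : cpts) {
        const auto it = decomp.find(cpt);
        if (it != decomp.end()) {
            out.insert(out.end(), it->second.begin(), it->second.end());
        } else {
            out.push_back(cpt);
        }
    }
    return out;
}
```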
Usage Examples
// Determine UTF-8 byte length
size_t len = unicode_len_utf8('\xC3'); // returns 2: 0xC3 is the lead byte of a 2-byte sequence
// Encode codepoint to UTF-8
std::string utf8 = unicode_cpt_to_utf8(0x00E9); // returns "é" (U+0301-free precomposed form, bytes C3 A9)
// Decode UTF-8 to codepoints
std::vector<uint32_t> cpts = unicode_cpts_from_utf8("Hello");
// NFD normalization
auto normalized = unicode_cpts_normalize_nfd(cpts);
// Regex-based splitting for BPE tokenization
std::vector<std::string> tokens = unicode_regex_split(
"Hello, world!", {"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+"});
// CJK character detection
bool is_cjk = unicode_cpt_is_han(0x4E2D); // true for Chinese character
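The Han detection shown above can be approximated with the standard CJK Unicode blocks. This is a sketch with a hypothetical name; `unicode_cpt_is_han` in unicode.cpp may cover additional extension blocks:

```cpp
#include <cstdint>

// Approximate Han detection over the main CJK ideograph blocks.
static bool is_han_sketch(uint32_t cpt) {
    return (cpt >= 0x4E00  && cpt <= 0x9FFF)   // CJK Unified Ideographs
        || (cpt >= 0x3400  && cpt <= 0x4DBF)   // Extension A
        || (cpt >= 0x20000 && cpt <= 0x2A6DF)  // Extension B
        || (cpt >= 0xF900  && cpt <= 0xFAFF);  // CJK Compatibility Ideographs
}
```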