Implementation:Ollama Ollama Llama Unicode
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Tokenization |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Unicode text processing utilities for the llama.cpp tokenizer, providing UTF-8 encoding/decoding, codepoint classification, case conversion, NFD normalization, and regex-based text splitting.
Description
Implements UTF-8/UTF-16/UTF-32 encoding and decoding, codepoint classification backed by the data tables in unicode-data.cpp, case conversion, NFD normalization, and regex-based text splitting for BPE tokenization. Key functions:
- unicode_cpt_from_utf8: decodes one codepoint from a UTF-8 string, starting at a caller-supplied byte offset
- unicode_cpt_to_utf8: encodes a codepoint as UTF-8
- unicode_cpt_flags_from_cpt: looks up a codepoint's character properties
- unicode_regex_split: splits text with the regex patterns used by the various tokenizer vocabularies
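As a rough illustration of the UTF-8 decoding and encoding described above, the following is a minimal standalone sketch. It is not the code from unicode.cpp: the `sketch_*` names are hypothetical, and the sketch does not reject overlong encodings or surrogate codepoints.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: sequence length implied by a UTF-8 lead byte.
// Continuation bytes and invalid lead bytes map to 1 so callers can still advance.
size_t sketch_len_utf8(char src) {
    const unsigned char b = static_cast<unsigned char>(src);
    if (b < 0x80)           return 1; // 0xxxxxxx (ASCII)
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;
}

// Hypothetical helper: decode one codepoint at `offset` and advance `offset`
// past the bytes that were consumed.
uint32_t sketch_cpt_from_utf8(const std::string & utf8, size_t & offset) {
    assert(offset < utf8.size()); // precondition: there is at least one byte left
    const unsigned char b0 = static_cast<unsigned char>(utf8[offset]);
    const size_t len = sketch_len_utf8(utf8[offset]);
    if (b0 >= 0x80 && len == 1) {
        throw std::invalid_argument("invalid lead byte");
    }
    if (offset + len > utf8.size()) {
        throw std::invalid_argument("truncated sequence");
    }
    // Payload bits carried by the lead byte, indexed by sequence length.
    static const unsigned char lead_mask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
    uint32_t cpt = b0 & lead_mask[len];
    for (size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(utf8[offset + i]);
        if ((b & 0xC0) != 0x80) {
            throw std::invalid_argument("bad continuation byte");
        }
        cpt = (cpt << 6) | (b & 0x3F); // each continuation byte adds 6 bits
    }
    offset += len;
    return cpt;
}

// Hypothetical helper: encode one codepoint as 1 to 4 UTF-8 bytes.
std::string sketch_cpt_to_utf8(uint32_t cpt) {
    std::string out;
    if (cpt <= 0x7F) {
        out.push_back(static_cast<char>(cpt));
    } else if (cpt <= 0x7FF) {
        out.push_back(static_cast<char>(0xC0 | (cpt >> 6)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else if (cpt <= 0xFFFF) {
        out.push_back(static_cast<char>(0xE0 | (cpt >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else if (cpt <= 0x10FFFF) {
        out.push_back(static_cast<char>(0xF0 | (cpt >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cpt >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cpt & 0x3F)));
    } else {
        throw std::invalid_argument("codepoint out of range");
    }
    return out;
}
```

Passing the offset by reference lets a caller walk a string one codepoint at a time by calling the decoder in a loop until the offset reaches the string's size.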
Usage
This file is a core utility that underpins tokenization in llama.cpp. Every model's tokenizer relies on these functions to decode, classify, and split text correctly, particularly for non-ASCII input.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/unicode.cpp
- Lines: 1-1159
Signature
size_t unicode_len_utf8(char src);
static std::string unicode_cpts_to_utf8(const std::vector<uint32_t> & cps);
uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset);
static std::vector<unicode_cpt_flags> unicode_cpt_flags_array();
// Further public API declared in unicode.h (the non-static functions above are declared there too):
// std::string unicode_cpt_to_utf8(uint32_t cpt);
// unicode_cpt_flags unicode_cpt_flags_from_cpt(uint32_t cpt);
// std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regex_exprs);
Import
#include "unicode.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
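| src | char | Yes | First byte of a UTF-8 sequence (for unicode_len_utf8) |
| utf8 | const std::string & | Yes | UTF-8 encoded text input |
| offset | size_t & | Yes | Byte position in utf8 at which decoding starts |
| cpt | uint32_t | Yes | Unicode codepoint |
| text | const std::string & | Yes | Text for regex splitting |
| regex_exprs | const std::vector<std::string> & | Yes | Regex patterns used for splitting |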
Outputs
| Name | Type | Description |
|---|---|---|
| codepoint | uint32_t | Decoded Unicode codepoint |
| utf8 string | std::string | Encoded UTF-8 string |
| flags | unicode_cpt_flags | Character property flags |
| split | std::vector<std::string> | Regex-split text segments |
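To make the flags output more concrete, here is a small table-driven lookup sketch. It only illustrates the general idea of classifying codepoints from a sorted range table. The table layout, the `sketch_*` names, and the flag values are all assumptions made for this example, and only part of ASCII is covered. The actual tables live in unicode-data.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Hypothetical flag bits; not the values used by unicode_cpt_flags.
enum sketch_flag : uint16_t {
    SKETCH_UNCLASSIFIED = 0x0001, // not covered by this sketch
    SKETCH_NUMBER       = 0x0002,
    SKETCH_LETTER       = 0x0004,
    SKETCH_WHITESPACE   = 0x0008,
};

struct sketch_range {
    uint32_t start; // first codepoint of the range
    uint16_t flags; // flags for every codepoint up to the next entry's start
};

// Sorted by `start`. The first entry must start at 0 so every lookup succeeds.
static const sketch_range sketch_ranges[] = {
    {0x00, SKETCH_UNCLASSIFIED},
    {0x09, SKETCH_WHITESPACE},   // TAB..CR
    {0x0E, SKETCH_UNCLASSIFIED},
    {0x20, SKETCH_WHITESPACE},   // space
    {0x21, SKETCH_UNCLASSIFIED},
    {0x30, SKETCH_NUMBER},       // 0-9
    {0x3A, SKETCH_UNCLASSIFIED},
    {0x41, SKETCH_LETTER},       // A-Z
    {0x5B, SKETCH_UNCLASSIFIED},
    {0x61, SKETCH_LETTER},       // a-z
    {0x7B, SKETCH_UNCLASSIFIED}, // everything above 'z'
};

// Binary-search the table for the range that contains `cpt`.
uint16_t sketch_flags_from_cpt(uint32_t cpt) {
    assert(cpt <= 0x10FFFF); // precondition: a valid Unicode codepoint
    // Find the first range that starts after cpt, then step back one entry.
    const sketch_range * it = std::upper_bound(
        std::begin(sketch_ranges), std::end(sketch_ranges), cpt,
        [](uint32_t value, const sketch_range & r) { return value < r.start; });
    return std::prev(it)->flags;
}
```

A range table keeps the data compact. An implementation that needs constant-time lookups could expand such a table into a flat array indexed by codepoint, trading memory for speed.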
Usage Examples
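#include "unicode.h"
#include <string>
#include <vector>
// Example inputs. The pattern is a GPT-2 style expression shown for illustration; real patterns depend on the tokenizer vocabulary.
const std::string text = "Hello world";
const std::vector<std::string> regex_exprs = { "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)" };
// Get the byte length of the UTF-8 sequence that starts at the first byte:
size_t len = unicode_len_utf8(text[0]);
// Decode one codepoint; offset is passed by reference:
size_t offset = 0;
uint32_t cp = unicode_cpt_from_utf8(text, offset);
// Character classification:
unicode_cpt_flags flags = unicode_cpt_flags_from_cpt(cp);
bool is_letter = flags.is_letter;
// Regex split for BPE:
std::vector<std::string> parts = unicode_regex_split(text, regex_exprs);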
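The regex-splitting step can be approximated with the standard library for ASCII-only input. The sketch below is a simplified stand-in, not the implementation in unicode.cpp: `sketch_regex_split` is a hypothetical name, and it assumes each pattern further splits the segments produced by the previous one, keeping unmatched gaps so that no input bytes are dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: apply each regex in turn, splitting every current
// segment into its consecutive matches plus any unmatched gaps between them.
std::vector<std::string> sketch_regex_split(const std::string & text,
                                            const std::vector<std::string> & regex_exprs) {
    std::vector<std::string> segments;
    if (!text.empty()) {
        segments.push_back(text);
    }
    for (const std::string & expr : regex_exprs) {
        const std::regex re(expr);
        std::vector<std::string> next;
        for (const std::string & seg : segments) {
            size_t last = 0; // end of the previous match within seg
            for (std::sregex_iterator it(seg.begin(), seg.end(), re), end; it != end; ++it) {
                const size_t pos = static_cast<size_t>(it->position());
                const size_t len = static_cast<size_t>(it->length());
                if (pos > last) {
                    next.push_back(seg.substr(last, pos - last)); // unmatched gap
                }
                if (len > 0) {
                    next.push_back(it->str()); // the match itself
                }
                last = pos + len;
            }
            if (last < seg.size()) {
                next.push_back(seg.substr(last)); // trailing unmatched text
            }
        }
        segments.swap(next);
    }
    // Invariant of this sketch: the segments concatenate back to the input.
    size_t total = 0;
    for (const std::string & s : segments) {
        total += s.size();
    }
    assert(total == text.size());
    return segments;
}
```

With the ASCII pattern `" ?[A-Za-z]+| ?[0-9]+|\\s+"`, the input `"Hello world 123"` splits into `"Hello"`, `" world"`, and `" 123"`. Note that `std::regex` in its default ECMAScript mode does not support Unicode property classes such as `\p{L}`, so patterns that use them need extra handling beyond this sketch.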