Implementation:Ollama Ollama Llama Unicode

From Leeroopedia
Knowledge Sources
Domains LLM Inference, Tokenization
Last Updated 2025-02-15 00:00 GMT

Overview

Unicode text processing utilities for the llama.cpp tokenizer, providing UTF-8 encoding/decoding, codepoint classification, case conversion, NFD normalization, and regex-based text splitting.

Description

Implements UTF-8/UTF-16/UTF-32 encoding and decoding, codepoint classification backed by the data tables in unicode-data.cpp, case conversion, NFD normalization, and regex-based text splitting for BPE tokenization. Key functions:

  • unicode_cpt_from_utf8: decodes one codepoint from UTF-8 text at a given byte offset.
  • unicode_cpt_to_utf8: encodes a codepoint as a UTF-8 string.
  • unicode_cpt_flags_from_cpt: looks up the character property flags of a codepoint.
  • unicode_regex_split: splits text using the regex patterns defined for a tokenizer vocabulary.
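The decode and encode steps named above can be sketched in a few lines. This is a minimal illustration of standard UTF-8 handling, not the code in unicode.cpp: the names sketch_len_utf8, sketch_cpt_from_utf8 and sketch_cpt_to_utf8 are hypothetical, and the choice to throw std::invalid_argument on malformed input is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: byte length of a UTF-8 sequence, judged from its lead byte.
static size_t sketch_len_utf8(char src) {
    const auto b = static_cast<unsigned char>(src);
    if (b < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1; // continuation or invalid lead byte: reported as one byte (assumption)
}

// Hypothetical helper: decode the codepoint starting at `offset` and advance `offset` past it.
static uint32_t sketch_cpt_from_utf8(const std::string & utf8, size_t & offset) {
    assert(offset < utf8.size());
    const auto lead = static_cast<unsigned char>(utf8[offset]);
    const size_t len = sketch_len_utf8(utf8[offset]);
    if ((lead & 0xC0) == 0x80 || lead >= 0xF8 || offset + len > utf8.size()) {
        throw std::invalid_argument("invalid or truncated UTF-8 sequence");
    }
    // Payload bits carried by the lead byte, indexed by sequence length.
    static const unsigned char lead_mask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
    uint32_t cpt = lead & lead_mask[len];
    for (size_t i = 1; i < len; ++i) {
        const auto c = static_cast<unsigned char>(utf8[offset + i]);
        if ((c & 0xC0) != 0x80) {
            throw std::invalid_argument("bad UTF-8 continuation byte");
        }
        cpt = (cpt << 6) | (c & 0x3F); // each continuation byte adds 6 bits
    }
    offset += len;
    return cpt;
}

// Hypothetical helper: encode one codepoint as a UTF-8 byte string.
static std::string sketch_cpt_to_utf8(uint32_t cpt) {
    std::string out;
    if (cpt <= 0x7F) {
        out += static_cast<char>(cpt);
    } else if (cpt <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cpt >> 6));
        out += static_cast<char>(0x80 | (cpt & 0x3F));
    } else if (cpt <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cpt >> 12));
        out += static_cast<char>(0x80 | ((cpt >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cpt & 0x3F));
    } else if (cpt <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cpt >> 18));
        out += static_cast<char>(0x80 | ((cpt >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cpt >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cpt & 0x3F));
    } else {
        throw std::invalid_argument("codepoint out of Unicode range");
    }
    return out;
}
```

Calling sketch_cpt_from_utf8 in a loop while offset is smaller than the string size walks the text one codepoint at a time, which is the access pattern a tokenizer needs before it can classify characters.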

Usage

This file is a core utility for tokenization in llama.cpp. Tokenizers call these functions to decode UTF-8 input, classify codepoints, and split text with the regex patterns of their vocabulary before BPE tokenization.

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/src/unicode.cpp
  • Lines: 1-1159

Signature

size_t unicode_len_utf8(char src);

static std::string unicode_cpts_to_utf8(const std::vector<uint32_t> & cps);
uint32_t unicode_cpt_from_utf8(const std::string & utf8, size_t & offset);

static std::vector<unicode_cpt_flags> unicode_cpt_flags_array();

// Public API (declared in unicode.h):
// std::string unicode_cpt_to_utf8(uint32_t cpt);
// unicode_cpt_flags unicode_cpt_flags_from_cpt(uint32_t cpt);
// std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regex_exprs);

Import

#include "unicode.h"

I/O Contract

Inputs

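Name | Type | Required | Description
src | char | Yes | First byte of a UTF-8 sequence (for unicode_len_utf8)
utf8 | const std::string & | Yes | UTF-8 encoded text input
offset | size_t & | Yes | Byte offset into utf8 at which to decode
cpt | uint32_t | Yes | Unicode codepoint
text | const std::string & | Yes | Text for regex splitting
regex_exprs | const std::vector<std::string> & | Yes | Regex patterns used for splitting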

Outputs

Name | Type | Description
codepoint | uint32_t | Decoded Unicode codepoint
utf8 string | std::string | Encoded UTF-8 string
flags | unicode_cpt_flags | Character property flags
split | std::vector<std::string> | Regex-split text segments
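To make the flags output more concrete, here is a minimal sketch of one common way to map a codepoint to property flags: a sorted table of range starts searched with std::upper_bound. Everything in it is an assumption for illustration. The names sketch_flag, sketch_range and sketch_flags_from_cpt are hypothetical, the table covers ASCII only, and this page does not state that unicode.cpp performs its lookup this way.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// Hypothetical property bits; the real unicode_cpt_flags type is declared in unicode.h.
enum sketch_flag : uint8_t {
    SK_NONE       = 0,
    SK_LETTER     = 1,
    SK_NUMBER     = 2,
    SK_WHITESPACE = 4,
};

// Each entry applies from `first` up to (but not including) the next entry's `first`.
struct sketch_range {
    uint32_t first;
    uint8_t  flags;
};

// Tiny hand-written table covering ASCII only (illustrative data, not from unicode-data.cpp).
static const sketch_range k_sketch_ranges[] = {
    {0x0000, SK_NONE},
    {0x0009, SK_WHITESPACE}, // tab, LF, VT, FF, CR
    {0x000E, SK_NONE},
    {0x0020, SK_WHITESPACE}, // space
    {0x0021, SK_NONE},
    {0x0030, SK_NUMBER},     // 0-9
    {0x003A, SK_NONE},
    {0x0041, SK_LETTER},     // A-Z
    {0x005B, SK_NONE},
    {0x0061, SK_LETTER},     // a-z
    {0x007B, SK_NONE},       // everything above is unclassified in this sketch
};

// Hypothetical lookup: find the last range whose start is <= cpt.
static uint8_t sketch_flags_from_cpt(uint32_t cpt) {
    const auto it = std::upper_bound(
        std::begin(k_sketch_ranges), std::end(k_sketch_ranges), cpt,
        [](uint32_t value, const sketch_range & r) { return value < r.first; });
    // The table starts at 0, so `it` is never the first element.
    return std::prev(it)->flags;
}
```

A range table keeps the data compact because long runs of codepoints with identical properties collapse into a single entry, at the cost of a binary search per lookup.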

Usage Examples

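#include "unicode.h"

// Assumed to be defined by the caller:
//   std::string text                      - non-empty UTF-8 encoded input
//   std::vector<std::string> regex_exprs  - splitting patterns for the target vocabulary

// Get the byte length of the UTF-8 sequence that starts with this byte:
size_t len = unicode_len_utf8(text[0]);

// Decode the codepoint at `offset` (passed by non-const reference):
size_t offset = 0;
uint32_t cp = unicode_cpt_from_utf8(text, offset);

// Character classification:
unicode_cpt_flags flags = unicode_cpt_flags_from_cpt(cp);
bool is_letter = flags.is_letter;

// Regex split for BPE:
std::vector<std::string> parts = unicode_regex_split(text, regex_exprs);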
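
For readers who want to see what a regex-based pre-split produces, the following self-contained sketch splits text into letter runs, digit runs, whitespace runs and punctuation runs with std::regex. It is a deliberately simplified stand-in: the function name sketch_regex_split and the ASCII-only pattern are assumptions, and real tokenizer patterns rely on Unicode categories, which std::regex does not support directly.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` into consecutive pieces, one per regex match.
// The alternatives cover every ASCII character, so the pieces concatenate back to `text`.
static std::vector<std::string> sketch_regex_split(const std::string & text) {
    static const std::regex pattern(R"([A-Za-z]+|[0-9]+|\s+|[^A-Za-z0-9\s]+)");
    std::vector<std::string> pieces;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        pieces.push_back(it->str());
    }
    return pieces;
}
```

With this pattern, sketch_regex_split("Hello, world 42") yields the six pieces "Hello", ",", " ", "world", " " and "42". A BPE tokenizer would then apply its merges inside each piece, so merges never cross a piece boundary.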
