Implementation:Ollama Ollama Llama Unicode Data

Knowledge Sources	Ollama
Domains	LLM Inference, Tokenization
Last Updated	2025-02-15 00:00 GMT

Overview

Auto-generated Unicode character property data tables used by the tokenizer for character classification, case conversion, and text normalization.

Description

Contains the large static data structures generated by scripts/gen-unicode-data.py: unicode_ranges_flags (codepoint ranges, each tagged with a category bitmask covering number, letter, separator, punctuation, symbol, control, etc.), unicode_set_whitespace (whitespace codepoints), unicode_map_lowercase and unicode_map_uppercase (case-mapping tables), and unicode_ranges_nfd (NFD normalization decomposition ranges). The file consists of roughly 7,000 lines of data tables (lines 1-7034 in the referenced revision).
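To make the idea of an NFD decomposition range concrete, the following is a minimal sketch rather than the actual implementation. The struct demo_range_nfd, the table demo_ranges_nfd, and the function demo_nfd_base are hypothetical names; the entry layout (an inclusive range plus the base codepoint it decomposes to) is an assumption, since the real element type is not shown on this page.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical entry layout (assumption): an inclusive codepoint range
// [first, last] and the base codepoint that every member decomposes to.
struct demo_range_nfd {
    uint32_t first;
    uint32_t last;
    uint32_t nfd;
};

// Tiny illustrative table, sorted by `first`. Not copied from the real file.
static const std::vector<demo_range_nfd> demo_ranges_nfd = {
    {0x00C0, 0x00C5, 0x0041},  // U+00C0..U+00C5 (accented A) -> U+0041 'A'
    {0x00C7, 0x00C7, 0x0043},  // U+00C7 (C with cedilla)     -> U+0043 'C'
    {0x00C8, 0x00CB, 0x0045},  // U+00C8..U+00CB (accented E) -> U+0045 'E'
};

// Return the base codepoint for cpt, or cpt itself when no range covers it.
uint32_t demo_nfd_base(uint32_t cpt) {
    // First entry whose range starts after cpt.
    auto it = std::upper_bound(
        demo_ranges_nfd.begin(), demo_ranges_nfd.end(), cpt,
        [](uint32_t value, const demo_range_nfd & r) { return value < r.first; });
    if (it == demo_ranges_nfd.begin()) {
        return cpt;  // cpt lies before every range
    }
    --it;  // candidate range: the last one starting at or before cpt
    return cpt <= it->last ? it->nfd : cpt;
}
```

Storing ranges instead of one entry per codepoint keeps the table short when many adjacent codepoints share the same base character.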

Usage

Provides the Unicode property data that the tokenizer needs to classify characters, convert case, and normalize text. Getting these properties right is critical for BPE and other tokenization schemes.
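As an illustration of how a table of case-mapping pairs might be consulted, here is a small sketch. The names demo_map_lowercase and demo_to_lower are hypothetical, the three entries are sample mappings rather than an excerpt of the real table, and the sketch assumes the pairs are sorted by source codepoint so that a binary search applies.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical miniature {uppercase, lowercase} table, sorted by the
// first member (assumption). Not copied from the real file.
static const std::vector<std::pair<uint32_t, uint32_t>> demo_map_lowercase = {
    {0x0041, 0x0061},  // 'A' -> 'a'
    {0x0042, 0x0062},  // 'B' -> 'b'
    {0x00C0, 0x00E0},  // U+00C0 -> U+00E0
};

// Return the lowercase mapping for cpt, or cpt unchanged if it has none.
uint32_t demo_to_lower(uint32_t cpt) {
    auto it = std::lower_bound(
        demo_map_lowercase.begin(), demo_map_lowercase.end(), cpt,
        [](const std::pair<uint32_t, uint32_t> & entry, uint32_t value) {
            return entry.first < value;
        });
    if (it != demo_map_lowercase.end() && it->first == cpt) {
        return it->second;
    }
    return cpt;
}
```

Codepoints without an entry pass through unchanged, which is why digits and already-lowercase letters need no rows in such a table.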

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/src/unicode-data.cpp
Lines: 1-7034

Signature

// Auto-generated data tables:
const std::initializer_list<std::pair<uint32_t, uint16_t>> unicode_ranges_flags = {
    // start, flags // last=next_start-1
    {0x000000, 0x0080},
    {0x000020, 0x0008},
    // ... thousands of entries
};

// Additional tables (declared in unicode-data.h):
// const std::initializer_list<uint32_t> unicode_set_whitespace;
// const std::initializer_list<std::pair<uint32_t, uint32_t>> unicode_map_lowercase;
// const std::initializer_list<std::pair<uint32_t, uint32_t>> unicode_map_uppercase;
// const std::initializer_list<...> unicode_ranges_nfd;
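The "last=next_start-1" comment in the excerpt implies that each entry covers every codepoint from its own start up to the one before the next entry's start. A hedged sketch of a lookup over that layout follows. The names demo_ranges_flags and demo_lookup_flags are hypothetical, and only the first two entries come from the excerpt; the remaining values are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical miniature table in the same {start, flags} layout.
// Each range ends one codepoint before the next entry's start.
static const std::vector<std::pair<uint32_t, uint16_t>> demo_ranges_flags = {
    {0x000000, 0x0080},  // from the excerpt
    {0x000020, 0x0008},  // from the excerpt
    {0x000021, 0x0020},  // illustrative value, not copied from the real file
    {0x000030, 0x0002},  // illustrative value, not copied from the real file
    {0x00003A, 0x0020},  // illustrative value, not copied from the real file
};

// Return the flags of the range containing cpt: find the first entry whose
// start is greater than cpt, then step back one entry.
uint16_t demo_lookup_flags(uint32_t cpt) {
    auto it = std::upper_bound(
        demo_ranges_flags.begin(), demo_ranges_flags.end(), cpt,
        [](uint32_t value, const std::pair<uint32_t, uint16_t> & entry) {
            return value < entry.first;
        });
    if (it == demo_ranges_flags.begin()) {
        return 0;  // cpt is below the first range start
    }
    return std::prev(it)->second;
}
```

Because only range starts are stored, a run of codepoints with identical flags costs a single table entry regardless of its length.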

Import

#include "unicode-data.h"

I/O Contract

Inputs

Name	Type	Required	Description
N/A	N/A	N/A	Static data tables, no runtime inputs

Outputs

Name	Type	Description
unicode_ranges_flags	initializer_list	Codepoint ranges with category flags
unicode_set_whitespace	initializer_list	Set of whitespace codepoints
unicode_map_lowercase	initializer_list	Lowercase mapping pairs
unicode_map_uppercase	initializer_list	Uppercase mapping pairs
unicode_ranges_nfd	initializer_list	NFD normalization decomposition ranges

Usage Examples

#include "unicode-data.h"

// Used internally by unicode.cpp for character classification.
// The data tables are consumed by unicode_cpt_flags_array(),
// which builds its flag table from them during initialization.
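How a consumer such as unicode_cpt_flags_array() might turn the compact tables into a flat per-codepoint array can be sketched as follows. This is an assumption-laden illustration, not the code in unicode.cpp: demo_expand_flags and DEMO_FLAG_WHITESPACE are hypothetical names, and the whitespace bit value is chosen only for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical bit used here to mark whitespace (assumption); the real
// flag layout is defined in the tokenizer sources, not in this sketch.
constexpr uint16_t DEMO_FLAG_WHITESPACE = 0x0100;

// Expand {start, flags} ranges into one flags value per codepoint, then
// mark the codepoints listed in the whitespace set.
std::vector<uint16_t> demo_expand_flags(
        const std::vector<std::pair<uint32_t, uint16_t>> & ranges,
        const std::vector<uint32_t> & whitespace,
        uint32_t max_codepoints) {
    std::vector<uint16_t> flags(max_codepoints, 0);
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        const uint32_t start = ranges[i].first;
        // A range ends where the next one starts; the last one runs to the end.
        uint32_t end = (i + 1 < ranges.size()) ? ranges[i + 1].first : max_codepoints;
        if (end > max_codepoints) {
            end = max_codepoints;
        }
        for (uint32_t cpt = start; cpt < end; ++cpt) {
            flags[cpt] = ranges[i].second;
        }
    }
    for (uint32_t cpt : whitespace) {
        if (cpt < max_codepoints) {
            flags[cpt] |= DEMO_FLAG_WHITESPACE;
        }
    }
    return flags;
}
```

Expanding once up front trades memory for constant-time lookups afterwards, which suits a tokenizer that classifies every codepoint of its input.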

Related Pages

Principle:Ollama_Ollama_Tokenization
