Implementation:Ollama Ollama Llama Unicode Data

From Leeroopedia
Knowledge Sources
Domains LLM Inference, Tokenization
Last Updated 2025-02-15 00:00 GMT

Overview

Auto-generated Unicode character property data tables used by the tokenizer for character classification, case conversion, and text normalization.

Description

Contains large static data structures generated by scripts/gen-unicode-data.py, including:

  • unicode_ranges_flags: codepoint ranges with category bitmask flags (number, letter, separator, punctuation, symbol, control, etc.)
  • unicode_set_whitespace: whitespace codepoints
  • unicode_map_lowercase / unicode_map_uppercase: case mapping tables
  • unicode_ranges_nfd: NFD normalization decomposition ranges

The file is approximately 7000 lines of data tables.
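The range encoding of unicode_ranges_flags (each entry stores a start codepoint and its flags, and a range implicitly ends one codepoint before the next entry's start) can be illustrated with a minimal sketch. The table contents, the flag bit names and values, and the expand_ranges helper below are hypothetical stand-ins chosen for illustration; they are not the generated data or the function used in unicode.cpp.

```cpp
#include <cstdint>
#include <initializer_list>
#include <utility>
#include <vector>

// Hypothetical flag bits, for illustration only.
constexpr uint16_t FLAG_SEPARATOR   = 0x0008;
constexpr uint16_t FLAG_PUNCTUATION = 0x0020;
constexpr uint16_t FLAG_CONTROL     = 0x0080;

// Tiny stand-in table in the same {start, flags} layout.
// Each range ends at the next entry's start minus one; the final
// entry only serves as a sentinel that closes the previous range.
const std::initializer_list<std::pair<uint32_t, uint16_t>> sample_ranges_flags = {
    {0x000000, FLAG_CONTROL},
    {0x000020, FLAG_SEPARATOR},
    {0x000021, FLAG_PUNCTUATION},
    {0x000024, 0x0000}, // sentinel
};

// Expand the compact range table into one flags value per codepoint.
std::vector<uint16_t> expand_ranges(
        const std::initializer_list<std::pair<uint32_t, uint16_t>> & ranges,
        uint32_t max_codepoints) {
    std::vector<uint16_t> flags(max_codepoints, 0);
    if (ranges.size() == 0) {
        return flags;
    }
    auto it = ranges.begin();
    for (auto next = it + 1; next != ranges.end(); ++it, ++next) {
        // Fill [it->first, next->first) with the flags of the current entry.
        for (uint32_t cpt = it->first; cpt < next->first && cpt < max_codepoints; ++cpt) {
            flags[cpt] = it->second;
        }
    }
    return flags;
}
```

Storing only range starts keeps the table compact, since long runs of codepoints share the same category; expanding once up front trades memory for constant-time lookups afterwards.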

Usage

This file provides the Unicode property lookup data that the tokenizer needs to classify characters, convert case, and normalize text. BPE and other tokenization schemes depend on these lookups to split and match text correctly.
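As an illustration of the case conversion lookup, the sketch below performs a binary search over a small table of (source, target) codepoint pairs. It assumes the table is sorted by its first element; the sample_map_lowercase table and the sample_tolower helper are hypothetical and are not taken from the Ollama sources.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>
#include <utility>

// Hypothetical stand-in for a lowercase mapping table,
// sorted by the source codepoint (first element).
const std::initializer_list<std::pair<uint32_t, uint32_t>> sample_map_lowercase = {
    {0x0041, 0x0061}, // U+0041 'A' -> U+0061 'a'
    {0x0042, 0x0062}, // U+0042 'B' -> U+0062 'b'
    {0x00C0, 0x00E0}, // U+00C0 -> U+00E0 (A with grave, upper -> lower)
};

// Return the lowercase mapping of cpt, or cpt itself if the table has no entry.
uint32_t sample_tolower(uint32_t cpt) {
    auto it = std::lower_bound(
        sample_map_lowercase.begin(), sample_map_lowercase.end(), cpt,
        [](const std::pair<uint32_t, uint32_t> & entry, uint32_t value) {
            return entry.first < value;
        });
    if (it != sample_map_lowercase.end() && it->first == cpt) {
        return it->second;
    }
    return cpt;
}
```

Codepoints without an entry fall through unchanged, so the table only needs to list characters that actually have a case mapping.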

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/src/unicode-data.cpp
  • Lines: 1-7034

Signature

// Auto-generated data tables:
const std::initializer_list<std::pair<uint32_t, uint16_t>> unicode_ranges_flags = {
    // start, flags // last=next_start-1
    {0x000000, 0x0080},
    {0x000020, 0x0008},
    // ... thousands of entries
};

// Additional tables (declared in unicode-data.h):
// const std::initializer_list<uint32_t> unicode_set_whitespace;
// const std::initializer_list<std::pair<uint32_t, uint32_t>> unicode_map_lowercase;
// const std::initializer_list<std::pair<uint32_t, uint32_t>> unicode_map_uppercase;
// const std::initializer_list<...> unicode_ranges_nfd;
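The element type of unicode_ranges_nfd is elided above, so the following is only a sketch of how a range-based decomposition table could be queried. It assumes a hypothetical element struct (sample_range_nfd) that maps every codepoint in an inclusive range to a single base codepoint, with ranges sorted and non-overlapping. The struct, the sample data, and sample_nfd_base are illustrative assumptions, not the declarations in unicode-data.h.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>

// Hypothetical element type: codepoints in [first, last] map to `nfd`.
struct sample_range_nfd {
    uint32_t first;
    uint32_t last;
    uint32_t nfd;
};

// Hypothetical sample data, sorted by `first`, non-overlapping.
const std::initializer_list<sample_range_nfd> sample_ranges_nfd = {
    {0x00C0, 0x00C5, 0x0041}, // accented capital A variants -> 'A'
    {0x00C8, 0x00CB, 0x0045}, // accented capital E variants -> 'E'
    {0x00E0, 0x00E5, 0x0061}, // accented small a variants   -> 'a'
};

// Return the base codepoint for cpt, or cpt itself if no range contains it.
uint32_t sample_nfd_base(uint32_t cpt) {
    // Find the first range that starts after cpt; the candidate is the one before it.
    auto it = std::upper_bound(
        sample_ranges_nfd.begin(), sample_ranges_nfd.end(), cpt,
        [](uint32_t value, const sample_range_nfd & range) {
            return value < range.first;
        });
    if (it == sample_ranges_nfd.begin()) {
        return cpt; // cpt lies before every range
    }
    --it;
    return cpt <= it->last ? it->nfd : cpt;
}
```

Checking the upper bound of the candidate range matters because gaps between ranges (for example U+00C6 in the sample data) must be left unchanged.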

Import

#include "unicode-data.h"

I/O Contract

Inputs

Name | Type | Required | Description
N/A | N/A | N/A | Static data tables; no runtime inputs

Outputs

Name | Type | Description
unicode_ranges_flags | initializer_list | Codepoint ranges with category flags
unicode_set_whitespace | initializer_list | Set of whitespace codepoints
unicode_map_lowercase | initializer_list | Lowercase mapping pairs
unicode_map_uppercase | initializer_list | Uppercase mapping pairs
unicode_ranges_nfd | initializer_list | NFD normalization decomposition ranges

Usage Examples

#include "unicode-data.h"

// The tables are not called directly; unicode.cpp reads them for
// character classification. For example, unicode_cpt_flags_array()
// consumes unicode_ranges_flags when it initializes its flag array.
