Implementation:Ollama Ollama Llama Unicode Data
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Tokenization |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Auto-generated Unicode character property data tables used by the tokenizer for character classification, case conversion, and text normalization.
Description
Contains large static data structures generated by scripts/gen-unicode-data.py:
- unicode_ranges_flags: codepoint ranges with category bitmask flags (number, letter, separator, punctuation, symbol, control, etc.). Each entry stores a range start and its flags; the range runs up to the codepoint just before the next entry's start.
- unicode_set_whitespace: whitespace codepoints.
- unicode_map_lowercase and unicode_map_uppercase: case mapping tables.
- unicode_ranges_nfd: NFD normalization decomposition ranges.
The file is approximately 7000 lines of data tables.
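Because each unicode_ranges_flags entry stores only a range start (the signature's comment notes last=next_start-1), finding the flags for one codepoint amounts to finding the last entry whose start is not greater than that codepoint. The following is a minimal sketch of that idea using a binary search. The names sample_ranges_flags and sample_lookup_flags are hypothetical, only the first two rows come from the signature excerpt, and the flag values are treated as opaque. It is not the lookup code used in unicode.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Illustrative sample only: the first two rows are copied from the signature
// excerpt; the remaining rows are made up for this sketch.
const std::vector<std::pair<uint32_t, uint16_t>> sample_ranges_flags = {
    {0x000000, 0x0080},
    {0x000020, 0x0008},
    {0x000021, 0x0020},  // hypothetical row
    {0x110000, 0x0001},  // hypothetical end sentinel (one past the last codepoint)
};

// Returns the flags of the range that contains cpt.
// A range starts at its entry's codepoint and ends at next_start - 1.
uint16_t sample_lookup_flags(uint32_t cpt) {
    // The table must start at codepoint 0 so that every cpt has an owner.
    assert(!sample_ranges_flags.empty() && sample_ranges_flags.front().first == 0);
    // Find the first entry whose start is strictly greater than cpt ...
    auto it = std::upper_bound(
        sample_ranges_flags.begin(), sample_ranges_flags.end(), cpt,
        [](uint32_t value, const std::pair<uint32_t, uint16_t> & entry) {
            return value < entry.first;
        });
    // ... the entry just before it owns cpt.
    return std::prev(it)->second;
}
```

With this sample table, sample_lookup_flags(0x1F) yields the flags of the first row and sample_lookup_flags(0x20) yields those of the second, since the first range ends at 0x1F.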
Usage
These tables supply the Unicode property lookups the tokenizer needs to classify characters, convert case, and normalize text. BPE and other tokenization schemes depend on all three to split and match text correctly.
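As an illustration of how a case mapping table of (source, target) pairs can serve case conversion, here is a minimal sketch that looks up a codepoint with a binary search and falls back to the input when no mapping exists. It assumes the pairs are sorted by source codepoint. The names sample_map_lowercase and sample_tolower are hypothetical, and the four rows are a tiny hand-picked sample, not the generated data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sample of (uppercase, lowercase) pairs, sorted by the first member.
const std::vector<std::pair<uint32_t, uint32_t>> sample_map_lowercase = {
    {0x0041, 0x0061},  // LATIN CAPITAL LETTER A -> a
    {0x0042, 0x0062},  // LATIN CAPITAL LETTER B -> b
    {0x00C0, 0x00E0},  // LATIN CAPITAL LETTER A WITH GRAVE -> lowercase form
    {0x0391, 0x03B1},  // GREEK CAPITAL LETTER ALPHA -> lowercase alpha
};

// Returns the lowercase mapping of cpt, or cpt itself if the table has no entry.
uint32_t sample_tolower(uint32_t cpt) {
    // Binary search is only valid on a sorted table (assumption of this sketch).
    assert(std::is_sorted(sample_map_lowercase.begin(), sample_map_lowercase.end()));
    auto it = std::lower_bound(
        sample_map_lowercase.begin(), sample_map_lowercase.end(), cpt,
        [](const std::pair<uint32_t, uint32_t> & entry, uint32_t value) {
            return entry.first < value;
        });
    if (it != sample_map_lowercase.end() && it->first == cpt) {
        return it->second;
    }
    return cpt;  // no mapping: already lowercase or caseless
}
```

Returning the input unchanged for unmapped codepoints keeps the function total, so callers can apply it to every codepoint of a string without checking first.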
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/unicode-data.cpp
- Lines: 1-7034
Signature
// Auto-generated data tables:
const std::initializer_list<std::pair<uint32_t, uint16_t>> unicode_ranges_flags = {
// start, flags // last=next_start-1
{0x000000, 0x0080},
{0x000020, 0x0008},
// ... thousands of entries
};
// Additional tables (declared in unicode-data.h):
// const std::initializer_list<uint32_t> unicode_set_whitespace;
// const std::initializer_list<std::pair<uint32_t, uint32_t>> unicode_map_lowercase;
// const std::initializer_list<std::pair<uint32_t, uint32_t>> unicode_map_uppercase;
// const std::initializer_list<...> unicode_ranges_nfd;
Import
#include "unicode-data.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| N/A | N/A | N/A | Static data tables, no runtime inputs |
Outputs
| Name | Type | Description |
|---|---|---|
| unicode_ranges_flags | initializer_list | Codepoint ranges with category flags |
| unicode_set_whitespace | initializer_list | Set of whitespace codepoints |
| unicode_map_lowercase | initializer_list | Lowercase mapping pairs |
| unicode_map_uppercase | initializer_list | Uppercase mapping pairs |
| unicode_ranges_nfd | initializer_list | NFD normalization decomposition ranges |
Usage Examples
#include "unicode-data.h"
// Used internally by unicode.cpp for character classification:
// unicode_cpt_flags_array() reads these data tables when it
// builds the codepoint flags array.
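To make the consumption step more concrete, the sketch below shows one plausible way a start-only range table and a whitespace set could be expanded into a flat array with one flags entry per codepoint. It is an illustration under stated assumptions, not the code in unicode.cpp: the names sample_ranges, sample_whitespace, sample_build_flags_array, and the SAMPLE_FLAG_WHITESPACE bit are hypothetical, and the tables cover ASCII only to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical bit used by this sketch to mark whitespace; not taken from the source.
constexpr uint16_t SAMPLE_FLAG_WHITESPACE = 0x0100;

// Illustrative tables covering ASCII only. The last row of the range table is
// an end sentinel: its start is one past the final codepoint covered.
const std::vector<std::pair<uint32_t, uint16_t>> sample_ranges = {
    {0x00, 0x0080},
    {0x20, 0x0008},
    {0x21, 0x0020},  // hypothetical row
    {0x80, 0x0001},  // hypothetical end sentinel
};
const std::vector<uint32_t> sample_whitespace = {0x09, 0x0A, 0x20};

// Expands (start, flags) ranges into one flags entry per codepoint,
// then ORs in the whitespace bit for every listed codepoint.
std::vector<uint16_t> sample_build_flags_array() {
    assert(sample_ranges.size() >= 2 && sample_ranges.front().first == 0);
    const uint32_t limit = sample_ranges.back().first;
    std::vector<uint16_t> flags(limit, 0);
    for (size_t i = 1; i < sample_ranges.size(); ++i) {
        const auto & cur  = sample_ranges[i - 1];
        const auto & next = sample_ranges[i];
        // Each range runs from its own start up to next.first - 1.
        for (uint32_t cpt = cur.first; cpt < next.first; ++cpt) {
            flags[cpt] = cur.second;
        }
    }
    for (uint32_t cpt : sample_whitespace) {
        if (cpt < limit) {
            flags[cpt] |= SAMPLE_FLAG_WHITESPACE;
        }
    }
    return flags;
}
```

Expanding the ranges once trades memory for constant-time lookups afterwards: classifying a codepoint becomes a single array index instead of a search through the range table.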