Heuristic: ML.NET Tokenizer Caching Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, NLP |
| Last Updated | 2026-02-09 11:00 GMT |
Overview
Cache tokenization results for words of 15 characters or fewer, and use a minimal FileStream buffer to reduce file-locking contention when loading images in parallel.
Description
ML.NET's Tiktoken tokenizer caches encoding results for words up to 15 characters in length. Longer words are re-encoded on every call, because they appear infrequently and caching them would spend memory on entries with low hit rates. In the image pipeline, the image loader opens its FileStream with a buffer size of 1 to avoid allocating an unnecessary internal buffer and to reduce file-locking contention when images are loaded in parallel.
Usage
Use this heuristic when optimizing tokenization throughput for text processing pipelines, or when loading images in multi-threaded scenarios. The 15-character cache limit applies to BPE (Byte Pair Encoding) tokenizers where short common words dominate the vocabulary.
The Insight (Rule of Thumb)
Tokenizer Cache Limit:
- Action: Cache tokenization results for short words only.
- Value: `MaxWordLengthToCache = 15` characters.
- Trade-off: Memory savings from not caching long words vs. recomputation cost. Short words (articles, prepositions, common nouns) are high-frequency and benefit most from caching.
FileStream Buffer Minimization:
- Action: Use `bufferSize = 1` for image file loading to avoid file locking.
- Value: A minimal buffer avoids allocating FileStream's internal buffer and shortens the time the OS file handle is held.
- Trade-off: Slightly less efficient I/O per file, but reduced locking contention in parallel image loading pipelines.
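The read-then-release pattern behind the minimal buffer can be sketched as follows. This is a Python illustration of the intent only; the actual ML.NET code constructs a C# FileStream, and the helper name here is hypothetical:

```python
import io

def load_image_stream(path: str) -> io.BytesIO:
    """Read the file in one shot and release the OS handle immediately.

    Mirrors the intent of a minimal FileStream buffer: skip the
    unnecessary intermediate buffer and hold the file handle (and any
    lock on it) as briefly as possible, so parallel loaders do not
    contend on the same files.
    """
    with open(path, "rb", buffering=0) as f:  # unbuffered binary read
        data = f.read()
    # Decode from memory; no file handle remains open during decoding.
    return io.BytesIO(data)
```

Because the handle is closed before decoding starts, another thread (or the OS) can open, move, or delete the file while the image is still being processed.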
Reasoning
In natural language, word frequency follows Zipf's law: a small number of short words account for the majority of all word occurrences. Words of 15 characters or fewer include virtually all function words and most content words in English and similar languages. Caching these yields excellent hit rates (typically well above 90%) with bounded memory usage. Words longer than 15 characters are typically rare technical terms, URLs, or compound words that seldom repeat.
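The hit-rate claim can be checked with a toy simulation. The vocabulary and Zipf weights below are made up for illustration; only the length gate matches the heuristic:

```python
import random

random.seed(0)

# Toy Zipf-like vocabulary: the word at rank r is drawn with probability ~ 1/r.
vocab = ["the", "of", "and", "to", "in", "is", "tokenization",
         "internationalization", "electroencephalography"]
weights = [1.0 / rank for rank in range(1, len(vocab) + 1)]
corpus = random.choices(vocab, weights=weights, k=10_000)

MAX_WORD_LENGTH_TO_CACHE = 15
cache: set[str] = set()
hits = 0
for word in corpus:
    if len(word) <= MAX_WORD_LENGTH_TO_CACHE:
        if word in cache:
            hits += 1
        else:
            cache.add(word)

hit_rate = hits / len(corpus)
```

Even with only a handful of short words, most occurrences hit the cache, while the rare long words never enter it.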
The FileStream buffer minimization addresses a specific Windows file-locking issue: the default 4 KB buffer keeps the file handle open longer than necessary, causing contention when multiple threads load different images simultaneously.
Code Evidence
Tokenizer cache limit from `src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs:32`:
```csharp
private const int MaxWordLengthToCache = 15;
```
FileStream buffer minimization from `src/Microsoft.ML.ImageAnalytics/ImageLoader.cs:250`:
```csharp
// to avoid locking file, use the construct below to load the image
```
Buffer size from `src/Microsoft.ML.ImageAnalytics/ImageLoader.cs:301`:
```csharp
// bufferSize == 1 used to avoid unnecessary buffer in FileStream
```