Heuristic: ML.NET Tokenizer Caching Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, NLP |
| Last Updated | 2026-02-09 11:00 GMT |
Overview
Cache tokenization results for words of 15 characters or fewer, and use a minimal FileStream buffer to reduce file-locking contention when loading images in parallel.
Description
ML.NET's Tiktoken tokenizer caches encoding results for words up to 15 characters in length. Longer words are re-encoded on every call, because they appear infrequently and caching them would spend memory on entries with low hit rates. In the image pipeline, the image loader opens its FileStream with a buffer size of 1 to avoid allocating an unnecessary internal buffer and to reduce file-locking contention when images are loaded in parallel.
Usage
Use this heuristic when optimizing tokenization throughput for text processing pipelines, or when loading images in multi-threaded scenarios. The 15-character cache limit applies to BPE (Byte Pair Encoding) tokenizers where short common words dominate the vocabulary.
The Insight (Rule of Thumb)
Tokenizer Cache Limit:
- Action: Cache tokenization results for short words only.
- Value: `MaxWordLengthToCache = 15` characters.
- Trade-off: Memory savings from not caching long words vs. recomputation cost. Short words (articles, prepositions, common nouns) are high-frequency and benefit most from caching.
FileStream Buffer Minimization:
- Action: Use `bufferSize = 1` for image file loading to avoid file locking.
- Value: A minimal buffer avoids allocating FileStream's internal buffer and shortens the time the OS file handle is held.
- Trade-off: Slightly less efficient I/O per file, but reduced locking contention in parallel image loading pipelines.
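The read-then-release pattern behind the minimal buffer can be sketched as follows. This is a Python illustration of the intent only; the actual ML.NET code constructs a C# FileStream, and the helper name here is hypothetical:

```python
import io

def load_image_stream(path: str) -> io.BytesIO:
    """Read the file in one shot and release the OS handle immediately.

    Mirrors the intent of a minimal FileStream buffer: skip the
    unnecessary intermediate buffer and hold the file handle (and any
    lock on it) as briefly as possible, so parallel loaders do not
    contend on the same files.
    """
    with open(path, "rb", buffering=0) as f:  # unbuffered binary read
        data = f.read()
    # Decode from memory; no file handle remains open during decoding.
    return io.BytesIO(data)
```

Because the handle is closed before decoding starts, another thread (or the OS) can open, move, or delete the file while the image is still being processed.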
Reasoning
In natural language, word frequency follows Zipf's law: a small number of short words account for the majority of all word occurrences. Words of 15 characters or fewer include virtually all function words and most content words in English and similar languages. Caching these yields excellent hit rates (typically well above 90%) with bounded memory usage. Words longer than 15 characters are typically rare technical terms, URLs, or compound words that seldom repeat.
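The hit-rate claim can be checked with a toy simulation. The vocabulary and Zipf weights below are made up for illustration; only the length gate matches the heuristic:

```python
import random

random.seed(0)

# Toy Zipf-like vocabulary: the word at rank r is drawn with probability ~ 1/r.
vocab = ["the", "of", "and", "to", "in", "is", "tokenization",
         "internationalization", "electroencephalography"]
weights = [1.0 / rank for rank in range(1, len(vocab) + 1)]
corpus = random.choices(vocab, weights=weights, k=10_000)

MAX_WORD_LENGTH_TO_CACHE = 15
cache: set[str] = set()
hits = 0
for word in corpus:
    if len(word) <= MAX_WORD_LENGTH_TO_CACHE:
        if word in cache:
            hits += 1
        else:
            cache.add(word)

hit_rate = hits / len(corpus)
```

Even with only a handful of short words, most occurrences hit the cache, while the rare long words never enter it.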
The FileStream buffer minimization addresses a specific Windows file-locking issue: the default 4 KB buffer keeps the file handle open longer than necessary, causing contention when multiple threads load different images simultaneously.
Code Evidence
Tokenizer cache limit from `src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs:32`:
```csharp
private const int MaxWordLengthToCache = 15;
```
FileStream buffer minimization from `src/Microsoft.ML.ImageAnalytics/ImageLoader.cs:250`:
```csharp
// to avoid locking file, use the construct below to load the image
```
Buffer size from `src/Microsoft.ML.ImageAnalytics/ImageLoader.cs:301`:
```csharp
// bufferSize == 1 used to avoid unnecessary buffer in FileStream
```