Implementation: LaurentMazare tch-rs Tokenizer From File
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Processing |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
Concrete tool for loading a SentencePiece BPE tokenizer from a JSON vocabulary file and encoding text to token IDs, provided by the tch-rs LLaMA example.
Description
Tokenizer::from_file parses a HuggingFace-format tokenizer JSON file containing vocabulary (model.vocab) and merge rules (model.merges). Tokenizer::encode converts text to token IDs using BPE with the SentencePiece space prefix convention. The implementation is pure Rust with no external tokenizer dependencies.
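As a rough illustration of the two pieces described above, the sketch below builds a merge-rank table from `model.merges`-style entries (each a space-separated pair whose position is its priority) and applies greedy BPE with the SentencePiece ▁ space-prefix convention. This is a hedged sketch, not the tch-rs example's actual code: the helper names `build_merge_ranks` and `encode` are ours, and unknown pieces are simply dropped, whereas the real implementation may use byte fallback or an unknown-token id.

```rust
use std::collections::HashMap;

// Hedged sketch (not the tch-rs example's actual code): each merges entry
// is a space-separated pair; its position is its priority (lower merges first).
fn build_merge_ranks(merges: &[&str]) -> HashMap<(String, String), usize> {
    merges
        .iter()
        .enumerate()
        .filter_map(|(rank, m)| {
            let (a, b) = m.split_once(' ')?;
            Some(((a.to_string(), b.to_string()), rank))
        })
        .collect()
}

fn encode(
    text: &str,
    vocab: &HashMap<String, usize>,
    ranks: &HashMap<(String, String), usize>,
) -> Vec<usize> {
    // SentencePiece convention: spaces (and the start of text) become ▁ (U+2581).
    let text = format!("▁{}", text.replace(' ', "▁"));
    let mut pieces: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    // Repeatedly apply the lowest-ranked adjacent merge until none applies.
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, position)
        for i in 0..pieces.len().saturating_sub(1) {
            if let Some(&r) = ranks.get(&(pieces[i].clone(), pieces[i + 1].clone())) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        let Some((_, i)) = best else { break };
        pieces[i] = format!("{}{}", pieces[i], pieces[i + 1]);
        pieces.remove(i + 1);
    }
    // Simplification: pieces missing from the vocabulary are dropped here.
    pieces.iter().filter_map(|p| vocab.get(p).copied()).collect()
}

fn main() {
    let vocab = HashMap::from([("▁hi".to_string(), 7)]);
    let ranks = build_merge_ranks(&["h i", "▁ hi"]);
    // "hi" -> "▁hi" -> ["▁","h","i"] -> ["▁","hi"] -> ["▁hi"] -> [7]
    assert_eq!(encode("hi", &vocab, &ranks), vec![7]);
}
```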
Usage
Use to tokenize input prompts for LLaMA text generation. Load the tokenizer JSON file matching the model's training vocabulary.
Code Reference
Source Location
- Repository: tch-rs
- File: examples/llama/sentencepiece.rs
- Lines: 15-76 (from_file), 140-143 (encode)
Signature
```rust
impl Tokenizer {
    pub fn from_file<P: AsRef<std::path::Path>>(path: P) -> Result<Self>
    pub fn encode(&self, s: &str) -> Result<Vec<usize>>
}
```
Import
```rust
// Internal to examples/llama/
mod sentencepiece;
use sentencepiece::Tokenizer;
```
I/O Contract
Inputs (from_file)
| Name | Type | Required | Description |
|---|---|---|---|
| path | P: AsRef<Path> | Yes | Path to tokenizer JSON file (HuggingFace format) |
Inputs (encode)
| Name | Type | Required | Description |
|---|---|---|---|
| s | &str | Yes | Input text to tokenize |
Outputs
| Name | Type | Description |
|---|---|---|
| from_file | Tokenizer | Loaded tokenizer with vocabulary and merge rules |
| encode | Vec<usize> | Sequence of token IDs |
Usage Examples
```rust
use tch::Tensor;

let tokenizer = Tokenizer::from_file("llama-tokenizer.json")?;
let tokens = tokenizer.encode("Once upon a time")?;
println!("Token IDs: {:?}", tokens);

// Convert to an i64 tensor and add a batch dimension for model input.
let ids: Vec<i64> = tokens.iter().map(|&t| t as i64).collect();
let token_tensor = Tensor::from_slice(&ids).unsqueeze(0); // shape: [1, seq_len]
```
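Because `encode` returns `Vec<usize>` while tch tensors index with `i64`, the cast above relies on token IDs fitting in `i64` (always true for realistic vocabularies). For illustration only, a checked variant of that conversion could be factored into a small helper; `to_i64_ids` is a hypothetical name of ours, not part of the tch-rs example:

```rust
// Hypothetical helper (not in the tch-rs example): convert token IDs to the
// i64 values tch tensors expect, failing on overflow instead of wrapping.
fn to_i64_ids(tokens: &[usize]) -> Result<Vec<i64>, std::num::TryFromIntError> {
    tokens.iter().map(|&t| i64::try_from(t)).collect()
}

fn main() {
    let ids = to_i64_ids(&[1, 5, 42]).unwrap();
    assert_eq!(ids, vec![1, 5, 42]);
    println!("{:?}", ids);
}
```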