
Implementation:LaurentMazare Tch rs Tokenizer From File

From Leeroopedia


Knowledge Sources
Domains NLP, Text_Processing
Last Updated 2026-02-08 14:00 GMT

Overview

Concrete tool for loading a SentencePiece BPE tokenizer from a JSON vocabulary file and encoding text to token IDs, provided by the tch-rs LLaMA example.

Description

Tokenizer::from_file parses a HuggingFace-format tokenizer JSON file containing vocabulary (model.vocab) and merge rules (model.merges). Tokenizer::encode converts text to token IDs using BPE with the SentencePiece space prefix convention. The implementation is pure Rust with no external tokenizer dependencies.
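The SentencePiece space prefix convention mentioned above replaces spaces with the U+2581 "▁" character (and conventionally prepends one) before BPE runs, so word boundaries survive tokenization. A minimal sketch of that preprocessing step (the `preprocess` helper is hypothetical, not part of the tch-rs example):

```rust
// Hypothetical sketch of the SentencePiece space-prefix convention:
// spaces become "▁" (U+2581) and one is prepended, so the BPE stage
// can recover word boundaries from the token pieces themselves.
fn preprocess(s: &str) -> String {
    format!("▁{}", s.replace(' ', "▁"))
}

fn main() {
    let normalized = preprocess("Once upon a time");
    println!("{}", normalized); // ▁Once▁upon▁a▁time
    assert_eq!(normalized, "▁Once▁upon▁a▁time");
}
```

The real implementation applies its merge rules to this normalized string; the sketch only shows the normalization.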

Usage

Use to tokenize input prompts for LLaMA text generation. Load the tokenizer JSON file matching the model's training vocabulary.

Code Reference

Source Location

  • Repository: tch-rs
  • File: examples/llama/sentencepiece.rs
  • Lines: 15-76 (from_file), 140-143 (encode)

Signature

impl Tokenizer {
    pub fn from_file<P: AsRef<std::path::Path>>(path: P) -> Result<Self>
    pub fn encode(&self, s: &str) -> Result<Vec<usize>>
}

Import

// Internal to examples/llama/
mod sentencepiece;
use sentencepiece::Tokenizer;

I/O Contract

Inputs (from_file)

Name | Type | Required | Description
path | P: AsRef&lt;Path&gt; | Yes | Path to tokenizer JSON file (HuggingFace format)

Inputs (encode)

Name | Type | Required | Description
s | &str | Yes | Input text to tokenize

Outputs

Name | Type | Description
from_file | Tokenizer | Loaded tokenizer with vocabulary and merge rules
encode | Vec&lt;usize&gt; | Sequence of token IDs
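The merge rules loaded by from_file drive a greedy BPE loop: repeatedly merge the adjacent pair of pieces with the best (lowest) merge rank until no mergeable pair remains. A minimal sketch with a toy merge table (the `bpe` helper and the ranks are illustrative, not the tch-rs code):

```rust
use std::collections::HashMap;

// Hypothetical sketch of a greedy BPE merge loop: start from single
// characters and repeatedly apply the highest-priority (lowest-rank)
// merge rule until none applies.
fn bpe(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut pieces: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest merge rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..pieces.len().saturating_sub(1) {
            let pair = (pieces[i].clone(), pieces[i + 1].clone());
            if let Some(&rank) = ranks.get(&pair) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Replace the pair with its merged piece.
                let merged = format!("{}{}", pieces[i], pieces[i + 1]);
                pieces.splice(i..i + 2, [merged]);
            }
            None => break,
        }
    }
    pieces
}

fn main() {
    // Toy merge table: "l"+"o" merges first, then "lo"+"w".
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);
    let pieces = bpe("low", &ranks);
    println!("{:?}", pieces); // ["low"]
    assert_eq!(pieces, vec!["low"]);
}
```

A production encoder would then map each final piece to its ID via model.vocab; this sketch stops at the pieces.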

Usage Examples

use tch::Tensor;

// The `?` operator requires an enclosing function returning Result.
let tokenizer = Tokenizer::from_file("llama-tokenizer.json")?;
let tokens = tokenizer.encode("Once upon a time")?;
println!("Token IDs: {:?}", tokens);

// Convert to a tensor for model input (token IDs widened to i64).
let ids: Vec<i64> = tokens.iter().map(|&t| t as i64).collect();
let token_tensor = Tensor::from_slice(&ids).unsqueeze(0); // shape: [1, seq_len]

Related Pages

Implements Principle
