Implementation: LaurentMazare tch-rs Tokenizer From File
| Knowledge Sources | |
|---|---|
| Domains | NLP, Text_Processing |
| Last Updated | 2026-02-08 14:00 GMT |
Overview
Concrete tool for loading a SentencePiece BPE tokenizer from a JSON vocabulary file and encoding text to token IDs, provided by the tch-rs LLaMA example.
Description
Tokenizer::from_file parses a HuggingFace-format tokenizer JSON file containing vocabulary (model.vocab) and merge rules (model.merges). Tokenizer::encode converts text to token IDs using BPE with the SentencePiece space prefix convention. The implementation is pure Rust with no external tokenizer dependencies.
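As a rough illustration of the two pieces described above, the sketch below builds a merge-rank table from `model.merges`-style entries (each a space-separated pair whose position is its priority) and applies greedy BPE with the SentencePiece ▁ space-prefix convention. This is a hedged sketch, not the tch-rs example's actual code: the helper names `build_merge_ranks` and `encode` are ours, and unknown pieces are simply dropped, whereas the real implementation may use byte fallback or an unknown-token id.

```rust
use std::collections::HashMap;

// Hedged sketch (not the tch-rs example's actual code): each merges entry
// is a space-separated pair; its position is its priority (lower merges first).
fn build_merge_ranks(merges: &[&str]) -> HashMap<(String, String), usize> {
    merges
        .iter()
        .enumerate()
        .filter_map(|(rank, m)| {
            let (a, b) = m.split_once(' ')?;
            Some(((a.to_string(), b.to_string()), rank))
        })
        .collect()
}

fn encode(
    text: &str,
    vocab: &HashMap<String, usize>,
    ranks: &HashMap<(String, String), usize>,
) -> Vec<usize> {
    // SentencePiece convention: spaces (and the start of text) become ▁ (U+2581).
    let text = format!("▁{}", text.replace(' ', "▁"));
    let mut pieces: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    // Repeatedly apply the lowest-ranked adjacent merge until none applies.
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, position)
        for i in 0..pieces.len().saturating_sub(1) {
            if let Some(&r) = ranks.get(&(pieces[i].clone(), pieces[i + 1].clone())) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        let Some((_, i)) = best else { break };
        pieces[i] = format!("{}{}", pieces[i], pieces[i + 1]);
        pieces.remove(i + 1);
    }
    // Simplification: pieces missing from the vocabulary are dropped here.
    pieces.iter().filter_map(|p| vocab.get(p).copied()).collect()
}

fn main() {
    let vocab = HashMap::from([("▁hi".to_string(), 7)]);
    let ranks = build_merge_ranks(&["h i", "▁ hi"]);
    // "hi" -> "▁hi" -> ["▁","h","i"] -> ["▁","hi"] -> ["▁hi"] -> [7]
    assert_eq!(encode("hi", &vocab, &ranks), vec![7]);
}
```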
Usage
Use to tokenize input prompts for LLaMA text generation. Load the tokenizer JSON file matching the model's training vocabulary.
Code Reference
Source Location
- Repository: tch-rs
- File: examples/llama/sentencepiece.rs
- Lines: 15-76 (from_file), 140-143 (encode)
Signature
```rust
impl Tokenizer {
    pub fn from_file<P: AsRef<std::path::Path>>(path: P) -> Result<Self>
    pub fn encode(&self, s: &str) -> Result<Vec<usize>>
}
```
Import
```rust
// Internal to examples/llama/
mod sentencepiece;
use sentencepiece::Tokenizer;
```
I/O Contract
Inputs (from_file)
| Name | Type | Required | Description |
|---|---|---|---|
| path | P: AsRef<Path> | Yes | Path to tokenizer JSON file (HuggingFace format) |
Inputs (encode)
| Name | Type | Required | Description |
|---|---|---|---|
| s | &str | Yes | Input text to tokenize |
Outputs
| Name | Type | Description |
|---|---|---|
| from_file | Tokenizer | Loaded tokenizer with vocabulary and merge rules |
| encode | Vec<usize> | Sequence of token IDs |
Usage Examples
```rust
use tch::Tensor;

let tokenizer = Tokenizer::from_file("llama-tokenizer.json")?;
let tokens = tokenizer.encode("Once upon a time")?;
println!("Token IDs: {:?}", tokens);

// Convert to an i64 tensor and add a batch dimension for model input.
let ids: Vec<i64> = tokens.iter().map(|&t| t as i64).collect();
let token_tensor = Tensor::from_slice(&ids).unsqueeze(0); // shape: [1, seq_len]
```
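Because `encode` returns `Vec<usize>` while tch tensors index with `i64`, the cast above relies on token IDs fitting in `i64` (always true for realistic vocabularies). For illustration only, a checked variant of that conversion could be factored into a small helper; `to_i64_ids` is a hypothetical name of ours, not part of the tch-rs example:

```rust
// Hypothetical helper (not in the tch-rs example): convert token IDs to the
// i64 values tch tensors expect, failing on overflow instead of wrapping.
fn to_i64_ids(tokens: &[usize]) -> Result<Vec<i64>, std::num::TryFromIntError> {
    tokens.iter().map(|&t| i64::try_from(t)).collect()
}

fn main() {
    let ids = to_i64_ids(&[1, 5, 42]).unwrap();
    assert_eq!(ids, vec![1, 5, 42]);
    println!("{:?}", ids);
}
```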