Implementation:LaurentMazare Tch rs Translation Lang
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Vocabulary Management, Tokenization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Manages word-level vocabulary for natural language processing, providing bidirectional word-to-index mapping with special SOS and EOS tokens.
Description
The Lang struct maintains a vocabulary for a single language, mapping between words and integer indices. It uses two hash maps: word_to_index_and_count stores each word's assigned index and its occurrence count, while index_to_word provides reverse lookup from index to word string.
Upon construction via new, the vocabulary is initialized with two special tokens: SOS (Start of Sentence, index 0) and EOS (End of Sentence, index 1). New words are added via add_word, a helper that does not appear among the public methods in the signature below; it assigns the next sequential index to unseen words or increments the count for known words. The add_sentence method splits a string on whitespace and adds each word individually.
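The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the file's actual code: the type name Vocab and the count accessor are hypothetical, the field names are taken from the signature, and the method bodies are assumptions reconstructed from the description.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for `Lang`; field names follow the signature,
// method bodies are assumptions made for illustration.
#[derive(Debug)]
pub struct Vocab {
    // word -> (assigned index, occurrence count)
    word_to_index_and_count: HashMap<String, (usize, usize)>,
    // assigned index -> word, used for decoding
    index_to_word: HashMap<usize, String>,
}

impl Vocab {
    pub fn new() -> Vocab {
        let mut vocab = Vocab {
            word_to_index_and_count: HashMap::new(),
            index_to_word: HashMap::new(),
        };
        // Adding the special tokens first gives them indices 0 and 1.
        vocab.add_word("SOS");
        vocab.add_word("EOS");
        vocab
    }

    fn add_word(&mut self, word: &str) {
        // The next free index equals the number of distinct words seen so far.
        let next_index = self.word_to_index_and_count.len();
        match self.word_to_index_and_count.get_mut(word) {
            // Known word: only the count changes.
            Some((_, count)) => *count += 1,
            // Unseen word: record it in both maps.
            None => {
                self.word_to_index_and_count
                    .insert(word.to_string(), (next_index, 1));
                self.index_to_word.insert(next_index, word.to_string());
            }
        }
    }

    pub fn add_sentence(&mut self, sentence: &str) {
        // Whitespace splitting, as stated in the description.
        for word in sentence.split_whitespace() {
            self.add_word(word);
        }
    }

    pub fn len(&self) -> usize {
        self.index_to_word.len()
    }

    pub fn get_index(&self, word: &str) -> Option<usize> {
        self.word_to_index_and_count.get(word).map(|entry| entry.0)
    }

    // Hypothetical accessor, added here only to make the counts observable.
    pub fn count(&self, word: &str) -> Option<usize> {
        self.word_to_index_and_count.get(word).map(|entry| entry.1)
    }
}
```

With this sketch, adding "i am happy" and then "you are happy" leaves seven entries in the vocabulary, and the count stored for "happy" is 2 while its index stays at 4.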
The struct provides accessors for vocabulary size (len), special token indices (sos_token, eos_token), and word lookup (get_index). The seq_to_string method converts a sequence of indices back to a human-readable space-separated string, which is useful for displaying translation outputs.
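Decoding amounts to a reverse lookup followed by a join. The sketch below illustrates that idea with a free function over an index-to-word map. The helper sample_table and the "<unk>" fallback for unknown indices are assumptions of this example; the source does not say how the real seq_to_string treats an index that is not in the vocabulary.

```rust
use std::collections::HashMap;

// Hypothetical free function mirroring the described behaviour of
// `seq_to_string`. Rendering unknown indices as "<unk>" is an assumption.
pub fn seq_to_string(index_to_word: &HashMap<usize, String>, seq: &[usize]) -> String {
    seq.iter()
        .map(|i| index_to_word.get(i).map(String::as_str).unwrap_or("<unk>"))
        .collect::<Vec<&str>>()
        .join(" ")
}

// Hypothetical helper building a small table: SOS=0, EOS=1, i=2, am=3, happy=4.
pub fn sample_table() -> HashMap<usize, String> {
    ["SOS", "EOS", "i", "am", "happy"]
        .iter()
        .enumerate()
        .map(|(index, word)| (index, word.to_string()))
        .collect()
}
```

Calling seq_to_string(&sample_table(), &[2, 3, 4, 1]) yields "i am happy EOS"; special tokens are decoded like any other word, so callers that want clean display text may need to filter them out first.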
Usage
Use this vocabulary manager when building tokenizers for seq2seq models. It is designed to be populated during dataset loading and then used for encoding input sentences to index sequences and decoding model outputs back to text.
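As an illustration of the encoding step, the sketch below turns a sentence into an index sequence terminated by EOS. The function encode_sentence is hypothetical and not part of lang.rs; the lookup closure stands in for a method such as get_index, and silently dropping out-of-vocabulary words is an assumption of this example rather than documented behaviour.

```rust
// Hypothetical helper: map each word to its index and append the EOS index.
// `lookup` stands in for a method such as `Lang::get_index`.
// Dropping words that are not in the vocabulary is an assumption of this sketch.
pub fn encode_sentence<F>(sentence: &str, eos_token: usize, lookup: F) -> Vec<usize>
where
    F: Fn(&str) -> Option<usize>,
{
    let mut indices: Vec<usize> = sentence
        .split_whitespace()
        .filter_map(|word| lookup(word))
        .collect();
    // Mark the end of the sequence for the decoder.
    indices.push(eos_token);
    indices
}
```

With a populated vocabulary, a call would look like encode_sentence("i am happy", lang.eos_token(), |w| lang.get_index(w)). Taking a closure keeps the helper independent of the vocabulary type, so the same function can be used for both the source and target language.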
Code Reference
Source Location
- Repository: LaurentMazare_Tch_rs
- File: examples/translation/lang.rs
- Lines: 1-73
Signature
#[derive(Debug)]
pub struct Lang {
name: String,
word_to_index_and_count: HashMap<String, (usize, usize)>,
index_to_word: HashMap<usize, String>,
}
impl Lang {
pub fn new(name: &str) -> Lang
pub fn add_sentence(&mut self, sentence: &str)
pub fn len(&self) -> usize
pub fn sos_token(&self) -> usize
pub fn eos_token(&self) -> usize
pub fn name(&self) -> &str
pub fn get_index(&self, word: &str) -> Option<usize>
pub fn seq_to_string(&self, seq: &[usize]) -> String
}
Import
use std::collections::HashMap;
I/O Contract
| Input | Type | Description |
|---|---|---|
| name | &str | Language identifier (e.g., "eng", "fra") |
| sentence | &str | Whitespace-separated sentence to add to vocabulary |
| word | &str | Individual word to look up |
| seq | &[usize] | Sequence of word indices to decode |
| Output | Type | Description |
|---|---|---|
| len() | usize | Total vocabulary size (including SOS and EOS) |
| sos_token() | usize | Index of the Start-of-Sentence token (typically 0) |
| eos_token() | usize | Index of the End-of-Sentence token (typically 1) |
| get_index(word) | Option<usize> | Index of the word, or None if not in vocabulary |
| seq_to_string(seq) | String | Space-separated words decoded from index sequence |
Usage Examples
use lang::Lang;
// Create a new vocabulary for English
let mut lang = Lang::new("eng");
// Add sentences to build the vocabulary
lang.add_sentence("i am happy");
lang.add_sentence("you are happy");
// Vocabulary includes SOS, EOS, plus all unique words
println!("Vocab size: {}", lang.len()); // 7: SOS, EOS, i, am, happy, you, are
// Look up word indices
let idx = lang.get_index("happy"); // Some(4)
// Get special token indices
let sos = lang.sos_token(); // 0
let eos = lang.eos_token(); // 1
// Decode index sequence back to text
let text = lang.seq_to_string(&[2, 3, 4, 1]);
// "i am happy EOS"