Implementation:LaurentMazare Tch rs Translation Lang

Knowledge Sources	LaurentMazare_Tch_rs
Domains	Natural Language Processing, Vocabulary Management, Tokenization
Last Updated	2026-02-08 00:00 GMT

Overview

Manages word-level vocabulary for natural language processing, providing bidirectional word-to-index mapping with special SOS and EOS tokens.

Description

The Lang struct maintains a vocabulary for a single language, mapping between words and integer indices. It uses two hash maps: word_to_index_and_count stores each word's assigned index and its occurrence count, while index_to_word provides reverse lookup from index to word string.

Upon construction via new, the vocabulary is initialized with two special tokens: SOS (Start of Sentence, index 0) and EOS (End of Sentence, index 1). Words are registered through add_word, a helper that is not part of the public signature listed below; it assigns the next sequential index to an unseen word or increments the count of a known word. The add_sentence method splits a string on whitespace and registers each word individually.
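
The registration logic described above can be sketched as follows. This is a minimal stand-in, not the original source: the type name Vocab, the entry accessor, and the literal token strings "SOS" and "EOS" are assumptions made for illustration, and add_word is made public here only so the sketch is easy to exercise.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the vocabulary state (not the original struct).
#[derive(Debug, Default)]
pub struct Vocab {
    // word -> (assigned index, occurrence count)
    word_to_index_and_count: HashMap<String, (usize, usize)>,
    // assigned index -> word, for reverse lookup
    index_to_word: HashMap<usize, String>,
}

impl Vocab {
    /// Creates a vocabulary that already holds SOS (index 0) and EOS (index 1).
    pub fn new() -> Self {
        let mut vocab = Vocab::default();
        vocab.add_word("SOS"); // assumed token string
        vocab.add_word("EOS"); // assumed token string
        vocab
    }

    /// Registers one word: an unseen word takes the next free index with
    /// count 1; a known word keeps its index and its count is incremented.
    pub fn add_word(&mut self, word: &str) {
        let next_index = self.word_to_index_and_count.len();
        match self.word_to_index_and_count.get_mut(word) {
            Some((_, count)) => *count += 1,
            None => {
                self.word_to_index_and_count
                    .insert(word.to_string(), (next_index, 1));
                self.index_to_word.insert(next_index, word.to_string());
            }
        }
    }

    /// Splits on whitespace and registers every word.
    pub fn add_sentence(&mut self, sentence: &str) {
        for word in sentence.split_whitespace() {
            self.add_word(word);
        }
    }

    /// Hypothetical accessor: (index, count) for a word, if present.
    pub fn entry(&self, word: &str) -> Option<(usize, usize)> {
        self.word_to_index_and_count.get(word).copied()
    }

    /// Number of distinct words, including the two special tokens.
    pub fn len(&self) -> usize {
        self.index_to_word.len()
    }
}
```

With this sketch, adding "i am happy" and then "you are happy" to a fresh Vocab leaves "happy" at index 4 with a count of 2, because the second occurrence only increments the counter.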

The struct provides accessors for vocabulary size (len), special token indices (sos_token, eos_token), and word lookup (get_index). The seq_to_string method converts a sequence of indices back to a human-readable space-separated string, which is useful for displaying translation outputs.
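
Decoding can be illustrated with a free function over a plain index-to-word map. This is a sketch under stated assumptions rather than the original method: it takes the map as a parameter instead of reading it from the struct, and it renders an unknown index as "<unk>", a placeholder chosen here because the page does not say how the original handles indices outside the vocabulary.

```rust
use std::collections::HashMap;

/// Decodes a sequence of indices into a space-separated string.
/// Indices missing from the map become "<unk>" (an assumption of this sketch).
pub fn seq_to_string(index_to_word: &HashMap<usize, String>, seq: &[usize]) -> String {
    seq.iter()
        .map(|idx| index_to_word.get(idx).map(String::as_str).unwrap_or("<unk>"))
        .collect::<Vec<&str>>()
        .join(" ")
}
```

Joining with a single space mirrors the space-separated output described above; an empty sequence yields an empty string.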

Usage

Use this vocabulary manager when building tokenizers for seq2seq models. It is designed to be populated during dataset loading and then used for encoding input sentences to index sequences and decoding model outputs back to text.
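
The encoding step might look like the sketch below. The function name encode, the choice to append the EOS index, and the decision to return None when a word is out of vocabulary are all assumptions of this example; the page states only that sentences are encoded to index sequences. A plain HashMap stands in for lookups that would go through get_index.

```rust
use std::collections::HashMap;

/// Encodes a whitespace-separated sentence into word indices and appends
/// the EOS index. Returns None if any word is missing from the vocabulary.
pub fn encode(
    word_to_index: &HashMap<String, usize>,
    sentence: &str,
    eos: usize,
) -> Option<Vec<usize>> {
    let mut indices = sentence
        .split_whitespace()
        .map(|word| word_to_index.get(word).copied())
        .collect::<Option<Vec<usize>>>()?; // short-circuits on the first unknown word
    indices.push(eos);
    Some(indices)
}
```

Failing on unknown words keeps the sketch strict; a real pipeline could instead skip such words or map them to a dedicated unknown token.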

Code Reference

Source Location

Repository: LaurentMazare_Tch_rs
File: examples/translation/lang.rs
Lines: 1-73

Signature

#[derive(Debug)]
pub struct Lang {
    name: String,
    word_to_index_and_count: HashMap<String, (usize, usize)>,
    index_to_word: HashMap<usize, String>,
}

impl Lang {
    pub fn new(name: &str) -> Lang
    pub fn add_sentence(&mut self, sentence: &str)
    pub fn len(&self) -> usize
    pub fn sos_token(&self) -> usize
    pub fn eos_token(&self) -> usize
    pub fn name(&self) -> &str
    pub fn get_index(&self, word: &str) -> Option<usize>
    pub fn seq_to_string(&self, seq: &[usize]) -> String
}

Import

use std::collections::HashMap;

I/O Contract

Input	Type	Description
name	&str	Language identifier (e.g., "eng", "fra")
sentence	&str	Whitespace-separated sentence to add to vocabulary
word	&str	Individual word to look up
seq	&[usize]	Sequence of word indices to decode

Output	Type	Description
len()	usize	Total vocabulary size (including SOS and EOS)
sos_token()	usize	Index of the Start-of-Sentence token (0, because SOS is registered first by new)
eos_token()	usize	Index of the End-of-Sentence token (1, because EOS is registered second by new)
get_index(word)	Option<usize>	Index of the word, or None if not in vocabulary
seq_to_string(seq)	String	Space-separated words decoded from index sequence

Usage Examples

// Assumes lang.rs is declared as a module in the same crate (mod lang;)
use lang::Lang;

// Create a new vocabulary for English
let mut lang = Lang::new("eng");

// Add sentences to build the vocabulary
lang.add_sentence("i am happy");
lang.add_sentence("you are happy");

// Vocabulary includes SOS, EOS, plus all unique words
println!("Vocab size: {}", lang.len()); // 7: SOS, EOS, i, am, happy, you, are

// Look up word indices
let idx = lang.get_index("happy"); // Some(4)

// Get special token indices
let sos = lang.sos_token(); // 0
let eos = lang.eos_token(); // 1

// Decode index sequence back to text
let text = lang.seq_to_string(&[2, 3, 4, 1]);
// "i am happy EOS"

Related Pages

Principle:LaurentMazare_Tch_rs_Vocabulary_Management
