Implementation:LaurentMazare Tch rs Translation Lang

From Leeroopedia


Knowledge Sources
Domains: Natural Language Processing, Vocabulary Management, Tokenization
Last Updated: 2026-02-08 00:00 GMT

Overview

Manages word-level vocabulary for natural language processing, providing bidirectional word-to-index mapping with special SOS and EOS tokens.

Description

The Lang struct maintains a vocabulary for a single language, mapping between words and integer indices. It uses two hash maps: word_to_index_and_count stores each word's assigned index and its occurrence count, while index_to_word provides reverse lookup from index to word string.

Upon construction via new, the vocabulary is initialized with two special tokens: SOS (Start of Sentence, index 0) and EOS (End of Sentence, index 1). New words are added via add_word, which assigns the next sequential index to an unseen word and increments the count for a known word. Note that add_word is not among the public methods listed under Signature, so it is presumably a private helper. The add_sentence method splits a string on whitespace and adds each word individually.
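The construction and insertion logic described above can be sketched as follows. This is a minimal illustration, not the tch-rs source: VocabSketch, entry, and word_at are hypothetical names, the literal token strings "SOS" and "EOS" are assumed, and only the two map fields are taken from the signature on this page.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for Lang; the two map fields follow the signature on this page.
#[derive(Debug)]
pub struct VocabSketch {
    word_to_index_and_count: HashMap<String, (usize, usize)>,
    index_to_word: HashMap<usize, String>,
}

impl VocabSketch {
    pub fn new() -> Self {
        let mut vocab = VocabSketch {
            word_to_index_and_count: HashMap::new(),
            index_to_word: HashMap::new(),
        };
        // Registering the special tokens first gives SOS index 0 and EOS index 1.
        // The token strings themselves are an assumption.
        vocab.add_word("SOS");
        vocab.add_word("EOS");
        vocab
    }

    pub fn add_word(&mut self, word: &str) {
        // Known word: only the occurrence count changes.
        if let Some(entry) = self.word_to_index_and_count.get_mut(word) {
            entry.1 += 1;
            return;
        }
        // Unseen word: the next sequential index is the current vocabulary size.
        let index = self.word_to_index_and_count.len();
        self.word_to_index_and_count.insert(word.to_string(), (index, 1));
        self.index_to_word.insert(index, word.to_string());
    }

    pub fn add_sentence(&mut self, sentence: &str) {
        for word in sentence.split_whitespace() {
            self.add_word(word);
        }
    }

    // Hypothetical accessor returning (index, count) for inspection.
    pub fn entry(&self, word: &str) -> Option<(usize, usize)> {
        self.word_to_index_and_count.get(word).copied()
    }

    // Hypothetical accessor for the reverse mapping.
    pub fn word_at(&self, index: usize) -> Option<&str> {
        self.index_to_word.get(&index).map(String::as_str)
    }
}
```

Because indices are handed out in insertion order, the index of a word depends on the order in which sentences are added, so the same corpus must be loaded in the same order to reproduce a vocabulary.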

The struct provides accessors for vocabulary size (len), special token indices (sos_token, eos_token), and word lookup (get_index). The seq_to_string method converts a sequence of indices back to a human-readable space-separated string, which is useful for displaying translation outputs.
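As a rough sketch of the decoding step, the free function below joins the words for a sequence of indices with single spaces. The name decode_sketch is hypothetical, and the "<unk>" placeholder for an index that is not in the map is an assumption made here for illustration; this page does not say how seq_to_string treats unknown indices.

```rust
use std::collections::HashMap;

// Hypothetical free function mirroring what seq_to_string is described to do.
pub fn decode_sketch(index_to_word: &HashMap<usize, String>, seq: &[usize]) -> String {
    seq.iter()
        // Assumption: indices missing from the map are rendered as "<unk>".
        .map(|i| index_to_word.get(i).map(String::as_str).unwrap_or("<unk>"))
        .collect::<Vec<&str>>()
        .join(" ")
}
```

Special tokens are decoded like any other word, so a sequence that ends with the EOS index yields a string that ends with the EOS token text.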

Usage

Use this vocabulary manager when building tokenizers for seq2seq models. It is designed to be populated during dataset loading and then used for encoding input sentences to index sequences and decoding model outputs back to text.
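One possible shape for the encoding step is sketched below. The helper encode_sketch is hypothetical and takes a get_index-style lookup as a closure, so it does not depend on the Lang type itself. Skipping out-of-vocabulary words and appending the EOS index are assumptions about how a caller might use the vocabulary, not behavior documented on this page.

```rust
// Hypothetical helper: turn a sentence into indices using a get_index-style lookup.
pub fn encode_sketch<F>(get_index: F, eos_token: usize, sentence: &str) -> Vec<usize>
where
    F: Fn(&str) -> Option<usize>,
{
    let mut indices: Vec<usize> = sentence
        .split_whitespace()
        // Assumption: words missing from the vocabulary are silently dropped.
        .filter_map(|word| get_index(word))
        .collect();
    // Assumption: every encoded sentence is terminated with the EOS index.
    indices.push(eos_token);
    indices
}
```

With a populated vocabulary, a call could look like encode_sketch(|w| lang.get_index(w), lang.eos_token(), "i am happy").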

Code Reference

Source Location

Signature

#[derive(Debug)]
pub struct Lang {
    name: String,
    word_to_index_and_count: HashMap<String, (usize, usize)>,
    index_to_word: HashMap<usize, String>,
}

impl Lang {
    pub fn new(name: &str) -> Lang
    pub fn add_sentence(&mut self, sentence: &str)
    pub fn len(&self) -> usize
    pub fn sos_token(&self) -> usize
    pub fn eos_token(&self) -> usize
    pub fn name(&self) -> &str
    pub fn get_index(&self, word: &str) -> Option<usize>
    pub fn seq_to_string(&self, seq: &[usize]) -> String
}

Import

use std::collections::HashMap;

I/O Contract

Input               Type            Description
name                &str            Language identifier (e.g., "eng", "fra")
sentence            &str            Whitespace-separated sentence to add to vocabulary
word                &str            Individual word to look up
seq                 &[usize]        Sequence of word indices to decode

Output              Type            Description
len()               usize           Total vocabulary size (including SOS and EOS)
sos_token()         usize           Index of the Start-of-Sentence token (typically 0)
eos_token()         usize           Index of the End-of-Sentence token (typically 1)
get_index(word)     Option<usize>   Index of the word, or None if not in vocabulary
seq_to_string(seq)  String          Space-separated words decoded from index sequence

Usage Examples

use lang::Lang;

// Create a new vocabulary for English
let mut lang = Lang::new("eng");

// Add sentences to build the vocabulary
lang.add_sentence("i am happy");
lang.add_sentence("you are happy");

// Vocabulary includes SOS, EOS, plus all unique words
println!("Vocab size: {}", lang.len()); // 7: SOS, EOS, i, am, happy, you, are

// Look up word indices
let idx = lang.get_index("happy"); // Some(4)

// Get special token indices
let sos = lang.sos_token(); // 0
let eos = lang.eos_token(); // 1

// Decode index sequence back to text
let text = lang.seq_to_string(&[2, 3, 4, 1]);
// "i am happy EOS"
