Implementation:LaurentMazare Tch rs Translation Dataset


Knowledge Sources
Domains: Natural Language Processing, Dataset Loading, Sequence to Sequence
Last Updated: 2026-02-08 00:00 GMT

Overview

Loads and preprocesses parallel text datasets for sequence-to-sequence translation, including sentence normalization, length filtering, prefix filtering, and vocabulary construction.

Description

The Dataset struct manages paired sentences for translation tasks. It reads tab-separated sentence pairs from a text file (e.g., data/eng-fra.txt), normalizes each sentence (lowercasing, separating sentence-ending punctuation, and replacing other non-alphanumeric characters with spaces), and keeps a pair only if both sentences are within the maximum length and at least one of them begins with one of a set of common English prefixes (e.g., "i am", "you are", "he is").
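The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the library's code: the function names, the three-entry prefix list, the inclusive length comparison, and the choice to return None for malformed lines are all assumptions made for the example, and the input is assumed to be already normalized.

```rust
/// Hypothetical prefix list; the page only names these three as examples.
const PREFIXES: [&str; 3] = ["i am ", "you are ", "he is "];

/// Returns true when the sentence starts with one of the assumed prefixes.
fn starts_with_known_prefix(s: &str) -> bool {
    PREFIXES.iter().any(|p| s.starts_with(p))
}

/// Splits a line at its first tab and keeps the pair only if both sides
/// have at most `max_length` words and at least one side matches a prefix.
/// Lines without a tab yield `None` here (a simplification for the sketch).
fn parse_and_filter(line: &str, max_length: usize) -> Option<(String, String)> {
    let (lhs, rhs) = line.split_once('\t')?;
    // Word count is taken on whitespace boundaries; the exact comparison
    // used by the real loader is an assumption.
    let short_enough = |s: &str| s.split_whitespace().count() <= max_length;
    if short_enough(lhs)
        && short_enough(rhs)
        && (starts_with_known_prefix(lhs) || starts_with_known_prefix(rhs))
    {
        Some((lhs.to_string(), rhs.to_string()))
    } else {
        None
    }
}
```

Calling parse_and_filter on each line of the data file and collecting the Some results would give the filtered list of string pairs.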

The normalize function lowercases every character, inserts a space before each of the punctuation marks !, ., and ? so that they become separate tokens, and replaces every other non-alphanumeric character with a space. The prefix filter restricts the dataset to simple conversational sentences.
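A minimal sketch of the normalization rules described above, written from the prose on this page rather than copied from the source file. The name normalize_sketch is hypothetical.

```rust
/// Lowercases the input, puts a space before '!', '.', and '?',
/// and turns every other non-alphanumeric character into a space.
fn normalize_sketch(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.to_lowercase().chars() {
        match c {
            // Keep sentence-ending punctuation as its own token.
            '!' | '.' | '?' => {
                out.push(' ');
                out.push(c);
            }
            // Letters and digits pass through unchanged.
            c if c.is_alphanumeric() => out.push(c),
            // Everything else (commas, apostrophes, whitespace) becomes a space.
            _ => out.push(' '),
        }
    }
    out
}
```

Under these rules, "I'm OK." becomes "i m ok ." and splitting the result on whitespace gives the tokens i, m, ok, and the period. Consecutive spaces can appear in the output, which is harmless when tokenizing with split_whitespace.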

Two Lang vocabulary objects are built automatically: one for the input language and one for the output language. The pairs method converts each sentence pair into a pair of index sequences using the respective vocabularies, appending an EOS token to each sequence. The reverse method swaps the input and output languages and flips every sentence pair, which is useful for training translation in the opposite direction (e.g., French-to-English instead of English-to-French).
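To illustrate how a sentence might be turned into an index sequence with a trailing EOS token, here is a small sketch. The Vocab type is a hypothetical stand-in for lang::Lang, whose internals are not shown on this page. The reserved indices (0 and 1) and the choice to skip unknown words are assumptions for the example.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `lang::Lang`.
struct Vocab {
    word_to_index: HashMap<String, usize>,
    next_index: usize,
}

impl Vocab {
    /// Assumed reserved index for the end-of-sequence marker.
    const EOS: usize = 1;

    fn new() -> Self {
        // Indices 0 and 1 are assumed to be reserved for start/end markers.
        Vocab { word_to_index: HashMap::new(), next_index: 2 }
    }

    /// Registers every previously unseen word of the sentence.
    fn add_sentence(&mut self, sentence: &str) {
        for word in sentence.split_whitespace() {
            if !self.word_to_index.contains_key(word) {
                self.word_to_index.insert(word.to_string(), self.next_index);
                self.next_index += 1;
            }
        }
    }

    /// Maps known words to indices (unknown words are skipped in this sketch)
    /// and appends the EOS index.
    fn encode(&self, sentence: &str) -> Vec<usize> {
        let mut indexes: Vec<usize> = sentence
            .split_whitespace()
            .filter_map(|w| self.word_to_index.get(w).copied())
            .collect();
        indexes.push(Self::EOS);
        indexes
    }
}
```

One vocabulary of this kind would be filled from the input side of every pair and another from the output side, after which each sentence is encoded with its own vocabulary.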

Usage

Use this dataset loader for seq2seq translation training. It expects a tab-delimited text file at data/{ilang}-{olang}.txt, where ilang and olang are the arguments passed to Dataset::new. Call reverse() to swap the translation direction. The index pairs returned by pairs() can be used directly to train encoder-decoder models.
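The relationship between the file path and the direction swap can be sketched as follows. ToyDataset and data_path are hypothetical names for illustration, and plain language codes stand in for the Lang vocabularies held by the real struct.

```rust
/// Hypothetical, simplified stand-in for `Dataset`: language codes
/// replace the `Lang` vocabularies.
#[derive(Debug, PartialEq)]
struct ToyDataset {
    input_lang: String,
    output_lang: String,
    pairs: Vec<(String, String)>,
}

impl ToyDataset {
    /// Consumes the dataset and returns one that translates the other way:
    /// the languages are swapped and every pair is flipped.
    fn reverse(self) -> Self {
        ToyDataset {
            input_lang: self.output_lang,
            output_lang: self.input_lang,
            pairs: self.pairs.into_iter().map(|(a, b)| (b, a)).collect(),
        }
    }
}

/// Builds the data file path from the two language codes.
fn data_path(ilang: &str, olang: &str) -> String {
    format!("data/{}-{}.txt", ilang, olang)
}
```

Because the path is derived from the language codes in the order given, a French-to-English dataset backed by data/eng-fra.txt would be loaded as ("eng", "fra") and then reversed, rather than loaded as ("fra", "eng").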

Code Reference

Source Location

Signature

#[derive(Debug)]
pub struct Dataset {
    input_lang: lang::Lang,
    output_lang: lang::Lang,
    pairs: Vec<(String, String)>,
}

impl Dataset {
    pub fn new(ilang: &str, olang: &str, max_length: usize) -> Result<Dataset>
    pub fn input_lang(&self) -> &lang::Lang
    pub fn output_lang(&self) -> &lang::Lang
    pub fn reverse(self) -> Self
    pub fn pairs(&self) -> Vec<(Vec<usize>, Vec<usize>)>
}

fn normalize(s: &str) -> String
fn to_indexes(s: &str, lang: &lang::Lang) -> Vec<usize>
fn filter_prefix(s: &str) -> bool
fn read_pairs(ilang: &str, olang: &str, max_length: usize) -> Result<Vec<(String, String)>>

Import

use super::lang;
use anyhow::{bail, Result};
use std::fs::File;
use std::io::{BufRead, BufReader};

I/O Contract

Input | Type | Description
ilang | &str | Input language code (e.g., "eng")
olang | &str | Output language code (e.g., "fra")
max_length | usize | Maximum number of words per sentence (sentences exceeding this are filtered out)
Data file | Text file | Tab-separated sentence pairs at data/{ilang}-{olang}.txt

Output | Type | Description
Dataset.input_lang() | &Lang | Vocabulary for the input language
Dataset.output_lang() | &Lang | Vocabulary for the output language
Dataset.pairs() | Vec<(Vec<usize>, Vec<usize>)> | Index-encoded sentence pairs with EOS tokens appended
Dataset.reverse() | Dataset | New dataset with swapped input/output languages and flipped pairs

Usage Examples

// Assumes the dataset and lang modules are declared in this crate.
use anyhow::Result;
use dataset::Dataset;

fn main() -> Result<()> {
    // Load the English-French dataset with at most 10 words per sentence
    let dataset = Dataset::new("eng", "fra", 10)?;

    // Reverse to train French-to-English translation
    let dataset = dataset.reverse();

    // Access vocabularies
    let ilang = dataset.input_lang();
    let olang = dataset.output_lang();
    println!("Input vocab size: {}", ilang.len());
    println!("Output vocab size: {}", olang.len());

    // Get index-encoded pairs for training
    let pairs = dataset.pairs();
    // Illustrative values only; actual indices depend on the data file:
    // pairs[0].0 = [45, 12, 8, 1]  (input sentence indices + EOS)
    // pairs[0].1 = [23, 56, 3, 1]  (target sentence indices + EOS)
    println!("Number of pairs: {}", pairs.len());
    Ok(())
}
