Implementation:LaurentMazare Tch rs Translation Dataset

Knowledge Sources	LaurentMazare_Tch_rs
Domains	Natural Language Processing, Dataset Loading, Sequence to Sequence
Last Updated	2026-02-08 00:00 GMT

Overview

Loads and preprocesses parallel text datasets for sequence-to-sequence translation, including sentence normalization, length filtering, prefix filtering, and vocabulary construction.

Description

The Dataset struct manages paired sentences for translation tasks. It reads tab-separated sentence pairs from a text file (e.g., data/eng-fra.txt), normalizes each sentence (lowercasing, separating sentence-final punctuation, and replacing other non-alphanumeric characters with spaces), and keeps only pairs whose sentences fit within the maximum length and in which either sentence begins with one of a fixed set of common English prefixes (e.g., "i am", "you are", "he is").
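
The filtering step described above can be sketched as follows. This is a minimal illustration, not the code from dataset.rs: the names PREFIXES, has_known_prefix, and keep_pair are hypothetical, the prefix list is abbreviated, and treating the length limit as inclusive is an assumption.

```rust
// Illustrative, abbreviated prefix list; the list in dataset.rs may differ.
const PREFIXES: [&str; 6] = [
    "i am ", "you are ", "he is ", "she is ", "we are ", "they are ",
];

/// Returns true if the (already normalized) sentence starts with a known prefix.
fn has_known_prefix(s: &str) -> bool {
    PREFIXES.iter().any(|p| s.starts_with(p))
}

/// Keeps a pair when both sides are short enough and at least one side
/// starts with a known prefix.
/// Assumption: the limit is inclusive here; the original may compare strictly.
fn keep_pair(lhs: &str, rhs: &str, max_length: usize) -> bool {
    let short = |s: &str| s.split_whitespace().count() <= max_length;
    short(lhs) && short(rhs) && (has_known_prefix(lhs) || has_known_prefix(rhs))
}
```

Under these assumptions, keep_pair("i am cold .", "j ai froid .", 10) accepts the pair, while a pair such as ("go .", "va !") is rejected because neither side matches a prefix.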

The normalize function lowercases its input, inserts a space before each of the punctuation marks !, ., and ? so that they become separate tokens, and replaces every other non-alphanumeric character with a space. The prefix filter restricts the dataset to simple conversational sentences.
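
A minimal sketch of a normalization function matching this description is shown below. It is written from the prose above rather than copied from dataset.rs, so details such as how non-ASCII letters are handled are assumptions.

```rust
/// Lowercases the input, puts a space before '!', '.' and '?',
/// and replaces every other non-alphanumeric character with a space.
fn normalize(s: &str) -> String {
    s.to_lowercase()
        .chars()
        .map(|c| match c {
            // Detach sentence punctuation so it becomes its own token.
            '!' | '.' | '?' => format!(" {}", c),
            // Letters and digits (including accented letters) pass through.
            c if c.is_alphanumeric() => c.to_string(),
            // Everything else, such as apostrophes or commas, becomes a space.
            _ => " ".to_string(),
        })
        .collect()
}
```

With this sketch, normalize("I'm cold!") yields "i m cold !", which explains why prefixes are matched against forms like "i am" after normalization.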

Two Lang vocabulary objects are built automatically: one for the input language and one for the output language. The pairs method converts sentence pairs into index sequences using their respective vocabularies, appending an EOS token to each sequence. The reverse method swaps input and output languages and flips all sentence pairs, which is used to train translation in the opposite direction (e.g., French-to-English instead of English-to-French).
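
The index encoding and the pair flipping can be illustrated with a small stand-in vocabulary. Vocab, encode, and flip are hypothetical names and not the lang::Lang API; reserving index 1 for EOS and silently dropping unknown words are assumptions made for this sketch.

```rust
use std::collections::HashMap;

// Assumption: indices 0 and 1 are reserved for SOS and EOS.
const EOS: usize = 1;

/// Hypothetical stand-in for lang::Lang, for illustration only.
struct Vocab {
    word_to_index: HashMap<String, usize>,
}

impl Vocab {
    fn new() -> Self {
        Vocab { word_to_index: HashMap::new() }
    }

    /// Assigns the next free index (after the two reserved ones) to each new word.
    fn add_sentence(&mut self, s: &str) {
        for w in s.split_whitespace() {
            let next = self.word_to_index.len() + 2;
            self.word_to_index.entry(w.to_string()).or_insert(next);
        }
    }

    /// Maps known words to indices, skips unknown words, and appends EOS.
    fn encode(&self, s: &str) -> Vec<usize> {
        let mut out: Vec<usize> = s
            .split_whitespace()
            .filter_map(|w| self.word_to_index.get(w).copied())
            .collect();
        out.push(EOS);
        out
    }
}

/// Swaps the two sides of every pair, as a reversed dataset would.
fn flip(pairs: Vec<(String, String)>) -> Vec<(String, String)> {
    pairs.into_iter().map(|(a, b)| (b, a)).collect()
}
```

Reversing a dataset then amounts to applying flip to the stored pairs and exchanging the two vocabularies, so each side is still encoded with the vocabulary built from its own language.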

Usage

Use this dataset loader for seq2seq translation training. It expects a tab-delimited text file at data/{ilang}-{olang}.txt. Call reverse() to swap translation direction. The returned index pairs can be directly used for training encoder-decoder models.
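
For reference, the expected file layout is one sentence pair per line with a single tab between the two sentences. The sketch below parses that layout from any buffered reader using only the standard library. parse_pairs is a hypothetical helper, and returning an error for a malformed line (rather than skipping it) is an assumption.

```rust
use std::io::{self, BufRead};

/// Parses lines of the form "<source>\t<target>" into string pairs.
/// Assumption: a line without exactly one tab is treated as an error.
fn parse_pairs<R: BufRead>(reader: R) -> io::Result<Vec<(String, String)>> {
    let mut pairs = Vec::new();
    for line in reader.lines() {
        let line = line?;
        match line.split('\t').collect::<Vec<_>>().as_slice() {
            [lhs, rhs] => pairs.push((lhs.to_string(), rhs.to_string())),
            _ => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("expected exactly one tab in line: {}", line),
                ))
            }
        }
    }
    Ok(pairs)
}
```

A file opened with std::fs::File can be passed in by wrapping it in std::io::BufReader; an in-memory byte slice works as well, which is convenient for tests.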

Code Reference

Source Location

Repository: LaurentMazare_Tch_rs
File: examples/translation/dataset.rs
Lines: 1-115

Signature

#[derive(Debug)]
pub struct Dataset {
    input_lang: lang::Lang,
    output_lang: lang::Lang,
    pairs: Vec<(String, String)>,
}

impl Dataset {
    pub fn new(ilang: &str, olang: &str, max_length: usize) -> Result<Dataset>
    pub fn input_lang(&self) -> &lang::Lang
    pub fn output_lang(&self) -> &lang::Lang
    pub fn reverse(self) -> Self
    pub fn pairs(&self) -> Vec<(Vec<usize>, Vec<usize>)>
}

fn normalize(s: &str) -> String
fn to_indexes(s: &str, lang: &lang::Lang) -> Vec<usize>
fn filter_prefix(s: &str) -> bool
fn read_pairs(ilang: &str, olang: &str, max_length: usize) -> Result<Vec<(String, String)>>

Import

use super::lang;
use anyhow::{bail, Result};
use std::fs::File;
use std::io::{BufRead, BufReader};

I/O Contract

Input	Type	Description
ilang	&str	Input language code (e.g., "eng")
olang	&str	Output language code (e.g., "fra")
max_length	usize	Maximum number of words per sentence (sentences exceeding this are filtered out)
Data file	Text file	Tab-separated sentence pairs at data/{ilang}-{olang}.txt

Output	Type	Description
Dataset.input_lang()	&Lang	Vocabulary for the input language
Dataset.output_lang()	&Lang	Vocabulary for the output language
Dataset.pairs()	Vec<(Vec<usize>, Vec<usize>)>	Index-encoded sentence pairs with EOS tokens appended
Dataset.reverse()	Dataset	New dataset with swapped input/output languages and flipped pairs

Usage Examples

use dataset::Dataset;

// Load English-French dataset with max 10 words per sentence
let dataset = Dataset::new("eng", "fra", 10)?;

// Reverse to train French-to-English translation
let dataset = dataset.reverse();

// Access vocabularies
let ilang = dataset.input_lang();
let olang = dataset.output_lang();
println!("Input vocab size: {}", ilang.len());
println!("Output vocab size: {}", olang.len());

// Get index-encoded pairs for training
let pairs = dataset.pairs();
// Illustrative values only; actual indices depend on the vocabulary.
// pairs[0].0 = [45, 12, 8, 1]  (input sentence indices + EOS)
// pairs[0].1 = [23, 56, 3, 1]  (target sentence indices + EOS)

Related Pages

Principle:LaurentMazare_Tch_rs_Seq2Seq_Dataset_Loading
