Implementation:LaurentMazare Tch rs Translation Dataset
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Dataset Loading, Sequence to Sequence |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Loads and preprocesses parallel text datasets for sequence-to-sequence translation, including sentence normalization, length filtering, prefix filtering, and vocabulary construction.
Description
The Dataset struct manages paired sentences for translation tasks. It reads tab-separated sentence pairs from a text file (e.g., data/eng-fra.txt) and normalizes each sentence (lowercasing, separating sentence punctuation, and replacing other non-alphanumeric characters with spaces). It then keeps only the pairs that pass two filters: both sentences must fit within the maximum sentence length, and at least one of the two must begin with one of a fixed set of common English prefixes (e.g., "i am", "you are", "he is").
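The loading and filtering steps described above could look roughly like this. It is a minimal sketch rather than the repository's code: the names parse_line and keep_pair are hypothetical, the prefix list is a small illustrative subset, sentences are assumed to be normalized already, and treating max_length as an inclusive bound is an assumption.

```rust
/// Split one line of the data file into a (source, target) pair.
/// Returns None unless the line contains exactly one tab.
fn parse_line(line: &str) -> Option<(&str, &str)> {
    let mut parts = line.split('\t');
    match (parts.next(), parts.next(), parts.next()) {
        (Some(lhs), Some(rhs), None) => Some((lhs, rhs)),
        _ => None,
    }
}

/// Hypothetical subset of the English prefixes used for filtering.
const PREFIXES: [&str; 3] = ["i am ", "you are ", "he is "];

/// Keep a pair when both sentences are short enough and at least one
/// of them starts with a known prefix.
/// Assumption: max_length is treated as an inclusive word-count bound.
fn keep_pair(lhs: &str, rhs: &str, max_length: usize) -> bool {
    let short = |s: &str| s.split_whitespace().count() <= max_length;
    let prefixed = |s: &str| PREFIXES.iter().any(|p| s.starts_with(*p));
    short(lhs) && short(rhs) && (prefixed(lhs) || prefixed(rhs))
}
```

Under these assumptions, parse_line returns the two halves of a well-formed line and None for a line without exactly one tab, while keep_pair rejects any pair in which neither sentence starts with a listed prefix.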
The normalization function normalize lowercases the input and then rewrites it character by character: alphanumeric characters are kept, the punctuation marks !, ., and ? are kept but preceded by a space so that they become separate tokens, and every other character is replaced with a space. The prefix filter restricts the dataset to simple conversational sentences.
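The normalization rule can be illustrated with a short standalone function. This is a sketch written from the description above, not a copy of the source; the name normalize_sketch is hypothetical.

```rust
/// Lowercase the input, keep '!', '.', '?' as separate tokens by putting a
/// space in front of them, and turn every other non-alphanumeric character
/// into a space.
fn normalize_sketch(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 4);
    for c in s.to_lowercase().chars() {
        match c {
            // Sentence punctuation survives as its own token.
            '!' | '.' | '?' => {
                out.push(' ');
                out.push(c);
            }
            // Letters and digits (including accented letters) are kept.
            c if c.is_alphanumeric() => out.push(c),
            // Apostrophes, commas, hyphens, etc. become spaces.
            _ => out.push(' '),
        }
    }
    out
}
```

With this sketch, "I'm cold." becomes "i m cold .", which is why normalized prefixes take forms such as "i m" once the apostrophe has been replaced. Runs of spaces are harmless as long as later steps split on whitespace.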
Two Lang vocabulary objects are built automatically: one for the input language and one for the output language. The pairs method converts sentence pairs into index sequences using their respective vocabularies, appending an EOS token to each sequence. The reverse method swaps input and output languages and flips all sentence pairs, which is used to train translation in the opposite direction (e.g., French-to-English instead of English-to-French).
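The vocabulary lookup and the pair flipping can be sketched as follows. Vocab is a hypothetical stand-in for the example's Lang type, not its real API; reserving index 0 for a start token and index 1 for EOS, and silently skipping unknown words, are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `Lang`: a word-to-index map with two
/// reserved tokens.
struct Vocab {
    word_to_index: HashMap<String, usize>,
}

impl Vocab {
    const EOS: usize = 1; // assumed index of the end-of-sentence token

    fn new() -> Self {
        let mut word_to_index = HashMap::new();
        word_to_index.insert("SOS".to_string(), 0);
        word_to_index.insert("EOS".to_string(), Self::EOS);
        Vocab { word_to_index }
    }

    /// Register every whitespace-separated word not seen before.
    fn add_sentence(&mut self, sentence: &str) {
        for word in sentence.split_whitespace() {
            let next = self.word_to_index.len();
            self.word_to_index.entry(word.to_string()).or_insert(next);
        }
    }

    /// Map known words to indices, skip unknown ones, and append EOS.
    fn to_indexes(&self, sentence: &str) -> Vec<usize> {
        let mut out: Vec<usize> = sentence
            .split_whitespace()
            .filter_map(|w| self.word_to_index.get(w).copied())
            .collect();
        out.push(Self::EOS);
        out
    }
}

/// Flip every (source, target) pair, mirroring what reverse does to the
/// stored sentence list.
fn flip_pairs(pairs: Vec<(String, String)>) -> Vec<(String, String)> {
    pairs.into_iter().map(|(a, b)| (b, a)).collect()
}
```

Because each language gets its own vocabulary, reversing a dataset has to swap the two vocabularies together with the pairs; otherwise the source sentences would be encoded with the wrong word-to-index map.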
Usage
Use this dataset loader for seq2seq translation training. It expects a tab-delimited text file at data/{ilang}-{olang}.txt; the path is relative, so it is resolved against the directory the program is run from. Call reverse() to swap the translation direction. The index pairs returned by pairs() can be fed directly to an encoder-decoder training loop.
Code Reference
Source Location
- Repository: LaurentMazare_Tch_rs
- File: examples/translation/dataset.rs
- Lines: 1-115
Signature
#[derive(Debug)]
pub struct Dataset {
    input_lang: lang::Lang,
    output_lang: lang::Lang,
    pairs: Vec<(String, String)>,
}

impl Dataset {
    pub fn new(ilang: &str, olang: &str, max_length: usize) -> Result<Dataset>
    pub fn input_lang(&self) -> &lang::Lang
    pub fn output_lang(&self) -> &lang::Lang
    pub fn reverse(self) -> Self
    pub fn pairs(&self) -> Vec<(Vec<usize>, Vec<usize>)>
}

fn normalize(s: &str) -> String
fn to_indexes(s: &str, lang: &lang::Lang) -> Vec<usize>
fn filter_prefix(s: &str) -> bool
fn read_pairs(ilang: &str, olang: &str, max_length: usize) -> Result<Vec<(String, String)>>
Import
use super::lang;
use anyhow::{bail, Result};
use std::fs::File;
use std::io::{BufRead, BufReader};
I/O Contract
| Input | Type | Description |
|---|---|---|
| ilang | &str | Input language code (e.g., "eng") |
| olang | &str | Output language code (e.g., "fra") |
| max_length | usize | Maximum number of words per sentence (sentences exceeding this are filtered out) |
| Data file | Text file | Tab-separated sentence pairs at data/{ilang}-{olang}.txt |

| Output | Type | Description |
|---|---|---|
| Dataset.input_lang() | &Lang | Vocabulary for the input language |
| Dataset.output_lang() | &Lang | Vocabulary for the output language |
| Dataset.pairs() | Vec<(Vec<usize>, Vec<usize>)> | Index-encoded sentence pairs with EOS tokens appended |
| Dataset.reverse() | Dataset | New dataset with swapped input/output languages and flipped pairs |
Usage Examples
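// Assumes dataset.rs and lang.rs sit next to this file as sibling modules,
// as the `use super::lang;` import in dataset.rs implies.
mod dataset;
mod lang;

use anyhow::Result;
use dataset::Dataset;

fn main() -> Result<()> {
    // Load English-French dataset with max 10 words per sentence
    let dataset = Dataset::new("eng", "fra", 10)?;
    // Reverse to train French-to-English translation
    let dataset = dataset.reverse();
    // Access vocabularies
    let ilang = dataset.input_lang();
    let olang = dataset.output_lang();
    println!("Input vocab size: {}", ilang.len());
    println!("Output vocab size: {}", olang.len());
    // Get index-encoded pairs for training
    let pairs = dataset.pairs();
    // Illustrative values only; actual indices depend on the data file:
    // pairs[0].0 = [45, 12, 8, 1] (input sentence indices + EOS)
    // pairs[0].1 = [23, 56, 3, 1] (target sentence indices + EOS)
    println!("Number of pairs: {}", pairs.len());
    Ok(())
}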