Implementation:Huggingface Datasets Translation

From Leeroopedia

Overview

The translation module defines two feature types for representing multilingual translation data in datasets: Translation, for a fixed set of languages, and TranslationVariableLanguages, for a set of languages that varies per example. Both are implemented as dataclasses and use a PyArrow struct type for Arrow-compatible storage. These feature types exist for compatibility with TensorFlow Datasets (tfds). Both support flattening and Arrow schema generation; TranslationVariableLanguages additionally provides example encoding.

Source File

Property Value
Repository huggingface/datasets
File src/datasets/features/translation.py
Lines 129
Domain NLP, Multilingual

Import

from datasets import Translation, TranslationVariableLanguages

Class: Translation

Type: @dataclass

A feature type for translations with a fixed set of languages per example. Every example must contain translations for all specified languages.

Constructor

@dataclass
class Translation:
    languages: list[str]
    id: Optional[str] = field(default=None, repr=False)

Parameter Type Default Description
languages list[str] (required) List of language codes (e.g., ['en', 'fr', 'de'])
id Optional[str] None Optional identifier for the feature

Class Variables

Variable Value Description
dtype "dict" String representation of the decoded type
pa_type None Set to None; the actual type is generated dynamically by __call__
_type "Translation" Internal type identifier

Methods

__call__()

Returns a pa.struct with one pa.string() field per language, sorted alphabetically.

def __call__(self):
    return pa.struct({lang: pa.string() for lang in sorted(self.languages)})

flatten()

Flattens the feature into a dictionary mapping each language code to Value("string"), sorted alphabetically.

def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
    from .features import Value
    return {k: Value("string") for k in sorted(self.languages)}
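
Both __call__ and flatten iterate over sorted(self.languages), so the field order depends only on the language codes, not on the order in which they were declared. The following is a minimal plain-Python sketch of that ordering, together with a pre-check for the requirement that every example covers all declared languages. The helpers field_order and missing_languages are hypothetical names used for illustration; they are not part of the datasets library.

```python
def field_order(languages):
    """Return language codes in the order the struct fields would appear."""
    return sorted(languages)


def missing_languages(example, languages):
    """List declared languages that an example dict does not provide."""
    return sorted(set(languages) - set(example))


declared = ["en", "fr", "de"]

# Declaration order does not matter: the fields come out alphabetically.
print(field_order(declared))  # ['de', 'en', 'fr']

# A fixed-language example is expected to cover every declared language.
incomplete = {"en": "the cat", "fr": "le chat"}
print(missing_languages(incomplete, declared))  # ['de']
```

A check like missing_languages could be run during data generation to catch incomplete examples early, before they reach Arrow conversion.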

Class: TranslationVariableLanguages

Type: @dataclass

A feature type for translations with a variable set of languages per example. Not every example needs to contain all languages, and a single language can have multiple translations.

Constructor

@dataclass
class TranslationVariableLanguages:
    languages: Optional[list] = None
    num_languages: Optional[int] = None
    id: Optional[str] = field(default=None, repr=False)

Parameter Type Default Description
languages Optional[list] None List of valid language codes. Sorted and deduplicated in __post_init__.
num_languages Optional[int] None Number of distinct languages. Always recomputed in __post_init__, so a value passed to the constructor is overwritten.
id Optional[str] None Optional identifier for the feature.

Post-Initialization

The __post_init__ method deduplicates and sorts the language list, then sets num_languages to its length. If languages is None or empty, both attributes end up as None:

def __post_init__(self):
    self.languages = sorted(set(self.languages)) if self.languages else None
    self.num_languages = len(self.languages) if self.languages else None
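
The effect of this normalization can be seen in isolation. The sketch below uses a stand-in dataclass, LanguageSetSketch (a hypothetical name, not the library class), which copies only the two lines quoted above so that it runs without datasets installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageSetSketch:
    """Stand-in that copies only the normalization logic quoted above."""

    languages: Optional[list] = None
    num_languages: Optional[int] = None

    def __post_init__(self):
        # Deduplicate and sort; an empty or missing list becomes None.
        self.languages = sorted(set(self.languages)) if self.languages else None
        # Always recomputed, so a caller-supplied value is discarded.
        self.num_languages = len(self.languages) if self.languages else None


feature = LanguageSetSketch(languages=["fr", "en", "fr", "de"], num_languages=99)
print(feature.languages)      # ['de', 'en', 'fr']
print(feature.num_languages)  # 3

empty = LanguageSetSketch(languages=[])
print(empty.languages, empty.num_languages)  # None None
```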

Class Variables

Variable Value Description
dtype "dict" String representation of the decoded type
pa_type None Set to None; the actual type is generated dynamically by __call__
_type "TranslationVariableLanguages" Internal type identifier

Methods

__call__()

Returns a pa.struct with two list fields: language (list of strings) and translation (list of strings).

def __call__(self):
    return pa.struct({"language": pa.list_(pa.string()), "translation": pa.list_(pa.string())})

encode_example(translation_dict)

Encodes a translation dictionary into the variable-language format. This method:

  1. Returns the input unchanged if its keys are exactly language and translation, since it is then treated as already encoded.
  2. Otherwise, if a language list was configured, raises a ValueError when the input contains language codes outside that list.
  3. Expands each entry whose value is a list of strings into one (language, text) pair per translation.
  4. Sorts the pairs in ascending order by language code; pairs that share a language are ordered by translation text.
  5. Returns a dictionary with parallel language and translation sequences (tuples, as produced by zip).

def encode_example(self, translation_dict):
    lang_set = set(self.languages)
    if set(translation_dict) == {"language", "translation"}:
        return translation_dict
    elif self.languages and set(translation_dict) - lang_set:
        raise ValueError(
            f"Some languages in example ({', '.join(sorted(set(translation_dict) - lang_set))}) are not in valid set ({', '.join(lang_set)})."
        )

    translation_tuples = []
    for lang, text in translation_dict.items():
        if isinstance(text, str):
            translation_tuples.append((lang, text))
        else:
            translation_tuples.extend([(lang, el) for el in text])

    languages, translations = zip(*sorted(translation_tuples))
    return {"language": languages, "translation": translations}
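
The encoded form stores languages and translations as two parallel sequences, so a consumer that wants the translations grouped per language has to pair them up again. The following is a minimal sketch of that regrouping. The helper group_by_language is a hypothetical name and is not provided by the datasets library.

```python
from collections import defaultdict


def group_by_language(encoded):
    """Regroup parallel language/translation sequences into a dict of lists."""
    grouped = defaultdict(list)
    # Position i in "language" labels position i in "translation".
    for lang, text in zip(encoded["language"], encoded["translation"]):
        grouped[lang].append(text)
    return dict(grouped)


# Shape of an encoded example: one entry per (language, translation) pair.
encoded = {
    "language": ("en", "fr", "fr"),
    "translation": ("the cat", "la chatte", "le chat"),
}
print(group_by_language(encoded))
# {'en': ['the cat'], 'fr': ['la chatte', 'le chat']}
```

Because the pairs are sorted before being split, the translations within each language come back in sorted order rather than in the order they were originally supplied.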

flatten()

Flattens the feature into a dictionary with language mapped to List(Value("string")) and translation mapped to List(Value("string")).

def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
    from .features import List, Value
    return {
        "language": List(Value("string")),
        "translation": List(Value("string")),
    }

I/O

Direction Description
Input (Translation) Dictionary mapping language codes to translation strings (e.g., {"en": "the cat", "fr": "le chat"})
Output (Translation) Arrow struct with one string field per language
Input (TranslationVariableLanguages) Dictionary mapping language codes to one or more translation strings (e.g., {"en": "the cat", "fr": ["le chat", "la chatte"]})
Output (TranslationVariableLanguages) Arrow struct with language and translation list fields, sorted by language code

Dependencies

Module Purpose
pyarrow Arrow struct and list type definitions
dataclasses Dataclass decorator and field configuration

Usage

import datasets

# Fixed-language translation feature
trans = datasets.features.Translation(languages=['en', 'fr', 'de'])
# During data generation:
example = {'en': 'the cat', 'fr': 'le chat', 'de': 'die katze'}

# Variable-language translation feature
var_trans = datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])
# During data generation (some languages may be missing, some may have multiple translations):
example = {'en': 'the cat', 'fr': ['le chat', 'la chatte']}
encoded = var_trans.encode_example(example)
# Result: {'language': ('en', 'fr', 'fr'), 'translation': ('the cat', 'la chatte', 'le chat')}
