Implementation:Huggingface Datasets Translation

Overview

The translation module defines two feature types for representing multilingual translation data in datasets: Translation, for a fixed set of languages per example, and TranslationVariableLanguages, for a set of languages that can vary between examples. Both are implemented as dataclasses and store their data as PyArrow struct types. They exist for compatibility with TensorFlow Datasets (tfds) and support schema generation and flattening; TranslationVariableLanguages additionally provides example encoding.

Source File

Property	Value
Repository	huggingface/datasets
File	src/datasets/features/translation.py
Lines	129
Domain	NLP, Multilingual

Import

from datasets import Translation, TranslationVariableLanguages

Class: Translation

Type: @dataclass

A feature type for translations with a fixed set of languages per example. Every example must contain translations for all specified languages.

Constructor

@dataclass
class Translation:
    languages: list[str]
    id: Optional[str] = field(default=None, repr=False)

Parameter	Type	Default	Description
`languages`	`list[str]`	(required)	List of language codes (e.g., `['en', 'fr', 'de']`)
`id`	`Optional[str]`	`None`	Optional identifier for the feature

Class Variables

Variable	Value	Description
`dtype`	`"dict"`	String representation of the decoded type
`pa_type`	`None`	Set to `None`; the actual type is generated dynamically by `__call__`
`_type`	`"Translation"`	Internal type identifier

Methods

__call__()

Returns a pa.struct with one pa.string() field per language, sorted alphabetically.

def __call__(self):
    return pa.struct({lang: pa.string() for lang in sorted(self.languages)})
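
Because the fields are built from sorted(self.languages), the order in which languages are declared does not affect the resulting schema. The following is a minimal plain-Python sketch of that ordering; struct_field_names is a hypothetical helper that is not part of the library, and it uses the string "string" as a stand-in for pa.string().

```python
def struct_field_names(languages):
    """Hypothetical helper: mimic the field order produced by Translation.__call__.

    Returns (name, type-name) pairs; the real method builds pa.string() fields.
    """
    return [(lang, "string") for lang in sorted(languages)]


# Two declarations that differ only in order yield the same field layout.
fields_a = struct_field_names(["en", "fr", "de"])
fields_b = struct_field_names(["fr", "de", "en"])
print(fields_a)
print(fields_a == fields_b)
```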

flatten()

Flattens the feature into a dictionary mapping each language code to Value("string"), sorted alphabetically.

def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
    from .features import Value
    return {k: Value("string") for k in sorted(self.languages)}

Class: TranslationVariableLanguages

Type: @dataclass

A feature type for translations with a variable set of languages per example. Not every example needs to contain all languages, and a single language can have multiple translations.

Constructor

@dataclass
class TranslationVariableLanguages:
    languages: Optional[list] = None
    num_languages: Optional[int] = None
    id: Optional[str] = field(default=None, repr=False)

Parameter	Type	Default	Description
`languages`	`Optional[list]`	`None`	List of valid language codes. Sorted and deduplicated in `__post_init__`.
`num_languages`	`Optional[int]`	`None`	Auto-computed count of languages. Set in `__post_init__`.
`id`	`Optional[str]`	`None`	Optional identifier for the feature.

Post-Initialization

The __post_init__ method sorts and deduplicates the language list, and computes num_languages:

def __post_init__(self):
    self.languages = sorted(set(self.languages)) if self.languages else None
    self.num_languages = len(self.languages) if self.languages else None
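
To see the effect of this normalization without installing the library, the sketch below copies the same two lines into a stand-in dataclass. VariableLanguagesSketch is a hypothetical name used only for illustration; it is not the library class.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VariableLanguagesSketch:
    """Stand-in for illustration only; not the library class."""

    languages: Optional[list] = None
    num_languages: Optional[int] = None
    id: Optional[str] = field(default=None, repr=False)

    def __post_init__(self):
        # Same normalization as in the excerpt: sort, deduplicate, then count.
        self.languages = sorted(set(self.languages)) if self.languages else None
        self.num_languages = len(self.languages) if self.languages else None


# Duplicates are dropped, and a caller-supplied num_languages is overwritten.
feature = VariableLanguagesSketch(languages=["fr", "en", "fr", "de"], num_languages=99)
# With no languages given, both attributes stay None.
unrestricted = VariableLanguagesSketch()
```

Note that num_languages is always recomputed, so passing a value for it has no lasting effect in this sketch.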

Class Variables

Variable	Value	Description
`dtype`	`"dict"`	String representation of the decoded type
`pa_type`	`None`	Set to `None`; the actual type is generated dynamically by `__call__`
`_type`	`"TranslationVariableLanguages"`	Internal type identifier

Methods

__call__()

Returns a pa.struct with two list fields: language (list of strings) and translation (list of strings).

def __call__(self):
    return pa.struct({"language": pa.list_(pa.string()), "translation": pa.list_(pa.string())})

encode_example(translation_dict)

Encodes a translation dictionary into the variable-language format. The method proceeds as follows:

1. If the input's keys are exactly language and translation, it is treated as already encoded and returned unchanged.
2. Raises a ValueError if the input contains language codes outside the allowed set.
3. Expands each entry into (language, translation) tuples, producing one tuple per string when a language maps to a list of translations.
4. Sorts the tuples by language code and, within a language, by translation text.
5. Returns a dictionary with parallel language and translation sequences (tuples).

def encode_example(self, translation_dict):
    lang_set = set(self.languages)
    if set(translation_dict) == {"language", "translation"}:
        return translation_dict
    elif self.languages and set(translation_dict) - lang_set:
        raise ValueError(
            f"Some languages in example ({', '.join(sorted(set(translation_dict) - lang_set))}) are not in valid set ({', '.join(lang_set)})."
        )

    translation_tuples = []
    for lang, text in translation_dict.items():
        if isinstance(text, str):
            translation_tuples.append((lang, text))
        else:
            translation_tuples.extend([(lang, el) for el in text])

    languages, translations = zip(*sorted(translation_tuples))
    return {"language": languages, "translation": translations}
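
The same steps can be tried in isolation with a standalone function. This is a simplified sketch rather than the library method: encode_variable_languages is a hypothetical name, the allowed languages are passed as a required argument instead of being read from self.languages, and the error message is shortened.

```python
def encode_variable_languages(translation_dict, languages):
    """Hypothetical standalone version of the encoding steps, for illustration."""
    # Already-encoded input is returned unchanged.
    if set(translation_dict) == {"language", "translation"}:
        return translation_dict

    # Reject language codes outside the allowed set.
    unknown = set(translation_dict) - set(languages)
    if unknown:
        raise ValueError(f"Unknown languages: {', '.join(sorted(unknown))}")

    # Expand each language into one (language, text) pair per translation.
    pairs = []
    for lang, text in translation_dict.items():
        if isinstance(text, str):
            pairs.append((lang, text))
        else:
            pairs.extend((lang, el) for el in text)

    # Tuple sorting orders by language first, then by translation text.
    langs, texts = zip(*sorted(pairs))
    return {"language": langs, "translation": texts}


allowed = ["de", "en", "fr"]
encoded = encode_variable_languages({"fr": ["le chat", "la chatte"], "en": "the cat"}, allowed)
passthrough = encode_variable_languages({"language": ["en"], "translation": ["hi"]}, allowed)

try:
    encode_variable_languages({"es": "el gato"}, allowed)
    rejected = False
except ValueError:
    rejected = True
```

Sorting whole tuples is why "la chatte" ends up before "le chat" in the output: entries that share a language code are ordered by their text.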

flatten()

Flattens the feature into a dictionary with language mapped to List(Value("string")) and translation mapped to List(Value("string")).

def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
    from .features import List, Value
    return {
        "language": List(Value("string")),
        "translation": List(Value("string")),
    }

I/O

Direction	Description
Input (Translation)	Dictionary mapping language codes to translation strings (e.g., `{"en": "the cat", "fr": "le chat"}`)
Output (Translation)	Arrow struct with one string field per language
Input (TranslationVariableLanguages)	Dictionary mapping language codes to one or more translation strings (e.g., `{"en": "the cat", "fr": ["le chat", "la chatte"]}`)
Output (TranslationVariableLanguages)	Arrow struct with `language` and `translation` list fields, sorted by language code

Dependencies

Module	Purpose
`pyarrow`	Arrow struct and list type definitions
`dataclasses`	Dataclass decorator and field configuration

Usage

import datasets

# Fixed-language translation feature
trans = datasets.features.Translation(languages=['en', 'fr', 'de'])
# During data generation:
example = {'en': 'the cat', 'fr': 'le chat', 'de': 'die katze'}

# Variable-language translation feature
var_trans = datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])
# During data generation (some languages may be missing, some may have multiple translations):
example = {'en': 'the cat', 'fr': ['le chat', 'la chatte']}
encoded = var_trans.encode_example(example)
# Result: {'language': ('en', 'fr', 'fr'), 'translation': ('the cat', 'la chatte', 'le chat')}
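
Since the encoded form stores languages and translations as two aligned sequences, reading it back per language means zipping them together. The sketch below shows one way to do that; group_by_language is a hypothetical helper and not part of the datasets API.

```python
from collections import defaultdict


def group_by_language(encoded):
    """Hypothetical helper: rebuild a {language: [translations]} mapping."""
    grouped = defaultdict(list)
    # The two sequences are aligned index by index.
    for lang, text in zip(encoded["language"], encoded["translation"]):
        grouped[lang].append(text)
    return dict(grouped)


# Encoded value copied from the usage example.
encoded = {"language": ("en", "fr", "fr"), "translation": ("the cat", "la chatte", "le chat")}
grouped = group_by_language(encoded)
print(grouped)
```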

Related Pages

Huggingface_Datasets_Pdf
