Implementation:Huggingface Datasets Translation
Overview
The translation module defines two feature types for representing multilingual translation data in datasets: Translation, for a fixed set of languages per example, and TranslationVariableLanguages, for a set of languages that varies per example. Both are implemented as dataclasses and describe their Arrow storage as PyArrow struct types. They exist for compatibility with TensorFlow Datasets (tfds). Translation supports schema generation and flattening; TranslationVariableLanguages additionally supports example encoding.
Source File
| Property | Value |
|---|---|
| Repository | huggingface/datasets |
| File | src/datasets/features/translation.py |
| Lines | 129 |
| Domain | NLP, Multilingual |
Import
from datasets import Translation, TranslationVariableLanguages
Class: Translation
Type: @dataclass
A feature type for translations with a fixed set of languages per example. Every example must contain translations for all specified languages.
Constructor
@dataclass
class Translation:
    languages: list[str]
    id: Optional[str] = field(default=None, repr=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | list[str] | (required) | List of language codes (e.g., ['en', 'fr', 'de']) |
| id | Optional[str] | None | Optional identifier for the feature |
Class Variables
| Variable | Value | Description |
|---|---|---|
| dtype | "dict" | String representation of the decoded type |
| pa_type | None | Set to None; the actual type is generated dynamically by __call__ |
| _type | "Translation" | Internal type identifier |
Methods
__call__()
Returns a pa.struct with one pa.string() field per language, sorted alphabetically.
def __call__(self):
    return pa.struct({lang: pa.string() for lang in sorted(self.languages)})
flatten()
Flattens the feature into a dictionary mapping each language code to Value("string"), sorted alphabetically.
def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
    from .features import Value

    return {k: Value("string") for k in sorted(self.languages)}
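Both methods iterate over sorted(self.languages), so the resulting field order does not depend on the order in which the languages were declared. The sketch below illustrates only that ordering property: `sorted_schema` is a hypothetical helper, and the plain string "string" is a placeholder standing in for pa.string() and Value("string"), not the real types.

```python
def sorted_schema(languages, leaf="string"):
    """Hypothetical helper: map each language code to a placeholder leaf type.

    Keys are inserted in sorted order, mirroring the sorted(self.languages)
    iteration used by __call__ and flatten.
    """
    return {lang: leaf for lang in sorted(languages)}


a = sorted_schema(["fr", "en", "de"])
b = sorted_schema(["de", "fr", "en"])

print(list(a))             # ['de', 'en', 'fr']
print(list(a) == list(b))  # True: declaration order does not matter
```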
Class: TranslationVariableLanguages
Type: @dataclass
A feature type for translations with a variable set of languages per example. Not every example needs to contain all languages, and a single language can have multiple translations.
Constructor
@dataclass
class TranslationVariableLanguages:
    languages: Optional[list] = None
    num_languages: Optional[int] = None
    id: Optional[str] = field(default=None, repr=False)
| Parameter | Type | Default | Description |
|---|---|---|---|
| languages | Optional[list] | None | List of valid language codes. Sorted and deduplicated in __post_init__. |
| num_languages | Optional[int] | None | Auto-computed count of languages. Set in __post_init__. |
| id | Optional[str] | None | Optional identifier for the feature. |
Post-Initialization
The __post_init__ method sorts and deduplicates the language list, and computes num_languages:
def __post_init__(self):
    self.languages = sorted(set(self.languages)) if self.languages else None
    self.num_languages = len(self.languages) if self.languages else None
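To make the normalization concrete, here is a minimal standalone sketch that copies only the fields and the __post_init__ body quoted above. `LanguageSpec` is a hypothetical stand-in, not the library class, so it runs without datasets or pyarrow installed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LanguageSpec:
    """Hypothetical stand-in reproducing only the normalization step."""

    languages: Optional[list] = None
    num_languages: Optional[int] = None
    id: Optional[str] = field(default=None, repr=False)

    def __post_init__(self):
        # set() removes duplicates; sorted() gives a stable, alphabetical order.
        self.languages = sorted(set(self.languages)) if self.languages else None
        # The count is derived from the normalized list, so any value the
        # caller passed for num_languages is overwritten.
        self.num_languages = len(self.languages) if self.languages else None


spec = LanguageSpec(languages=["fr", "en", "fr", "de"], num_languages=99)
print(spec.languages)      # ['de', 'en', 'fr']
print(spec.num_languages)  # 3

# An empty list is falsy, so both fields collapse to None.
empty = LanguageSpec(languages=[])
print(empty.languages, empty.num_languages)  # None None
```

One consequence worth noting: because num_languages is always recomputed, passing it explicitly has no effect.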
Class Variables
| Variable | Value | Description |
|---|---|---|
| dtype | "dict" | String representation of the decoded type |
| pa_type | None | Set to None; the actual type is generated dynamically by __call__ |
| _type | "TranslationVariableLanguages" | Internal type identifier |
Methods
__call__()
Returns a pa.struct with two list fields: language (list of strings) and translation (list of strings).
def __call__(self):
    return pa.struct({"language": pa.list_(pa.string()), "translation": pa.list_(pa.string())})
encode_example(translation_dict)
Encodes a translation dictionary into the variable-language format. This method:
- If the input's keys are exactly language and translation, returns it unchanged.
- Validates that all language codes in the input are in the allowed set, raising ValueError otherwise.
- Expands entries where a language maps to multiple translations (list of strings) into one (language, text) pair per translation.
- Sorts the resulting pairs in ascending order by language code, then by translation text.
- Returns a dictionary with parallel language and translation tuples.
def encode_example(self, translation_dict):
    lang_set = set(self.languages)
    if set(translation_dict) == {"language", "translation"}:
        return translation_dict
    elif self.languages and set(translation_dict) - lang_set:
        raise ValueError(
            f"Some languages in example ({', '.join(sorted(set(translation_dict) - lang_set))}) are not in valid set ({', '.join(lang_set)})."
        )
    translation_tuples = []
    for lang, text in translation_dict.items():
        if isinstance(text, str):
            translation_tuples.append((lang, text))
        else:
            translation_tuples.extend([(lang, el) for el in text])
    languages, translations = zip(*sorted(translation_tuples))
    return {"language": languages, "translation": translations}
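The three code paths (pass-through, validation failure, expansion) can be exercised with a standalone restatement of the method. This is a sketch, not the library code: `encode_variable_languages` is a hypothetical free function, its error message is shortened, and unlike the quoted method it treats a missing language list as "no restriction" instead of calling set() on it.

```python
def encode_variable_languages(translation_dict, languages=None):
    """Standalone restatement of the encoding steps (not the library method)."""
    # Assumption: no configured languages means no validation is applied.
    lang_set = set(languages) if languages else set()

    # Pass-through: input that is already encoded is returned as-is.
    if set(translation_dict) == {"language", "translation"}:
        return translation_dict

    # Validation: reject codes outside the allowed set, if one is configured.
    unknown = set(translation_dict) - lang_set
    if languages and unknown:
        raise ValueError(f"Unknown languages: {', '.join(sorted(unknown))}")

    # Expansion: one (language, text) pair per translation.
    pairs = []
    for lang, text in translation_dict.items():
        if isinstance(text, str):
            pairs.append((lang, text))
        else:
            pairs.extend((lang, el) for el in text)

    # Tuples sort by language first, then by text; zip splits them back
    # into two parallel tuples.
    langs, texts = zip(*sorted(pairs))
    return {"language": langs, "translation": texts}


allowed = ["de", "en", "fr"]

encoded = encode_variable_languages(
    {"fr": ["le chat", "la chatte"], "en": "the cat"}, allowed
)
print(encoded)
# {'language': ('en', 'fr', 'fr'), 'translation': ('the cat', 'la chatte', 'le chat')}

already = {"language": ["en"], "translation": ["the cat"]}
print(encode_variable_languages(already, allowed) is already)  # True

try:
    encode_variable_languages({"es": "el gato"}, allowed)
except ValueError as err:
    print(err)  # Unknown languages: es
```

As in the quoted method, an empty dictionary is not handled: zip(*sorted([])) yields nothing, so unpacking it into two names raises ValueError.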
flatten()
Flattens the feature into a dictionary with language mapped to List(Value("string")) and translation mapped to List(Value("string")).
def flatten(self) -> Union["FeatureType", dict[str, "FeatureType"]]:
    from .features import List, Value

    return {
        "language": List(Value("string")),
        "translation": List(Value("string")),
    }
I/O
| Direction | Description |
|---|---|
| Input (Translation) | Dictionary mapping language codes to translation strings (e.g., {"en": "the cat", "fr": "le chat"}) |
| Output (Translation) | Arrow struct with one string field per language |
| Input (TranslationVariableLanguages) | Dictionary mapping language codes to one or more translation strings (e.g., {"en": "the cat", "fr": ["le chat", "la chatte"]}) |
| Output (TranslationVariableLanguages) | Arrow struct with language and translation list fields, sorted by language code |
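The variable-language output stores two parallel sequences, so reading it back means pairing entries by position. The sketch below shows one way to regroup an encoded example into a per-language dictionary. `group_by_language` is a hypothetical helper written for illustration and is not part of the module described here.

```python
from collections import defaultdict


def group_by_language(encoded):
    """Hypothetical helper: regroup parallel language/translation sequences."""
    grouped = defaultdict(list)
    # Entries at the same index belong together.
    for lang, text in zip(encoded["language"], encoded["translation"]):
        grouped[lang].append(text)
    return dict(grouped)


encoded = {
    "language": ("en", "fr", "fr"),
    "translation": ("the cat", "la chatte", "le chat"),
}
print(group_by_language(encoded))
# {'en': ['the cat'], 'fr': ['la chatte', 'le chat']}
```

Every language maps to a list here, even when it has a single translation, which keeps the result uniform at the cost of not reproducing the original str-or-list input shape.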
Dependencies
| Module | Purpose |
|---|---|
| pyarrow | Arrow struct and list type definitions |
| dataclasses | Dataclass decorator and field configuration |
Usage
import datasets
# Fixed-language translation feature
trans = datasets.features.Translation(languages=['en', 'fr', 'de'])
# During data generation:
example = {'en': 'the cat', 'fr': 'le chat', 'de': 'die katze'}
# Variable-language translation feature
var_trans = datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])
# During data generation (some languages may be missing, some may have multiple translations):
example = {'en': 'the cat', 'fr': ['le chat', 'la chatte']}
encoded = var_trans.encode_example(example)
# Result: {'language': ('en', 'fr', 'fr'), 'translation': ('the cat', 'la chatte', 'le chat')}