Implementation:Huggingface Datasets Languages Resource
| Source | src/datasets/utils/resources/languages.json |
|---|---|
| Type | Resource Doc |
| Domain(s) | Metadata, Localization |
| Last Updated | 2026-02-14 |
Overview
Description
The Languages Resource is a static JSON data file that provides a comprehensive mapping of ISO 639 language codes to their corresponding human-readable language names. The file contains approximately 8,025 entries covering the full breadth of the ISO 639 standard, including:
- ISO 639-1 two-letter codes (e.g., "en" for English, "fr" for French, "zh" for Chinese).
- ISO 639-2/3 three-letter codes (e.g., "eng" for English, "aaa" for Ghotuo, "zza" for Zaza).
- A special "code" entry for programming languages (e.g., C++, Java, JavaScript, Python).
The file is structured as a single flat JSON object where each key is a language code string and each value is the corresponding language name string. Some entries include multiple names separated by semicolons (e.g., "zza": "Zaza; Dimili; Dimli (macrolanguage); Kirdki; Kirmanjki (macrolanguage); Zazaki").
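Because some values pack several names into one semicolon-separated string, a consumer may want to split them before display or search. The following is a minimal sketch, not part of the datasets library: `split_names` is a hypothetical helper, and `SAMPLE_LANGUAGES` is a two-entry excerpt built from the values quoted on this page.

```python
from typing import Dict, List

# Excerpt of the mapping, using entries quoted on this page (not the full file).
SAMPLE_LANGUAGES: Dict[str, str] = {
    "en": "English",
    "zza": "Zaza; Dimili; Dimli (macrolanguage); Kirdki; Kirmanjki (macrolanguage); Zazaki",
}


def split_names(value: str) -> List[str]:
    """Hypothetical helper: split a semicolon-separated value into clean names."""
    return [name.strip() for name in value.split(";") if name.strip()]


# The first name in the list can serve as a short display label.
zza_names = split_names(SAMPLE_LANGUAGES["zza"])
primary_name = zza_names[0]
```

Entries with a single name pass through unchanged as a one-element list, so callers can treat every value uniformly.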
This resource is consumed internally by the datasets library's metadata validation system. When dataset authors annotate their datasets with language tags in dataset cards or metadata configurations, the library validates these tags against this lookup to ensure they correspond to recognized language codes. This enables consistent, standardized language annotation across the Hugging Face ecosystem.
Usage
This file is not imported directly by user code. It is loaded internally by metadata validation utilities within the datasets library to validate language codes specified in dataset metadata (e.g., YAML frontmatter in dataset cards). Dataset authors interact with it indirectly by specifying language codes in their dataset configurations.
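The validation described above amounts to a membership check of each declared tag against the mapping's keys. A minimal sketch of that idea follows, assuming the tags have already been parsed out of the YAML frontmatter into a list; `find_unknown_language_tags` and `KNOWN_LANGUAGES` are hypothetical names for illustration, not the library's actual API.

```python
from typing import Dict, Iterable, List

# Excerpt of the mapping, using entries quoted on this page (not the full file).
KNOWN_LANGUAGES: Dict[str, str] = {
    "code": "Programming language (C++, Java, Javascript, Python, etc.)",
    "en": "English",
    "fr": "French",
    "zh": "Chinese",
}


def find_unknown_language_tags(tags: Iterable[str], known: Dict[str, str]) -> List[str]:
    """Hypothetical helper: return the tags that are not keys of the mapping."""
    return [tag for tag in tags if tag not in known]


# Language tags as they might appear in a dataset card after YAML parsing.
card_languages = ["en", "fr", "xyz123"]
unknown = find_unknown_language_tags(card_languages, KNOWN_LANGUAGES)
```

Returning the list of offending tags, rather than a single boolean, lets a caller report every invalid code in one pass.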
Code Reference
Source Location
src/datasets/utils/resources/languages.json (8,026 lines)
Structure
{
"code": "Programming language (C++, Java, Javascript, Python, etc.)",
"aa": "Afar",
"aaa": "Ghotuo",
"aab": "Alumu-Tesu",
"ab": "Abkhazian",
"en": "English",
"fr": "French",
"zh": "Chinese",
"...": "... (approximately 8,025 entries total)",
"zzj": "Zuojiang Zhuang"
}
Import
This resource is not imported directly. It is loaded at runtime by internal metadata utilities, following a pattern along these lines:
import json
from pathlib import Path
# Resolve the resource relative to the directory of the loading module.
_LANGUAGES_JSON = Path(__file__).parent / "resources" / "languages.json"
with open(_LANGUAGES_JSON, encoding="utf-8") as f:
    LANGUAGES = json.load(f)
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| (none) | -- | This is a static resource file; it has no runtime inputs |
Outputs
| Name | Type | Description |
|---|---|---|
| Language mapping | dict[str, str] | A dictionary mapping ISO 639 language code strings to human-readable language name strings |
| Key format | str | 2-letter (ISO 639-1) or 3-letter (ISO 639-2/3) language codes, plus the special "code" key |
| Value format | str | Language name(s), potentially semicolon-separated for languages with multiple names |
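The output contract in the table can be expressed as a small sanity check. This is a sketch under the assumption that the contract holds exactly as described here (string keys of length 2 or 3, plus "code"; non-empty string values); `contract_violations` is a hypothetical helper and the sample dictionaries are excerpts, not the full file.

```python
from typing import Any, Dict, List


def contract_violations(mapping: Dict[Any, Any]) -> List[str]:
    """Hypothetical helper: list entries that break the documented contract."""
    problems = []
    for key, value in mapping.items():
        # Keys: the special "code" entry, or a 2- or 3-letter code string.
        if not isinstance(key, str) or not (key == "code" or len(key) in (2, 3)):
            problems.append(f"unexpected key: {key!r}")
        # Values: a non-empty language name string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"unexpected value for {key!r}: {value!r}")
    return problems


# Entries quoted on this page satisfy the contract.
sample = {
    "code": "Programming language (C++, Java, Javascript, Python, etc.)",
    "aa": "Afar",
    "aaa": "Ghotuo",
    "zzj": "Zuojiang Zhuang",
}
# Deliberately malformed entries, for illustration only.
broken = {"english": "English", "fr": ""}
```

Running such a check against the loaded file is a cheap way to catch accidental edits to the resource.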
Usage Examples
Validating a language code:
import json
from pathlib import Path
# Path is relative to the root of a datasets source checkout.
languages_path = Path("src/datasets/utils/resources/languages.json")
with open(languages_path, encoding="utf-8") as f:
    languages = json.load(f)
# Check if a language code is valid
assert "en" in languages # True -- English
assert "fr" in languages # True -- French
assert "xyz123" not in languages # True -- not a valid code
Looking up a language name:
languages["en"]
# "English"
languages["zh"]
# "Chinese"
languages["code"]
# "Programming language (C++, Java, Javascript, Python, etc.)"
Listing all two-letter (ISO 639-1) codes:
iso_639_1_codes = [code for code in languages if len(code) == 2]
# ["aa", "ab", "ae", "af", "ak", "am", "an", "ar", ...]
Related Pages
- Principle: Language Code Registry -- The design principle governing how ISO 639 language codes are used to standardize language annotations in dataset metadata, enabling cross-dataset discovery and filtering by language.