Implementation:Huggingface Datasets Languages Resource

Source	src/datasets/utils/resources/languages.json
Type	Resource Doc
Domain(s)	Metadata, Localization
Last Updated	2026-02-14

Overview

Description

The Languages Resource is a static JSON data file that provides a comprehensive mapping of ISO 639 language codes to their corresponding human-readable language names. The file contains approximately 8,025 entries covering the full breadth of the ISO 639 standard, including:

ISO 639-1 two-letter codes (e.g., "en" for English, "fr" for French, "zh" for Chinese).
ISO 639-2/3 three-letter codes (e.g., "eng" for English, "aaa" for Ghotuo, "zza" for Zaza).
A special "code" entry for programming languages (e.g., C++, Java, JavaScript, Python).

The file is structured as a single flat JSON object where each key is a language code string and each value is the corresponding language name string. Some entries include multiple names separated by semicolons (e.g., "zza": "Zaza; Dimili; Dimli (macrolanguage); Kirdki; Kirmanjki (macrolanguage); Zazaki").
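
Since some values pack several names into one semicolon-separated string, a consumer may want to split them into individual names. The following is a minimal sketch, not code from the library: `split_language_names` is a hypothetical helper, and the sample dictionary only reproduces entries quoted above rather than loading the real file.

```python
def split_language_names(value: str) -> list[str]:
    """Split a semicolon-separated name string into individual, trimmed names."""
    return [name.strip() for name in value.split(";") if name.strip()]


# Sample entries copied from the examples above (not the full file).
sample = {
    "en": "English",
    "zza": "Zaza; Dimili; Dimli (macrolanguage); Kirdki; Kirmanjki (macrolanguage); Zazaki",
}

# Map each code to the list of names it carries.
names_by_code = {code: split_language_names(value) for code, value in sample.items()}

print(names_by_code["en"])
print(names_by_code["zza"])
```

Splitting on `;` rather than `,` matters here, because the `"code"` entry uses commas inside a single name.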

This resource is consumed internally by the datasets library's metadata validation system. When dataset authors annotate their datasets with language tags in dataset cards or metadata configurations, the library validates these tags against this lookup to ensure they correspond to recognized language codes. This enables consistent, standardized language annotation across the Hugging Face ecosystem.

Usage

This file is not imported directly by user code. It is loaded internally by metadata validation utilities within the datasets library to validate language codes specified in dataset metadata (e.g., YAML frontmatter in dataset cards). Dataset authors interact with it indirectly by specifying language codes in their dataset configurations.
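
To make the indirect interaction concrete, the sketch below checks a list of declared language tags against the mapping's keys. It illustrates the idea only and is not the library's actual validation code: `find_unknown_language_tags` is a hypothetical name, and the small `known` dictionary stands in for the loaded file.

```python
def find_unknown_language_tags(tags, known_codes):
    """Return the tags that are not keys of the language mapping, in input order."""
    return [tag for tag in tags if tag not in known_codes]


# Stand-in for the loaded mapping; the real file has far more entries.
known = {
    "en": "English",
    "fr": "French",
    "zh": "Chinese",
    "code": "Programming language (C++, Java, Javascript, Python, etc.)",
}

# Tags as an author might declare them in a dataset card's YAML frontmatter.
declared = ["en", "fr", "xyz123"]

unknown = find_unknown_language_tags(declared, known)
if unknown:
    print(f"Unrecognized language codes: {unknown}")
```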

Code Reference

Source Location

src/datasets/utils/resources/languages.json (8,026 lines)

Structure

{
    "code": "Programming language (C++, Java, Javascript, Python, etc.)",
    "aa": "Afar",
    "aaa": "Ghotuo",
    "aab": "Alumu-Tesu",
    "ab": "Abkhazian",
    "en": "English",
    "fr": "French",
    "zh": "Chinese",
    "...": "... (approximately 8,025 entries total)",
    "zzj": "Zuojiang Zhuang"
}

Import

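This resource is not meant to be imported by user code. Internal metadata utilities load it at runtime, along these lines:

import json
from pathlib import Path

# Resolve the file relative to the loading module
# (assumes that module lives in src/datasets/utils/).
_LANGUAGES_JSON = Path(__file__).parent / "resources" / "languages.json"
with open(_LANGUAGES_JSON, encoding="utf-8") as f:
    LANGUAGES = json.load(f)  # dict mapping language code -> language name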

I/O Contract

Inputs

Name	Type	Description
(none)	--	This is a static resource file; it has no runtime inputs

Outputs

Name	Type	Description
Language mapping	`dict[str, str]`	A dictionary mapping ISO 639 language code strings to human-readable language name strings
Key format	`str`	2-letter (ISO 639-1) or 3-letter (ISO 639-2/3) language codes, plus the special `"code"` key
Value format	`str`	Language name(s), potentially semicolon-separated for languages with multiple names
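
The key formats in the table above can be told apart by length alone. The sketch below does that under a stated assumption: a two-letter lowercase key is treated as ISO 639-1 and a three-letter lowercase key as ISO 639-2/3, without consulting the standard itself. `classify_key` and its labels are hypothetical names chosen for illustration.

```python
from collections import Counter


def classify_key(key: str) -> str:
    """Classify a mapping key by the formats described in the table above."""
    if key == "code":
        return "special"
    if key.isalpha() and key.islower():
        if len(key) == 2:
            return "iso639-1"    # assumption: length 2 means ISO 639-1
        if len(key) == 3:
            return "iso639-2/3"  # assumption: length 3 means ISO 639-2/3
    return "other"


# Keys taken from the structure excerpt earlier on this page.
sample_keys = ["code", "aa", "aaa", "aab", "ab", "en", "zzj"]
counts = Counter(classify_key(key) for key in sample_keys)
print(dict(counts))
```

Handling `"code"` before the length checks keeps the special key from being reported as an unexpected format.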

Usage Examples

Validating a language code:

import json
from pathlib import Path

# Path is relative to the root of a datasets source checkout
languages_path = Path("src/datasets/utils/resources/languages.json")
with open(languages_path, encoding="utf-8") as f:
    languages = json.load(f)

# Membership tests on the mapping's keys; each assertion passes silently
assert "en" in languages          # English
assert "fr" in languages          # French
assert "xyz123" not in languages  # not a key in the mapping

Looking up a language name:

languages["en"]
# "English"

languages["zh"]
# "Chinese"

languages["code"]
# "Programming language (C++, Java, Javascript, Python, etc.)"

Listing all two-letter (ISO 639-1) codes:

iso_639_1_codes = [code for code in languages if len(code) == 2]
# ["aa", "ab", "ae", "af", "ak", "am", "an", "ar", ...]
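
Going the other way, from a name to its codes, needs a scan over the values because the mapping is keyed by code. The following is a small sketch of such a search, assuming a case-insensitive substring match is acceptable. `find_codes_by_name` is a hypothetical helper, and the sample dictionary reuses entries shown on this page.

```python
def find_codes_by_name(languages, query):
    """Return the codes whose name string contains `query`, ignoring case."""
    needle = query.casefold()
    return sorted(
        code for code, name in languages.items() if needle in name.casefold()
    )


# Sample entries from this page; load the real file to search all entries.
sample = {
    "aa": "Afar",
    "ab": "Abkhazian",
    "en": "English",
    "fr": "French",
    "zh": "Chinese",
}

print(find_codes_by_name(sample, "french"))
```

A substring match also finds names inside semicolon-separated values, at the cost of occasional partial-word hits.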

Related Pages

Principle: Language Code Registry -- The design principle governing how ISO 639 language codes are used to standardize language annotations in dataset metadata, enabling cross-dataset discovery and filtering by language.
