Implementation:Huggingface Datasets Languages Resource

From Leeroopedia
Source: src/datasets/utils/resources/languages.json
Type: Resource Doc
Domain(s): Metadata, Localization
Last Updated: 2026-02-14

Overview

Description

The Languages Resource is a static JSON data file that maps ISO 639 language codes to their human-readable language names. The file contains approximately 8,025 entries spanning several parts of the ISO 639 standard, including:

  • ISO 639-1 two-letter codes (e.g., "en" for English, "fr" for French, "zh" for Chinese).
  • ISO 639-2/3 three-letter codes (e.g., "eng" for English, "aaa" for Ghotuo, "zza" for Zaza).
  • A special "code" entry for programming languages (e.g., C++, Java, JavaScript, Python).

The file is structured as a single flat JSON object where each key is a language code string and each value is the corresponding language name string. Some entries include multiple names separated by semicolons (e.g., "zza": "Zaza; Dimili; Dimli (macrolanguage); Kirdki; Kirmanjki (macrolanguage); Zazaki").
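Because multi-name values are plain strings, a consumer that needs individual names has to split them itself. The following is a minimal sketch of that step. The helper name `split_names` and the three-entry sample mapping are illustrative assumptions, not part of the datasets library.

```python
from __future__ import annotations

# Illustrative sample only: three entries taken from the examples on this page.
SAMPLE_LANGUAGES = {
    "en": "English",
    "fr": "French",
    "zza": "Zaza; Dimili; Dimli (macrolanguage); Kirdki; Kirmanjki (macrolanguage); Zazaki",
}


def split_names(value: str) -> list[str]:
    """Split a semicolon-separated value into individual names.

    Hypothetical helper: strips surrounding whitespace and drops empty pieces.
    """
    return [name.strip() for name in value.split(";") if name.strip()]


zza_names = split_names(SAMPLE_LANGUAGES["zza"])  # six separate names
en_names = split_names(SAMPLE_LANGUAGES["en"])    # single-name values yield a one-item list
```

Splitting on ";" alone is enough here because the parenthesized qualifiers such as "(macrolanguage)" contain no semicolons in the example shown.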

This resource is consumed internally by the datasets library's metadata validation system. When dataset authors annotate their datasets with language tags in dataset cards or metadata configurations, the library validates these tags against this lookup to ensure they correspond to recognized language codes. This enables consistent, standardized language annotation across the Hugging Face ecosystem.

Usage

This file is not imported directly by user code. It is loaded internally by metadata validation utilities within the datasets library to validate language codes specified in dataset metadata (e.g., YAML frontmatter in dataset cards). Dataset authors interact with it indirectly by specifying language codes in their dataset configurations.
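The validation described above amounts to a membership test of each tag against the keys of the mapping. The sketch below illustrates that idea only. The function name `find_unknown_language_tags` is hypothetical, the sample mapping is a small stand-in for the real file, and exact case-sensitive matching is an assumption rather than confirmed library behavior.

```python
from __future__ import annotations

# Stand-in for the parsed languages.json (assumption: same key/value shape).
KNOWN_LANGUAGES = {
    "code": "Programming language (C++, Java, Javascript, Python, etc.)",
    "en": "English",
    "fr": "French",
    "zh": "Chinese",
}


def find_unknown_language_tags(tags: list[str], known: dict[str, str]) -> list[str]:
    """Return the tags, in input order, that are not keys of the mapping.

    Hypothetical helper; it performs an exact, case-sensitive lookup.
    """
    return [tag for tag in tags if tag not in known]


# Tags as a dataset author might list them in card metadata.
unknown = find_unknown_language_tags(["en", "fr", "xyz123"], KNOWN_LANGUAGES)
```

An empty result means every tag was recognized; a non-empty result lists the tags a validator would report.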

Code Reference

Source Location

src/datasets/utils/resources/languages.json (8,026 lines)

Structure

{
    "code": "Programming language (C++, Java, Javascript, Python, etc.)",
    "aa": "Afar",
    "aaa": "Ghotuo",
    "aab": "Alumu-Tesu",
    "ab": "Abkhazian",
    "en": "English",
    "fr": "French",
    "zh": "Chinese",
    "...": "... (approximately 8,025 entries total)",
    "zzj": "Zuojiang Zhuang"
}

Import

This resource is not imported directly. It is loaded at runtime by internal metadata utilities:

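import json
from pathlib import Path

# The path is resolved relative to the module that contains this code,
# so it only works from inside the package directory that holds "resources/".
_LANGUAGES_JSON = Path(__file__).parent / "resources" / "languages.json"
with open(_LANGUAGES_JSON, encoding="utf-8") as f:
    # Parse the whole file into a dict of code -> name
    LANGUAGES = json.load(f)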
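Code outside the package cannot rely on `__file__`, so one option is a small loader that takes an explicit path and parses the file only once. This is a sketch under stated assumptions: `load_languages` is a hypothetical helper, not a datasets API, and the demonstration writes a two-entry stand-in file instead of reading the real languages.json.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_languages(path: str) -> dict:
    """Parse the JSON file at `path` once and cache the resulting dict.

    Hypothetical helper. Repeated calls with the same path return the
    same dict object, so callers should treat it as read-only.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Demonstration with a small stand-in file, not the real languages.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "languages.json"
    demo_path.write_text(json.dumps({"en": "English", "fr": "French"}), encoding="utf-8")
    first = load_languages(str(demo_path))
    second = load_languages(str(demo_path))  # served from the cache, no second read
```

Caching suits a file of this size that never changes at runtime, at the cost of every caller sharing one mutable dict.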

I/O Contract

Inputs

Name | Type | Description
(none) | -- | This is a static resource file; it has no runtime inputs

Outputs

Name | Type | Description
Language mapping | dict[str, str] | A dictionary mapping ISO 639 language code strings to human-readable language name strings
Key format | str | 2-letter (ISO 639-1) or 3-letter (ISO 639-2/3) language codes, plus the special "code" key
Value format | str | Language name(s), potentially semicolon-separated for languages with multiple names
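The output contract above can be checked mechanically against any parsed mapping. The sketch below encodes the contract exactly as stated on this page (string keys and values, keys of length 2 or 3 or the special "code" key, non-empty names). The function name `contract_violations` and both sample mappings are illustrative assumptions, and whether the real file satisfies every rule has not been verified here.

```python
from __future__ import annotations


def contract_violations(mapping: dict) -> list[tuple[object, str]]:
    """Return (key, reason) pairs for entries that break the stated contract.

    Hypothetical helper; an empty list means no violations were found.
    """
    problems = []
    for key, value in mapping.items():
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append((key, "non-string key or value"))
        elif key != "code" and len(key) not in (2, 3):
            problems.append((key, "unexpected key length"))
        elif not value.strip():
            problems.append((key, "empty name"))
    return problems


# Sample data for illustration only.
good = {"code": "Programming language (C++, Java, Javascript, Python, etc.)",
        "en": "English", "aaa": "Ghotuo"}
bad = {"english": "English", "en": "  "}

good_report = contract_violations(good)
bad_report = contract_violations(bad)
```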

Usage Examples

Validating a language code:

import json
from pathlib import Path

# Path is relative to the root of a checkout of the datasets repository
languages_path = Path("src/datasets/utils/resources/languages.json")
with open(languages_path, encoding="utf-8") as f:
    languages = json.load(f)

# Check if a language code is valid
assert "en" in languages        # True -- English
assert "fr" in languages        # True -- French
assert "xyz123" not in languages # True -- not a valid code

Looking up a language name:

languages["en"]
# "English"

languages["zh"]
# "Chinese"

languages["code"]
# "Programming language (C++, Java, Javascript, Python, etc.)"

Listing all two-letter (ISO 639-1) codes:

iso_639_1_codes = [code for code in languages if len(code) == 2]
# ["aa", "ab", "ae", "af", "ak", "am", "an", "ar", ...]
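The same length-based filter generalizes to grouping every key by its length, which separates two-letter codes, three-letter codes, and the special "code" key in one pass. This is a minimal sketch on a six-entry sample taken from the Structure excerpt. The helper name `group_by_code_length` is hypothetical.

```python
from __future__ import annotations

from collections import defaultdict

# Sample entries copied from the Structure excerpt on this page.
SAMPLE = {
    "code": "Programming language (C++, Java, Javascript, Python, etc.)",
    "aa": "Afar",
    "aaa": "Ghotuo",
    "aab": "Alumu-Tesu",
    "ab": "Abkhazian",
    "en": "English",
}


def group_by_code_length(mapping: dict[str, str]) -> dict[int, list[str]]:
    """Group keys by length, preserving the mapping's iteration order."""
    groups = defaultdict(list)
    for code in mapping:
        groups[len(code)].append(code)
    return dict(groups)


groups = group_by_code_length(SAMPLE)
two_letter = groups[2]    # ISO 639-1 style keys
three_letter = groups[3]  # ISO 639-2/3 style keys
```

Key length is only a heuristic for which ISO 639 part a code belongs to; the file itself carries no such label.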

Related Pages

  • Principle: Language Code Registry -- The design principle governing how ISO 639 language codes are used to standardize language annotations in dataset metadata, enabling cross-dataset discovery and filtering by language.
