
Implementation:Huggingface Datatrove TypesHelper

From Leeroopedia
Knowledge Sources
Domains: Data Processing, NLP
Last Updated: 2026-02-14 17:00 GMT

Overview

Defines constant registries (plain classes with string attributes) for language codes, pipeline statistic keys, and file extension conventions used throughout the datatrove codebase.

Description

The TypesHelper module (datatrove.utils.typeshelper) contains four classes that serve as constant registries for the entire datatrove framework. The primary class is Languages, which holds over 4,000 class-level string attributes mapping human-readable language names to ISO 639-3 codes with optional script tags. A language may have several entries: one for the base language code (e.g., english = "eng") and one or more for specific scripts (e.g., english__latn = "eng_Latn"). In an attribute name, a double underscore separates the language name from the script; it corresponds to the single underscore that separates the language code from the script tag in the value.
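The naming convention can be illustrated with a small sketch. It uses a stand-in Languages class holding only the two entries shown on this page, so that it runs without datatrove installed; split_code and constants_of are hypothetical helpers written for this example and are not part of datatrove.

```python
from typing import Dict, Optional, Tuple


# Stand-in for a tiny slice of the real Languages class (assumption: only
# the two entries quoted on this page are reproduced here).
class Languages:
    english = "eng"
    english__latn = "eng_Latn"


def split_code(code: str) -> Tuple[str, Optional[str]]:
    """Split a code value into (language, script); script is None if absent."""
    language, _, script = code.partition("_")
    return language, script or None


def constants_of(cls) -> Dict[str, str]:
    """Collect the public string attributes defined directly on a class."""
    return {
        name: value
        for name, value in vars(cls).items()
        if not name.startswith("_") and isinstance(value, str)
    }


base = split_code(Languages.english)            # base language code only
scripted = split_code(Languages.english__latn)  # language code plus script tag
entries = constants_of(Languages)               # attribute name -> code value
```

Because the constants are plain class attributes rather than Enum members, they compare equal to ordinary strings and can be passed wherever a language code string is expected.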

The StatHints class provides three standard string constants (total, dropped, and forwarded) that virtually all pipeline steps use as keys when recording document flow statistics. Sharing these keys keeps metric names uniform across pipeline components.
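As a rough sketch of how shared keys keep counts consistent, the example below tallies document flow with a collections.Counter. The StatHints class here is a stand-in copied from the signature on this page, and filter_with_stats is a hypothetical helper; datatrove's own statistics tracking is not reproduced.

```python
from collections import Counter


# Stand-in copy of the three constants listed on this page.
class StatHints:
    total = "total"
    dropped = "dropped"
    forwarded = "forwarded"


def filter_with_stats(docs, keep, stats):
    """Hypothetical helper: yield docs that pass `keep`, counting each outcome."""
    for doc in docs:
        stats[StatHints.total] += 1
        if keep(doc):
            stats[StatHints.forwarded] += 1
            yield doc
        else:
            stats[StatHints.dropped] += 1


stats = Counter()
# Empty strings are falsy, so keep=bool drops them.
kept = list(filter_with_stats(["a", "", "bc"], keep=bool, stats=stats))
```

Every processed document increments total and exactly one of forwarded or dropped, so total always equals forwarded plus dropped.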

The ExtensionHelperSD and ExtensionHelperES classes define file extension constants used by the Sentence Deduplication and Exact Substring deduplication pipelines respectively. These ensure consistent file naming when intermediate data structures (signatures, duplicate lists, byte ranges) are written to disk during multi-stage deduplication workflows.
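The sketch below shows one way extension constants can keep artifact names consistent between stages. The ExtensionHelperSD class is a stand-in with two values copied from the signature on this page. The zero-padded rank prefix and the helpers artifact_name and files_for_stage are assumptions made for illustration, not datatrove's documented naming scheme.

```python
# Stand-in with two of the extension values listed on this page.
class ExtensionHelperSD:
    stage_1_signature = ".c4_sig"
    stage_2_duplicates = ".c4_dup"


def artifact_name(rank: int, extension: str) -> str:
    """Hypothetical naming scheme: zero-padded worker rank plus extension."""
    return f"{rank:05d}{extension}"


def files_for_stage(filenames, extension):
    """Select the files produced by one stage by matching their extension."""
    return sorted(name for name in filenames if name.endswith(extension))


names = [
    artifact_name(0, ExtensionHelperSD.stage_1_signature),
    artifact_name(1, ExtensionHelperSD.stage_1_signature),
    artifact_name(0, ExtensionHelperSD.stage_2_duplicates),
]
signatures = files_for_stage(names, ExtensionHelperSD.stage_1_signature)
```

A later stage can then locate the outputs of an earlier one by extension alone, without hard-coding the suffix string in more than one place.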

Usage

Reference Languages constants when specifying language parameters for filters, tokenizers, or language identification steps. Use StatHints when updating statistics within custom pipeline steps. Use the extension helpers when working with deduplication pipeline file artifacts.

Code Reference

Source Location

Signature

class Languages:
    english = "eng"
    english__latn = "eng_Latn"
    # ... 4000+ language/script entries

class StatHints:
    total = "total"
    dropped = "dropped"
    forwarded = "forwarded"

class ExtensionHelperSD:
    stage_1_signature = ".c4_sig"
    stage_2_duplicates = ".c4_dup"
    stage_2_counts = ".c4_counts"
    index = ".c4_index"

class ExtensionHelperES:
    stage_1_sequence = ".es_sequence"
    stage_1_sequence_size = ".es_sequence.size"
    stage_2_big_sequence = ".big_sequence"
    stage_2_bytes_offset = ".info"
    stage_3_bytes_ranges = ".bytearange"

Import

from datatrove.utils.typeshelper import Languages, StatHints
from datatrove.utils.typeshelper import ExtensionHelperSD, ExtensionHelperES

I/O Contract

Inputs

Name | Type | Required | Description
N/A | N/A | N/A | This module defines only constants; it takes no runtime inputs

Outputs

Name | Type | Description
Language codes | str | ISO 639-3 codes with optional script tags (e.g., "eng_Latn")
Stat keys | str | Standard statistic key strings ("total", "dropped", "forwarded")
File extensions | str | Conventional file extension strings for deduplication artifacts

Usage Examples

Basic Usage

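from datatrove.pipeline.base import PipelineStep
from datatrove.utils.typeshelper import Languages, StatHints

# Use language constants for pipeline configuration
language = Languages.english  # "eng"
language_with_script = Languages.french__latn  # "fra_Latn"

# Use stat hints in custom pipeline steps. stat_update is called on the
# step itself, so the calls belong inside a PipelineStep method.
class DropEmpty(PipelineStep):  # illustrative step name, not part of datatrove
    def run(self, data, rank=0, world_size=1):
        for doc in data:
            self.stat_update(StatHints.total)
            if doc.text.strip():
                self.stat_update(StatHints.forwarded)
                yield doc
            else:
                self.stat_update(StatHints.dropped)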
