Implementation:Huggingface Datatrove TypesHelper
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, NLP |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Defines plain classes of string constants for language codes, pipeline statistic keys, and file extension conventions used throughout the datatrove codebase.
Description
The TypesHelper module contains four classes that serve as constant registries for the entire datatrove framework. The primary class is Languages, which contains over 4,000 class-level string attributes mapping human-readable language names to ISO 639-3 codes with optional script tags. Each language may have multiple entries: one for the base language code (e.g., english = "eng") and one or more for specific scripts (e.g., english__latn = "eng_Latn"). In attribute names, a double underscore separates the language name from the lowercased script tag; in the code values, a single underscore separates the ISO 639-3 code from the four-letter script tag.
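The naming convention can be illustrated with a small stand-in class. This is a sketch rather than the datatrove source: `LanguagesSketch` copies only the two entries quoted above, and `split_code` and `attribute_for` are hypothetical helpers introduced here for illustration.

```python
from typing import Optional, Tuple


class LanguagesSketch:
    """Hypothetical stand-in for Languages with only the two entries quoted above."""

    english = "eng"
    english__latn = "eng_Latn"


def split_code(code: str) -> Tuple[str, Optional[str]]:
    """Split a code value into (language, script); script is None when absent.

    Hypothetical helper, not part of datatrove.
    """
    language, _, script = code.partition("_")
    return language, script or None


def attribute_for(name: str, script: Optional[str] = None) -> str:
    """Build the attribute name for a language name and an optional script tag.

    Hypothetical helper: joins name and lowercased script with a double underscore.
    """
    return name if script is None else f"{name}__{script.lower()}"


base = split_code(LanguagesSketch.english)
scripted = split_code(LanguagesSketch.english__latn)
looked_up = getattr(LanguagesSketch, attribute_for("english", "Latn"))
```

With the real class, the same `getattr` lookup would presumably work against `Languages`, but whether a given language/script attribute exists should be checked in typeshelper.py.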
The StatHints class provides three standard string constants -- total, dropped, and forwarded -- used by virtually all pipeline steps to track document flow statistics consistently. These constants serve as keys for the statistics tracking system, ensuring uniform metric naming across all pipeline components.
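A minimal sketch of how the three keys keep counters consistent. `StatHintsSketch` mirrors the constants described above, while `run_filter` and the `Counter` are assumptions made for illustration: they stand in for a pipeline step and for datatrove's own statistics tracking, neither of which is reproduced here.

```python
from collections import Counter


class StatHintsSketch:
    """Stand-in mirroring the three StatHints constants."""

    total = "total"
    dropped = "dropped"
    forwarded = "forwarded"


def run_filter(docs, keep):
    """Hypothetical filter loop that counts documents under the shared keys."""
    stats = Counter()
    kept = []
    for doc in docs:
        stats[StatHintsSketch.total] += 1
        if keep(doc):
            stats[StatHintsSketch.forwarded] += 1
            kept.append(doc)
        else:
            stats[StatHintsSketch.dropped] += 1
    return kept, stats


# Keep only documents longer than 10 characters.
kept, stats = run_filter(
    ["short", "a much longer document"],
    keep=lambda doc: len(doc) > 10,
)
```

Because every step writes to the same three keys, the invariant total = forwarded + dropped can be checked uniformly across steps.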
The ExtensionHelperSD and ExtensionHelperES classes define file extension constants used by the sentence deduplication and exact substring deduplication pipelines, respectively. These ensure consistent file naming when intermediate data structures (signatures, duplicate lists, byte ranges) are written to disk during multi-stage deduplication workflows.
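As a sketch of why shared extension constants help, the snippet below writes and later selects stage artifacts by suffix. The two extension values are copied from the Signature section; the zero-padded rank naming scheme and both helper functions are hypothetical and are not taken from datatrove.

```python
class ExtensionHelperSDSketch:
    """Stand-in holding two of the ExtensionHelperSD values."""

    stage_1_signature = ".c4_sig"
    stage_2_duplicates = ".c4_dup"


def artifact_name(rank: int, extension: str) -> str:
    """Hypothetical naming scheme: zero-padded worker rank plus stage extension."""
    return f"{rank:05d}{extension}"


def files_for_stage(filenames, extension):
    """Select the files produced by one stage by matching on the extension."""
    return [name for name in filenames if name.endswith(extension)]


names = [
    artifact_name(0, ExtensionHelperSDSketch.stage_1_signature),
    artifact_name(1, ExtensionHelperSDSketch.stage_1_signature),
    artifact_name(0, ExtensionHelperSDSketch.stage_2_duplicates),
]
signatures = files_for_stage(names, ExtensionHelperSDSketch.stage_1_signature)
```

A later stage that filters on the same constant finds exactly the files the earlier stage wrote, without either stage hard-coding the suffix.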
Usage
Reference Languages constants when specifying language parameters for filters, tokenizers, or language identification steps. Use StatHints when updating statistics within custom pipeline steps. Use the extension helpers when working with deduplication pipeline file artifacts.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/utils/typeshelper.py
- Lines: 1-4113
Signature
class Languages:
    english = "eng"
    english__latn = "eng_Latn"
    # ... 4000+ language/script entries
class StatHints:
    total = "total"
    dropped = "dropped"
    forwarded = "forwarded"
class ExtensionHelperSD:
    stage_1_signature = ".c4_sig"
    stage_2_duplicates = ".c4_dup"
    stage_2_counts = ".c4_counts"
    index = ".c4_index"
class ExtensionHelperES:
    stage_1_sequence = ".es_sequence"
    stage_1_sequence_size = ".es_sequence.size"
    stage_2_big_sequence = ".big_sequence"
    stage_2_bytes_offset = ".info"
    stage_3_bytes_ranges = ".bytearange"
Import
from datatrove.utils.typeshelper import Languages, StatHints
from datatrove.utils.typeshelper import ExtensionHelperSD, ExtensionHelperES
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| N/A | N/A | N/A | This module defines only constants; it takes no runtime inputs |
Outputs
| Name | Type | Description |
|---|---|---|
| Language codes | str | ISO 639-3 codes with optional script tags (e.g., "eng_Latn") |
| Stat keys | str | Standard statistic key strings ("total", "dropped", "forwarded") |
| File extensions | str | Conventional file extension strings for deduplication artifacts |
Usage Examples
Basic Usage
from datatrove.utils.typeshelper import Languages, StatHints
# Use language constants for pipeline configuration
language = Languages.english  # "eng"
language_with_script = Languages.french__latn  # "fra_Latn"
# Use stat hints in custom pipeline steps.
# The calls below belong inside a method of a pipeline step, where self.stat_update is available.
self.stat_update(StatHints.total)
self.stat_update(StatHints.forwarded)
self.stat_update(StatHints.dropped)