Implementation: Hugging Face Datatrove WordStats
Overview
WordStats is a pipeline step that computes word-level statistical metrics for each document in a `DocumentsPipeline`. It extends `BaseStats` and writes per-rank JSON output files containing aggregated word-level statistics.
Signature
```python
class WordStats(BaseStats):
    def __init__(
        self,
        output_folder: DataFolderLike,
        stop_words: list[str] = STOP_WORDS,
        short_word_max_chars_threshold: list[int] | None = None,
        long_word_max_chars_threshold: list[int] | None = None,
        language: str = Languages.english,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
```
Import
```python
from datatrove.pipeline.stats import WordStats
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `output_folder` | `DataFolderLike` | (required) | Path or data folder where per-rank statistics JSON files will be written. |
| `stop_words` | `list[str]` | `STOP_WORDS` | List of stop words used for computing the stop word ratio. Defaults to the Gopher quality filter stop word list. |
| `short_word_max_chars_threshold` | `list[int] \| None` | `None` (falls back to `[3]`) | Character length thresholds for short word ratio computation. A word with length <= threshold is counted as short. |
| `long_word_max_chars_threshold` | `list[int] \| None` | `None` (falls back to `[7]`) | Character length thresholds for long word ratio computation. A word with length >= threshold is counted as long. |
| `language` | `str` | `Languages.english` | Language identifier used to load the appropriate word tokenizer. |
| `groups_to_compute` | `list[GROUP]` | `["summary", "histogram", "fqdn", "suffix"]` | List of grouping strategies for aggregating statistics. |
| `histogram_round_digits` | `int` | `3` | Number of decimal digits histogram bin values are rounded to, controlling histogram granularity. |
| `top_k_config` | `TopKConfig` | `DEFAULT_TOP_K_CONFIG` | Configuration for top-k truncation of high-cardinality groups (`fqdn`, `suffix`). The default retains the top 100,000 keys. |
Available Stats
The following statistics are extracted per document via the `extract_stats` method:
| Stat Name | Description |
|---|---|
| `n_words` | Number of words in the document. |
| `avg_word_length` | Average character length of words: `sum(len(w) for w in words) / len(words)`. |
| `avg_words_per_line` | Average number of words per line: `len(words) / len(lines)`. |
| `short_word_ratio_{chars}` | Ratio of words with length <= `{chars}` characters. One stat per threshold in `short_word_max_chars_threshold`. |
| `long_word_ratio_{chars}` | Ratio of words with length >= `{chars}` characters. One stat per threshold in `long_word_max_chars_threshold`. |
| `type_token_ratio` | Type-token ratio: `len(set(words)) / len(words)`. |
| `uppercase_word_ratio` | Ratio of words that are entirely uppercase (`word.isupper()`). |
| `capitalized_word_ratio` | Ratio of words in title case (`word.istitle()`). |
| `stop_word_ratio` | Ratio of words found in the configured stop word list. |
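To make the formulas above concrete, here is a simplified sketch of the per-document metrics. It uses naive whitespace splitting instead of the library's language-specific tokenizer, and `word_metrics` is a hypothetical helper, not part of datatrove:

```python
def word_metrics(text: str, stop_words: set[str],
                 short_max: int = 3, long_max: int = 7) -> dict[str, float]:
    """Simplified per-document word stats (assumes a non-empty document)."""
    words = text.split()        # the real step uses load_word_tokenizer(...)
    lines = text.splitlines()
    n = len(words)
    return {
        "n_words": n,
        "avg_word_length": sum(len(w) for w in words) / n,
        "avg_words_per_line": n / len(lines),
        f"short_word_ratio_{short_max}": sum(len(w) <= short_max for w in words) / n,
        f"long_word_ratio_{long_max}": sum(len(w) >= long_max for w in words) / n,
        "type_token_ratio": len(set(words)) / n,
        "uppercase_word_ratio": sum(w.isupper() for w in words) / n,
        "capitalized_word_ratio": sum(w.istitle() for w in words) / n,
        "stop_word_ratio": sum(w.lower() in stop_words for w in words) / n,
    }
```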
I/O
Input: `DocumentsPipeline`, a stream of `Document` objects. Each document must have a `text` attribute and `metadata` containing a `url` field (required for `fqdn`/`suffix` grouping).
Output: per-rank JSON files written to `output_folder/{group}/{stat_name}/{rank:05d}.json`. Each JSON file contains a `MetricStatsDict` mapping keys to `MetricStats` objects (tracking `n`, mean, variance, min, max). Documents are yielded downstream with the extracted stats added to `doc.metadata`.
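For illustration, the documented output layout can be expressed as a small path helper (`stats_output_path` is a hypothetical function, not part of the library):

```python
def stats_output_path(output_folder: str, group: str, stat_name: str, rank: int) -> str:
    # Mirrors the documented layout: output_folder/{group}/{stat_name}/{rank:05d}.json
    return f"{output_folder}/{group}/{stat_name}/{rank:05d}.json"
```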
Key Implementation Details
- Word tokenization uses `load_word_tokenizer(self.language)` to load a language-specific tokenizer rather than naive whitespace splitting.
- Lines are obtained via `doc.text.splitlines()` for the `avg_words_per_line` computation.
- The stop word list defaults to `STOP_WORDS`, imported from the Gopher quality filter module.
- Statistics are aggregated using `MetricStats`, which implements Welford's online algorithm for numerically stable running mean and variance.
- High-cardinality groups (`fqdn`, `suffix`) are truncated to the top-k keys (by document count) to manage memory in distributed settings.
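The Welford update mentioned above can be sketched as follows; this is a minimal standalone version for intuition, not datatrove's actual `MetricStats` class:

```python
class RunningStats:
    """Welford's online algorithm: single-pass, numerically stable mean/variance."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses both old and new mean

    @property
    def variance(self) -> float:
        # Population variance; return 0.0 before two samples are seen.
        return self.m2 / self.n if self.n > 1 else 0.0
```

The key property is that values never need to be buffered, which is what makes per-rank aggregation over large document streams cheap.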
Example Usage
```python
from datatrove.pipeline.stats import WordStats

# Basic usage with default settings
word_stats = WordStats(
    output_folder="s3://my-bucket/stats/word_stats",
)

# Custom thresholds and language
word_stats = WordStats(
    output_folder="/data/stats/word_stats",
    short_word_max_chars_threshold=[2, 3, 4],
    long_word_max_chars_threshold=[7, 10],
    language="fr",
    groups_to_compute=["summary", "histogram"],
)
```