
Implementation:Huggingface Datatrove WordStats

From Leeroopedia


Overview

WordStats is a pipeline step that computes word-level statistical metrics for each document in a DocumentsPipeline. It extends BaseStats and produces per-rank JSON output files containing aggregated word-level statistics.

Signature

class WordStats(BaseStats):
    def __init__(
        self,
        output_folder: DataFolderLike,
        stop_words: list[str] = STOP_WORDS,
        short_word_max_chars_threshold: list[int] | None = None,
        long_word_max_chars_threshold: list[int] | None = None,
        language: str = Languages.english,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:

Import

from datatrove.pipeline.stats import WordStats

Parameters

Parameter Type Default Description
output_folder DataFolderLike (required) Path or data folder where per-rank statistics JSON files will be written.
stop_words list[str] STOP_WORDS List of stop words used for computing the stop word ratio. Defaults to the Gopher quality filter stop word list.
short_word_max_chars_threshold list[int] | None None (defaults to [3]) Character length thresholds for short word ratio computation. A word with length <= threshold is counted as short.
long_word_max_chars_threshold list[int] | None None (defaults to [7]) Character length thresholds for long word ratio computation. A word with length >= threshold is counted as long.
language str Languages.english Language identifier used to load the appropriate word tokenizer.
groups_to_compute list[GROUP] ["summary", "histogram", "fqdn", "suffix"] List of grouping strategies for aggregating statistics.
histogram_round_digits int 3 Number of decimal digits to round histogram bin values to, controlling histogram granularity.
top_k_config TopKConfig DEFAULT_TOP_K_CONFIG Configuration for top-k truncation of high-cardinality groups (fqdn, suffix). Default retains top 100,000 keys.
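To illustrate what top-k truncation of a high-cardinality group means in practice, the sketch below keeps only the most frequent keys by document count. The function name `truncate_top_k` and the use of `collections.Counter` are illustrative stand-ins, not datatrove's actual implementation.

```python
from collections import Counter

def truncate_top_k(doc_counts: dict[str, int], top_k: int) -> set[str]:
    """Keep only the top_k keys by document count -- an illustrative
    stand-in for TopKConfig-based truncation of fqdn/suffix groups."""
    return {key for key, _ in Counter(doc_counts).most_common(top_k)}

fqdn_counts = {"example.com": 500, "blog.example.org": 120, "rare-site.net": 3}
kept = truncate_top_k(fqdn_counts, top_k=2)
# Only the two most frequent domains survive; rare-site.net is dropped.
```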

Available Stats

The following statistics are extracted per document via the extract_stats method:

Stat Name Description
n_words Number of words in the document.
avg_word_length Average character length of words: sum(len(w) for w in words) / len(words).
avg_words_per_line Average number of words per line: len(words) / len(lines).
short_word_ratio_{chars} Ratio of words with length <= {chars} characters. One stat per threshold in short_word_max_chars_threshold.
long_word_ratio_{chars} Ratio of words with length >= {chars} characters. One stat per threshold in long_word_max_chars_threshold.
type_token_ratio Type-Token Ratio: len(set(words)) / len(words).
uppercase_word_ratio Ratio of words that are entirely uppercase (word.isupper()).
capitalized_word_ratio Ratio of words in title case (word.istitle()).
stop_word_ratio Ratio of words found in the configured stop word list.
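The stats above can be sketched in plain Python as follows. This is a simplified illustration: the `word_stats` helper is hypothetical, and whitespace splitting stands in for the language-aware tokenizer that WordStats actually loads.

```python
def word_stats(text: str, stop_words: set[str],
               short_thresholds=(3,), long_thresholds=(7,)) -> dict:
    # Whitespace split is a stand-in for load_word_tokenizer(language).
    words = text.split()
    lines = text.splitlines()
    n = len(words)
    stats = {
        "n_words": n,
        "avg_word_length": sum(len(w) for w in words) / n,
        "avg_words_per_line": n / len(lines),
        "type_token_ratio": len(set(words)) / n,
        "uppercase_word_ratio": sum(w.isupper() for w in words) / n,
        "capitalized_word_ratio": sum(w.istitle() for w in words) / n,
        "stop_word_ratio": sum(w.lower() in stop_words for w in words) / n,
    }
    # One ratio stat per configured threshold, matching the
    # short_word_ratio_{chars} / long_word_ratio_{chars} naming scheme.
    for t in short_thresholds:
        stats[f"short_word_ratio_{t}"] = sum(len(w) <= t for w in words) / n
    for t in long_thresholds:
        stats[f"long_word_ratio_{t}"] = sum(len(w) >= t for w in words) / n
    return stats

s = word_stats("The cat sat\non the mat", stop_words={"the", "on"})
```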

I/O

Input: DocumentsPipeline -- a stream of Document objects. Each document must have a text attribute; its metadata must also contain a url field when the fqdn or suffix groupings are enabled.

Output: Per-rank JSON files written to output_folder/{group}/{stat_name}/{rank:05d}.json. Each JSON file contains a MetricStatsDict mapping keys to MetricStats objects (tracking n, mean, variance, min, max). Documents are yielded downstream with extracted stats added to doc.metadata.
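The per-rank output layout can be reproduced with a simple f-string; the helper name below is hypothetical, but the `{rank:05d}` zero-padding matches the path pattern described above.

```python
def stats_output_path(group: str, stat_name: str, rank: int) -> str:
    # Mirrors the {group}/{stat_name}/{rank:05d}.json layout under output_folder.
    return f"{group}/{stat_name}/{rank:05d}.json"

path = stats_output_path("fqdn", "n_words", 3)
# → "fqdn/n_words/00003.json"
```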

Key Implementation Details

  • Word tokenization uses load_word_tokenizer(self.language) to load a language-specific tokenizer rather than naive whitespace splitting.
  • Lines are obtained via doc.text.splitlines() for the avg_words_per_line computation.
  • The stop word list defaults to STOP_WORDS imported from the Gopher quality filter module.
  • Statistics are aggregated using MetricStats which implements Welford's online algorithm for numerically stable running mean and variance.
  • High-cardinality groups (fqdn, suffix) are truncated to the top-k keys (by document count) to manage memory in distributed settings.
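The Welford-style aggregation mentioned above can be sketched as a minimal running accumulator. `RunningStats` is an illustrative analogue of MetricStats (tracking n, mean, variance, min, max), not datatrove's actual class.

```python
class RunningStats:
    """Welford's online algorithm: numerically stable running mean and
    variance in a single pass, plus min/max tracking."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def variance(self) -> float:
        # Population variance; returns 0.0 before any updates.
        return self.m2 / self.n if self.n else 0.0

rs = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    rs.update(x)
```

The single-pass formulation avoids the catastrophic cancellation that the naive sum-of-squares approach suffers when values are large relative to their spread.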

Example Usage

from datatrove.pipeline.stats import WordStats

# Basic usage with default settings
word_stats = WordStats(
    output_folder="s3://my-bucket/stats/word_stats",
)

# Custom thresholds and language
word_stats = WordStats(
    output_folder="/data/stats/word_stats",
    short_word_max_chars_threshold=[2, 3, 4],
    long_word_max_chars_threshold=[7, 10],
    language="fr",
    groups_to_compute=["summary", "histogram"],
)
