Implementation: Hugging Face Datatrove WordStats
Overview
WordStats is a pipeline step that computes word-level statistical metrics for each document in a `DocumentsPipeline`. It extends `BaseStats` and writes per-rank JSON output files containing aggregated word-level statistics.
Signature
```python
class WordStats(BaseStats):
    def __init__(
        self,
        output_folder: DataFolderLike,
        stop_words: list[str] = STOP_WORDS,
        short_word_max_chars_threshold: list[int] | None = None,
        long_word_max_chars_threshold: list[int] | None = None,
        language: str = Languages.english,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
```
Import
```python
from datatrove.pipeline.stats import WordStats
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `output_folder` | `DataFolderLike` | (required) | Path or data folder where per-rank statistics JSON files will be written. |
| `stop_words` | `list[str]` | `STOP_WORDS` | List of stop words used for computing the stop word ratio. Defaults to the Gopher quality filter stop word list. |
| `short_word_max_chars_threshold` | `list[int] \| None` | `None` (falls back to `[3]`) | Character length thresholds for short word ratio computation. A word with length <= threshold is counted as short. |
| `long_word_max_chars_threshold` | `list[int] \| None` | `None` (falls back to `[7]`) | Character length thresholds for long word ratio computation. A word with length >= threshold is counted as long. |
| `language` | `str` | `Languages.english` | Language identifier used to load the appropriate word tokenizer. |
| `groups_to_compute` | `list[GROUP]` | `["summary", "histogram", "fqdn", "suffix"]` | List of grouping strategies for aggregating statistics. |
| `histogram_round_digits` | `int` | `3` | Number of decimal digits histogram bin values are rounded to, controlling histogram granularity. |
| `top_k_config` | `TopKConfig` | `DEFAULT_TOP_K_CONFIG` | Configuration for top-k truncation of high-cardinality groups (`fqdn`, `suffix`). The default retains the top 100,000 keys. |
Available Stats
The following statistics are extracted per document via the `extract_stats` method:
| Stat Name | Description |
|---|---|
| `n_words` | Number of words in the document. |
| `avg_word_length` | Average character length of words: `sum(len(w) for w in words) / len(words)`. |
| `avg_words_per_line` | Average number of words per line: `len(words) / len(lines)`. |
| `short_word_ratio_{chars}` | Ratio of words with length <= `{chars}` characters. One stat per threshold in `short_word_max_chars_threshold`. |
| `long_word_ratio_{chars}` | Ratio of words with length >= `{chars}` characters. One stat per threshold in `long_word_max_chars_threshold`. |
| `type_token_ratio` | Type-token ratio: `len(set(words)) / len(words)`. |
| `uppercase_word_ratio` | Ratio of words that are entirely uppercase (`word.isupper()`). |
| `capitalized_word_ratio` | Ratio of words in title case (`word.istitle()`). |
| `stop_word_ratio` | Ratio of words found in the configured stop word list. |
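To make the formulas above concrete, here is a simplified sketch of the per-document metrics. It uses naive whitespace splitting instead of the library's language-specific tokenizer, and `word_metrics` is a hypothetical helper, not part of datatrove:

```python
def word_metrics(text: str, stop_words: set[str],
                 short_max: int = 3, long_max: int = 7) -> dict[str, float]:
    """Simplified per-document word stats (assumes a non-empty document)."""
    words = text.split()        # the real step uses load_word_tokenizer(...)
    lines = text.splitlines()
    n = len(words)
    return {
        "n_words": n,
        "avg_word_length": sum(len(w) for w in words) / n,
        "avg_words_per_line": n / len(lines),
        f"short_word_ratio_{short_max}": sum(len(w) <= short_max for w in words) / n,
        f"long_word_ratio_{long_max}": sum(len(w) >= long_max for w in words) / n,
        "type_token_ratio": len(set(words)) / n,
        "uppercase_word_ratio": sum(w.isupper() for w in words) / n,
        "capitalized_word_ratio": sum(w.istitle() for w in words) / n,
        "stop_word_ratio": sum(w.lower() in stop_words for w in words) / n,
    }
```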
I/O
Input: `DocumentsPipeline`, a stream of `Document` objects. Each document must have a `text` attribute and `metadata` containing a `url` field (required for `fqdn`/`suffix` grouping).
Output: per-rank JSON files written to `output_folder/{group}/{stat_name}/{rank:05d}.json`. Each JSON file contains a `MetricStatsDict` mapping keys to `MetricStats` objects (tracking `n`, mean, variance, min, max). Documents are yielded downstream with the extracted stats added to `doc.metadata`.
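For illustration, the documented output layout can be expressed as a small path helper (`stats_output_path` is a hypothetical function, not part of the library):

```python
def stats_output_path(output_folder: str, group: str, stat_name: str, rank: int) -> str:
    # Mirrors the documented layout: output_folder/{group}/{stat_name}/{rank:05d}.json
    return f"{output_folder}/{group}/{stat_name}/{rank:05d}.json"
```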
Key Implementation Details
- Word tokenization uses `load_word_tokenizer(self.language)` to load a language-specific tokenizer rather than naive whitespace splitting.
- Lines are obtained via `doc.text.splitlines()` for the `avg_words_per_line` computation.
- The stop word list defaults to `STOP_WORDS`, imported from the Gopher quality filter module.
- Statistics are aggregated using `MetricStats`, which implements Welford's online algorithm for numerically stable running mean and variance.
- High-cardinality groups (`fqdn`, `suffix`) are truncated to the top-k keys (by document count) to manage memory in distributed settings.
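The Welford update mentioned above can be sketched as follows; this is a minimal standalone version for intuition, not datatrove's actual `MetricStats` class:

```python
class RunningStats:
    """Welford's online algorithm: single-pass, numerically stable mean/variance."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses both old and new mean

    @property
    def variance(self) -> float:
        # Population variance; return 0.0 before two samples are seen.
        return self.m2 / self.n if self.n > 1 else 0.0
```

The key property is that values never need to be buffered, which is what makes per-rank aggregation over large document streams cheap.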
Example Usage
```python
from datatrove.pipeline.stats import WordStats

# Basic usage with default settings
word_stats = WordStats(
    output_folder="s3://my-bucket/stats/word_stats",
)

# Custom thresholds and language
word_stats = WordStats(
    output_folder="/data/stats/word_stats",
    short_word_max_chars_threshold=[2, 3, 4],
    long_word_max_chars_threshold=[7, 10],
    language="fr",
    groups_to_compute=["summary", "histogram"],
)
```