
Principle:Huggingface Datatrove Word Level Statistics

From Leeroopedia

Overview

Computing word-level statistical metrics for documents to characterize text quality and composition.

Description

Word-level statistics measure properties of the word distribution within documents. These metrics capture fundamental properties of text that correlate with quality, naturalness, and content type. The following metrics are computed per document:

  • Word count (n_words): The total number of words, providing a basic measure of document length at the word level.
  • Average word length (avg_word_length): The mean character length of words, which varies by language and content type (e.g., technical text tends to have longer words).
  • Average words per line (avg_words_per_line): The mean number of words per line, useful for detecting documents with unusual line structure such as keyword lists or tabular data.
  • Short word ratio (short_word_ratio_{chars}): The fraction of words with length at or below a configurable character threshold. High ratios may indicate boilerplate or formulaic content.
  • Long word ratio (long_word_ratio_{chars}): The fraction of words with length at or above a configurable character threshold. Extremely high ratios may indicate URL-heavy or concatenated text.
  • Type-Token Ratio (type_token_ratio): The ratio of unique words (types) to total words (tokens), serving as a measure of lexical diversity. Low TTR suggests repetitive vocabulary.
  • Stop word ratio (stop_word_ratio): The fraction of words that are stop words (common function words such as "the", "is", "at"). Natural language text typically has a characteristic stop word frequency; very low ratios may indicate non-natural content.
  • Uppercase word ratio (uppercase_word_ratio): The fraction of words that are entirely uppercase, which may indicate shouting, acronyms, or headers.
  • Capitalized word ratio (capitalized_word_ratio): The fraction of words in title case (first letter capitalized), useful for detecting title-heavy or proper-noun-heavy content.

These metrics collectively help identify outlier documents and characterize dataset composition.
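The metrics above can be sketched in plain Python. This is an illustrative implementation, not the DataTrove one: it uses naive whitespace splitting (the real pipeline uses language-specific tokenizers, as described below), and the thresholds and the tiny stop word set are assumptions chosen for the example.

```python
# Illustrative per-document word-level statistics.
# Assumptions: whitespace tokenization, short/long thresholds of 3/7 chars,
# and a tiny stop word subset (real pipelines use full per-language lists).

STOP_WORDS = {"the", "is", "at", "a", "an", "of", "and", "in", "to", "on"}

def word_stats(text: str, short_chars: int = 3, long_chars: int = 7) -> dict:
    words = text.split()  # naive; see the tokenizer discussion below
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = len(words)
    if n == 0:
        return {}
    return {
        "n_words": n,
        "avg_word_length": sum(len(w) for w in words) / n,
        "avg_words_per_line": n / max(len(lines), 1),
        f"short_word_ratio_{short_chars}": sum(len(w) <= short_chars for w in words) / n,
        f"long_word_ratio_{long_chars}": sum(len(w) >= long_chars for w in words) / n,
        "type_token_ratio": len({w.lower() for w in words}) / n,
        "stop_word_ratio": sum(w.lower() in STOP_WORDS for w in words) / n,
        "uppercase_word_ratio": sum(w.isupper() for w in words) / n,
        "capitalized_word_ratio": sum(w.istitle() for w in words) / n,
    }

stats = word_stats("The cat sat on the mat.\nNASA launched Apollo.")
```

Each returned value is a per-document metric that a downstream aggregator can fold into corpus-level summaries.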

Usage

Word-level statistics are computed as part of the summary statistics pipeline for dataset quality analysis. Each document is processed individually, and the resulting per-document metrics are aggregated across the corpus using online (streaming) statistics. The output can be merged across distributed workers using the Statistics Merging principle.

Typical use cases include:

  • Identifying outlier documents with abnormal word distributions
  • Characterizing the composition of large web-crawl datasets
  • Establishing thresholds for downstream quality filtering
  • Comparing datasets or dataset versions quantitatively

Theoretical Basis

The Type-Token Ratio (TTR) is a classical measure of lexical diversity from computational linguistics. It is computed as:

TTR = |unique words| / |total words|

TTR values closer to 1.0 indicate high lexical diversity (many unique words), while values closer to 0.0 indicate high repetition. Note that TTR is sensitive to document length: longer documents tend to have lower TTR because the number of unique words grows sub-linearly with document length.
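The length sensitivity is easy to demonstrate: concatenating a document with itself doubles the token count without adding any new types, halving the TTR.

```python
# TTR length sensitivity: duplicating a document halves its TTR,
# since tokens double while types stay fixed.

def ttr(text: str) -> float:
    words = text.lower().split()
    return len(set(words)) / len(words)

doc = "the quick brown fox jumps over the lazy dog"  # 9 tokens, 8 types
t_single = ttr(doc)                # 8/9
t_doubled = ttr(doc + " " + doc)   # 8/18, exactly half
```

For this reason, TTR values are most meaningful when compared across documents of similar length.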

Stop word frequency serves as a natural language indicator. Stop words are the most common function words in a language (articles, prepositions, conjunctions). Natural prose typically has a stop word ratio in a characteristic range (roughly 0.4-0.6 for English). Documents with very low stop word ratios often contain non-natural content such as code, keywords, or machine-generated text.
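A minimal sketch of this signal, using a small illustrative stop word subset (real pipelines use full per-language lists): natural prose lands well inside the characteristic range, while keyword-list text scores near zero.

```python
# Stop word ratio as a naturalness signal.
# Assumption: this tiny stop word set is a stand-in for a full English list.

STOP_WORDS = {"the", "is", "at", "a", "an", "of", "and", "in", "to", "on", "it"}

def stop_word_ratio(text: str) -> float:
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in STOP_WORDS for w in words) / len(words)

prose = "The cat sat on the mat in the garden."          # ~0.56
keywords = "cheap flights hotels deals booking discount"  # 0.0
```

A filter built on this metric would flag the second document as likely non-natural content.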

Per-document statistics are tracked using Welford's online algorithm for computing running mean and variance in a single pass, enabling efficient aggregation without storing raw values. The aggregated (n, mean, variance) tuples can later be combined across shards using the parallel variance formula.
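The two pieces above can be sketched together: Welford's update folds one value at a time into an (n, mean, M2) aggregate, and the parallel variance formula merges two such aggregates. Variable names here are illustrative, not DataTrove's.

```python
# Welford's single-pass mean/variance plus the parallel (Chan et al.) merge
# used to combine per-shard aggregates without storing raw values.

class OnlineStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's update: numerically stable, one pass, O(1) memory.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "OnlineStats") -> None:
        # Parallel variance formula: combine two (n, mean, M2) aggregates.
        n = self.n + other.n
        delta = other.mean - self.mean
        self.mean = (self.n * self.mean + other.n * other.mean) / n
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.n = n

    @property
    def variance(self) -> float:  # population variance
        return self.m2 / self.n if self.n else 0.0
```

Because `merge` is associative, each worker can aggregate its shard independently and the results can be combined in any order, which is what makes distributed statistics merging possible.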

Word splitting relies on language-specific tokenizers loaded via load_word_tokenizer(language), ensuring accurate word boundaries for the target language rather than naive whitespace splitting.
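The difference matters because punctuation sticks to whitespace-split tokens and distorts word lengths and type counts. The regex tokenizer below is only a minimal stand-in for the language-specific tokenizers that `load_word_tokenizer(language)` returns; it is not the DataTrove tokenizer.

```python
# Naive whitespace splitting vs. a minimal regex tokenizer (a stand-in for
# a proper language-specific tokenizer; not DataTrove's implementation).

import re

def naive_split(text: str) -> list[str]:
    return text.split()

def regex_words(text: str) -> list[str]:
    return re.findall(r"\w+", text)

text = "Hello, world! It's done."
naive_split(text)  # ['Hello,', 'world!', "It's", 'done.'] - punctuation attached
regex_words(text)  # ['Hello', 'world', 'It', 's', 'done'] - punctuation stripped
```

Note that even the regex version mishandles contractions ("It's" becomes two tokens), which is why per-language tokenizers are preferred for accurate word boundaries.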
