Implementation:Huggingface Datatrove SentenceStats
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Statistics |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
SentenceStats is a statistics pipeline step that computes sentence-level structural metrics for documents, including sentence counts, average lengths, and short/long sentence ratios.
Description
SentenceStats extends BaseStats to analyze the sentence structure of documents. Unlike paragraph statistics, which split text on newlines, sentence detection uses a language-aware word tokenizer: the tokenizer is loaded via load_word_tokenizer, and its sent_tokenize method provides more linguistically accurate sentence boundaries than simple delimiter splitting.
The class computes four categories of metrics: n_sentences (total count of non-empty sentences), avg_sentence_length (mean character length of sentences), short_sentence_ratio_{chars} (fraction of sentences at or below a character threshold), and long_sentence_ratio_{chars} (fraction of sentences at or above a character threshold). Both threshold lists are configurable, defaulting to [20] for short sentences and [75] for long sentences. Each threshold in a list produces its own output key, so the defaults yield short_sentence_ratio_20 and long_sentence_ratio_75.
The language parameter controls which sentence tokenizer is used, defaulting to English. This is important because sentence boundary rules differ significantly across languages (e.g., period usage, quotation conventions, abbreviation handling).
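The four metric categories described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name sentence_metrics is hypothetical, a naive regex split stands in for the language-aware tokenizer, and returning 0 for documents without sentences is an assumption.

```python
import re


def sentence_metrics(text, short_thresholds=(20,), long_thresholds=(75,)):
    """Illustrative re-computation of the metrics; not datatrove code."""
    # Assumption: naive split after ., ! or ? followed by whitespace.
    # The real step uses a language-aware sentence tokenizer instead.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sentences)

    stats = {
        "n_sentences": n,
        # Assumption: fall back to 0 when there are no sentences.
        "avg_sentence_length": sum(len(s) for s in sentences) / n if n else 0,
    }
    # One ratio per short threshold: sentences at or below the threshold.
    for chars in short_thresholds:
        stats[f"short_sentence_ratio_{chars}"] = (
            sum(len(s) <= chars for s in sentences) / n if n else 0
        )
    # One ratio per long threshold: sentences at or above the threshold.
    for chars in long_thresholds:
        stats[f"long_sentence_ratio_{chars}"] = (
            sum(len(s) >= chars for s in sentences) / n if n else 0
        )
    return stats
```

Because the regex splitter ignores abbreviations, quotation conventions, and language-specific punctuation, its counts can differ from those of a language-aware tokenizer on the same text.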
Usage
Use SentenceStats when you need to profile the sentence-level structure of documents. It is useful for detecting documents with abnormal sentence distributions, such as those dominated by very short fragments (navigation text, lists) or unusually long sentences (run-on text, poorly extracted content).
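As a rough illustration of the detection use case above, the sketch below flags a document from its per-document metrics. The function flag_abnormal, the reason labels, and the 0.8 cutoffs are hypothetical choices for this example and are not part of SentenceStats or datatrove; suitable cutoffs depend on the corpus.

```python
def flag_abnormal(
    stats,
    short_key="short_sentence_ratio_20",
    long_key="long_sentence_ratio_75",
    max_short_ratio=0.8,  # hypothetical cutoff
    max_long_ratio=0.8,   # hypothetical cutoff
):
    """Return a list of reasons why a document's sentence profile looks unusual."""
    reasons = []
    if stats.get("n_sentences", 0) == 0:
        reasons.append("no_sentences")
    # Dominated by short fragments, e.g. navigation text or lists.
    if stats.get(short_key, 0) > max_short_ratio:
        reasons.append("mostly_short_fragments")
    # Dominated by long sentences, e.g. run-on or poorly extracted text.
    if stats.get(long_key, 0) > max_long_ratio:
        reasons.append("mostly_long_sentences")
    return reasons
```

An empty list means none of the illustrative checks fired. The key names assume the default thresholds of 20 and 75 characters.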
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/stats/sentence_stats.py
- Lines: 1-69
Signature
class SentenceStats(BaseStats):
    name = "🈂️ Sentence stats"

    def __init__(
        self,
        output_folder: DataFolderLike,
        short_sentence_max_chars_threshold: list[int] | None = None,
        long_sentence_max_chars_threshold: list[int] | None = None,
        language: str = Languages.english,
        histogram_round_digits: int = 3,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...
Import
from datatrove.pipeline.stats.sentence_stats import SentenceStats
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved |
| short_sentence_max_chars_threshold | list[int] or None | No | Character thresholds for classifying short sentences (default: [20]) |
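| long_sentence_max_chars_threshold | list[int] or None | No | Character thresholds for classifying long sentences (default: [75]); despite the "max" in the name, a sentence counts as long when its length is at or above the threshold |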
| language | str | No | Language for sentence tokenization (default: Languages.english) |
| histogram_round_digits | int | No | Decimal digits for histogram rounding (default: 3) |
| groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all groups) |
| top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG) |
Outputs
| Name | Type | Description |
|---|---|---|
| n_sentences | int | Total number of non-empty sentences in the document |
| avg_sentence_length | float | Average character length of sentences |
| short_sentence_ratio_{chars} | float | Fraction of sentences with length at or below the threshold |
| long_sentence_ratio_{chars} | float | Fraction of sentences with length at or above the threshold |
Usage Examples
Basic Usage
from datatrove.pipeline.stats.sentence_stats import SentenceStats

# Default thresholds: [20] for short and [75] for long sentences, English tokenizer
stats = SentenceStats(
    output_folder="output/stats/",
)
Custom Thresholds and Language
from datatrove.pipeline.stats.sentence_stats import SentenceStats
from datatrove.utils.typeshelper import Languages

# Each threshold produces its own output key, e.g. short_sentence_ratio_10
stats = SentenceStats(
    output_folder="output/stats/",
    short_sentence_max_chars_threshold=[10, 20, 50],
    long_sentence_max_chars_threshold=[100, 200],
    language=Languages.german,
)