Implementation:Huggingface Datatrove SentenceStats

Knowledge Sources	Huggingface_Datatrove
Domains	Data Quality, Statistics
Last Updated	2026-02-14 17:00 GMT

Overview

SentenceStats is a statistics pipeline step that computes sentence-level structural metrics for documents, including sentence counts, average lengths, and short/long sentence ratios.

Description

SentenceStats extends BaseStats to analyze the sentence structure of documents. Paragraph statistics split text on newlines, but SentenceStats loads a language-aware word tokenizer via load_word_tokenizer and calls its sent_tokenize method, which detects sentence boundaries more accurately than a fixed delimiter can.

The class computes four categories of metrics: n_sentences (the number of non-empty sentences), avg_sentence_length (the mean sentence length in characters), short_sentence_ratio_{chars} (the fraction of sentences whose length is at or below the threshold chars), and long_sentence_ratio_{chars} (the fraction of sentences whose length is at or above the threshold chars). One ratio is emitted per threshold, so each list may hold several values. The defaults are [20] for short sentences and [75] for long sentences. Note that, despite its name, long_sentence_max_chars_threshold acts as a lower bound on sentence length.
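The four metric families described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: naive_sent_tokenize is a hypothetical regex-based stand-in for the tokenizer's sent_tokenize method, sentence_metrics is a name introduced here, and the handling of empty text is a choice made for this sketch.

```python
import re


def naive_sent_tokenize(text: str) -> list:
    # Hypothetical stand-in for the language-aware tokenizer:
    # split on whitespace that follows ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text)


def sentence_metrics(text: str, short_thresholds=(20,), long_thresholds=(75,)) -> dict:
    # Keep only non-empty sentences, matching the n_sentences definition.
    sentences = [s for s in naive_sent_tokenize(text) if s.strip()]
    n = len(sentences)
    if n == 0:
        # Assumption of this sketch: report only a zero count for empty text.
        return {"n_sentences": 0}

    lengths = [len(s) for s in sentences]
    stats = {"n_sentences": n, "avg_sentence_length": sum(lengths) / n}

    # One ratio per configured threshold; the threshold is part of the key.
    for chars in short_thresholds:
        stats[f"short_sentence_ratio_{chars}"] = sum(length <= chars for length in lengths) / n
    for chars in long_thresholds:
        stats[f"long_sentence_ratio_{chars}"] = sum(length >= chars for length in lengths) / n
    return stats


# Three sentences of 9, 80 and 13 characters.
example = sentence_metrics("Hi there. " + "x" * 79 + ". Fine, thanks!")
```

With the default thresholds, two of the three sentences count as short (length at or below 20) and one counts as long (length at or above 75).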

The language parameter controls which sentence tokenizer is used, defaulting to English. This is important because sentence boundary rules differ significantly across languages (e.g., period usage, quotation conventions, abbreviation handling).
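To see why a language-aware tokenizer matters, consider what a fixed rule does with abbreviations. The sketch below uses a hypothetical naive_sent_split helper (not part of Datatrove) that ends a sentence at every ".", "!" or "?" followed by whitespace. Both sample texts contain two sentences, yet the naive rule over-splits them.

```python
import re


def naive_sent_split(text: str) -> list:
    # Hypothetical rule: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


english = "Dr. Smith arrived. He was late."
german = "Das kostet z. B. 5 Euro. Das ist viel."

# The abbreviations "Dr." and "z. B." are treated as sentence ends.
english_parts = naive_sent_split(english)  # 3 pieces instead of 2
german_parts = naive_sent_split(german)    # 4 pieces instead of 2
```

Over-splitting like this would inflate n_sentences and the short-sentence ratios, which is the kind of distortion that selecting the right language is meant to reduce.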

Usage

Use SentenceStats when you need to profile the sentence-level structure of documents. It is useful for detecting documents with abnormal sentence distributions, such as those dominated by very short fragments (navigation text, lists) or unusually long sentences (run-on text, poorly extracted content).
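SentenceStats only records statistics; deciding what counts as abnormal is left to downstream analysis. As a rough sketch of such a check, the function below flags a per-document stats dictionary that uses the output key names for the default thresholds. The function name flag_abnormal_sentences and the 0.8 cutoffs are assumptions chosen for illustration, not values from Datatrove.

```python
def flag_abnormal_sentences(stats: dict, max_short_ratio: float = 0.8, max_long_ratio: float = 0.8) -> list:
    # Both cutoffs are hypothetical; tune them on your own data.
    reasons = []
    if stats.get("short_sentence_ratio_20", 0.0) > max_short_ratio:
        reasons.append("mostly short fragments")
    if stats.get("long_sentence_ratio_75", 0.0) > max_long_ratio:
        reasons.append("mostly long sentences")
    return reasons


# Made-up stats for a navigation-like page and for ordinary prose.
nav_like = {"n_sentences": 10, "short_sentence_ratio_20": 0.9, "long_sentence_ratio_75": 0.0}
prose_like = {"n_sentences": 10, "short_sentence_ratio_20": 0.2, "long_sentence_ratio_75": 0.3}

nav_flags = flag_abnormal_sentences(nav_like)
prose_flags = flag_abnormal_sentences(prose_like)
```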

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/stats/sentence_stats.py
Lines: 1-69

Signature

class SentenceStats(BaseStats):
    name = "🈂️ Sentence stats"

    def __init__(
        self,
        output_folder: DataFolderLike,
        short_sentence_max_chars_threshold: list[int] | None = None,
        long_sentence_max_chars_threshold: list[int] | None = None,
        language: str = Languages.english,
        histogram_round_digits: int = 3,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

Import

from datatrove.pipeline.stats.sentence_stats import SentenceStats

I/O Contract

Inputs

Name	Type	Required	Description
output_folder	DataFolderLike	Yes	Folder where computed statistics will be saved
short_sentence_max_chars_threshold	list[int] or None	No	Character thresholds for classifying short sentences (default: [20])
long_sentence_max_chars_threshold	list[int] or None	No	Character thresholds for classifying long sentences (default: [75])
language	str	No	Language for sentence tokenization (default: Languages.english)
histogram_round_digits	int	No	Decimal digits for histogram rounding (default: 3)
groups_to_compute	list[GROUP]	No	Grouping strategies for statistics (default: all groups)
top_k_config	TopKConfig	No	Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG)

Outputs

Name	Type	Description
n_sentences	int	Total number of non-empty sentences in the document
avg_sentence_length	float	Average character length of sentences
short_sentence_ratio_{chars}	float	Fraction of sentences with length at or below the threshold
long_sentence_ratio_{chars}	float	Fraction of sentences with length at or above the threshold

Usage Examples

Basic Usage

from datatrove.pipeline.stats.sentence_stats import SentenceStats

# Uses the defaults: English tokenizer, short threshold [20], long threshold [75].
stats = SentenceStats(
    output_folder="output/stats/",
)

Custom Thresholds and Language

from datatrove.pipeline.stats.sentence_stats import SentenceStats
from datatrove.utils.typeshelper import Languages

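stats = SentenceStats(
    output_folder="output/stats/",
    # Emits short_sentence_ratio_10, short_sentence_ratio_20 and short_sentence_ratio_50.
    short_sentence_max_chars_threshold=[10, 20, 50],
    # Emits long_sentence_ratio_100 and long_sentence_ratio_200.
    long_sentence_max_chars_threshold=[100, 200],
    # Selects the German sentence tokenizer.
    language=Languages.german,
)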

Related Pages

Principle:Huggingface_Datatrove_Sentence_Level_Statistics
