Implementation:Huggingface Datatrove SentenceStats

From Leeroopedia
Knowledge Sources
Domains: Data Quality, Statistics
Last Updated: 2026-02-14 17:00 GMT

Overview

SentenceStats is a statistics pipeline step that computes sentence-level structural metrics for documents, including sentence counts, average lengths, and short/long sentence ratios.

Description

SentenceStats extends BaseStats to analyze the sentence structure of documents. Unlike paragraph statistics, which split on newlines, sentence detection relies on a language-aware word tokenizer: the tokenizer is loaded via load_word_tokenizer, and its sent_tokenize method is used to find sentence boundaries, which gives more linguistically accurate results than a fixed delimiter.

The class computes four categories of metrics: n_sentences (total count of non-empty sentences), avg_sentence_length (mean character length of sentences), short_sentence_ratio_{chars} (fraction of sentences at or below a character threshold), and long_sentence_ratio_{chars} (fraction of sentences at or above a character threshold). Both threshold lists are configurable, defaulting to [20] for short sentences and [75] for long sentences.
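The four metrics can be sketched in plain Python as follows. This is a minimal illustration, not the library's code: the regex splitter stands in for the language-aware sent_tokenize call, and the function name sentence_metrics and the empty-document guard are assumptions made for the example.

```python
import re


def sentence_metrics(text, short_thresholds=(20,), long_thresholds=(75,)):
    """Compute sentence-level metrics for one text (illustrative sketch)."""
    # Stand-in splitter (assumption): break after ., ! or ? followed by whitespace.
    # A real language-aware tokenizer would replace this line.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    n = len(sentences)
    if n == 0:
        # Guard chosen for this sketch so the ratios never divide by zero.
        return {"n_sentences": 0}

    lengths = [len(s) for s in sentences]
    stats = {
        "n_sentences": n,
        "avg_sentence_length": sum(lengths) / n,
    }
    # One key per threshold: a sentence is "short" at or below the threshold...
    for chars in short_thresholds:
        stats[f"short_sentence_ratio_{chars}"] = sum(l <= chars for l in lengths) / n
    # ...and "long" at or above the threshold.
    for chars in long_thresholds:
        stats[f"long_sentence_ratio_{chars}"] = sum(l >= chars for l in lengths) / n
    return stats


result = sentence_metrics(
    "Hi. This is a somewhat longer sentence here.",
    short_thresholds=(20,),
    long_thresholds=(30,),
)
```

Because each threshold produces its own key, passing several thresholds yields several ratio columns for the same document.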

The language parameter controls which sentence tokenizer is used, defaulting to English. This is important because sentence boundary rules differ significantly across languages (e.g., period usage, quotation conventions, abbreviation handling).
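To make the language dependence concrete, here is a deliberately naive sketch in which only the set of sentence-ending characters changes per language. The TERMINATORS table, the language codes, and the naive_sentences helper are hypothetical and exist only for this example; real tokenizers apply much richer rules (abbreviations, quotations, trained models).

```python
import re

# Hypothetical per-language terminator sets, for illustration only.
TERMINATORS = {
    "en": ".!?",
    "zh": "。！？",
    "hi": "।!?",
}


def naive_sentences(text: str, lang: str) -> list[str]:
    """Split text right after any terminator character configured for `lang`."""
    # Zero-width lookbehind keeps the terminator attached to its sentence.
    pattern = "(?<=[" + re.escape(TERMINATORS[lang]) + "])"
    parts = re.split(pattern, text)
    return [p.strip() for p in parts if p.strip()]


chinese = "今天天气很好。我们去公园吧！"
with_zh_rules = naive_sentences(chinese, "zh")
with_en_rules = naive_sentences(chinese, "en")
```

With the Chinese terminators the text splits into two sentences, while the English set finds no boundary at all, so every sentence metric for this document would differ depending on the configured language.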

Usage

Use SentenceStats when you need to profile the sentence-level structure of documents. It is useful for detecting documents with abnormal sentence distributions, such as those dominated by very short fragments (navigation text, lists) or unusually long sentences (run-on text, poorly extracted content).
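One way such profiling might be applied is sketched below: a small check over a per-document stats dictionary that flags the two abnormal cases mentioned above. The function flag_sentence_profile, its flag labels, and the 0.5 cutoffs are assumptions for illustration; they are not part of datatrove, and suitable cutoffs depend on the corpus.

```python
def flag_sentence_profile(
    stats: dict[str, float],
    short_key: str = "short_sentence_ratio_20",
    long_key: str = "long_sentence_ratio_75",
    max_short_ratio: float = 0.5,
    max_long_ratio: float = 0.5,
) -> list[str]:
    """Return labels for suspicious sentence distributions (hypothetical helper)."""
    flags = []
    # Many very short sentences often indicate navigation text or lists.
    if stats.get(short_key, 0.0) > max_short_ratio:
        flags.append("mostly_short_fragments")
    # Many very long sentences often indicate run-on or badly extracted text.
    if stats.get(long_key, 0.0) > max_long_ratio:
        flags.append("mostly_long_sentences")
    return flags


menu_like = {"n_sentences": 10, "short_sentence_ratio_20": 0.9, "long_sentence_ratio_75": 0.0}
flags = flag_sentence_profile(menu_like)
```

The default keys match the default thresholds ([20] and [75]); with custom thresholds, pass the corresponding key names.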

Code Reference

Source Location

Signature

class SentenceStats(BaseStats):
    name = "🈂️ Sentence stats"

    def __init__(
        self,
        output_folder: DataFolderLike,
        short_sentence_max_chars_threshold: list[int] | None = None,
        long_sentence_max_chars_threshold: list[int] | None = None,
        language: str = Languages.english,
        histogram_round_digits: int = 3,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

Import

from datatrove.pipeline.stats.sentence_stats import SentenceStats

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved |
| short_sentence_max_chars_threshold | list[int] or None | No | Character thresholds for classifying short sentences (default: [20]) |
| long_sentence_max_chars_threshold | list[int] or None | No | Character thresholds for classifying long sentences (default: [75]) |
| language | str | No | Language for sentence tokenization (default: Languages.english) |
| histogram_round_digits | int | No | Decimal digits for histogram rounding (default: 3) |
| groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all groups) |
| top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG) |

Outputs

| Name | Type | Description |
|---|---|---|
| n_sentences | int | Total number of non-empty sentences in the document |
| avg_sentence_length | float | Average character length of sentences |
| short_sentence_ratio_{chars} | float | Fraction of sentences with length at or below the threshold |
| long_sentence_ratio_{chars} | float | Fraction of sentences with length at or above the threshold |

Usage Examples

Basic Usage

from datatrove.pipeline.stats.sentence_stats import SentenceStats

stats = SentenceStats(
    output_folder="output/stats/",
)

Custom Thresholds and Language

from datatrove.pipeline.stats.sentence_stats import SentenceStats
from datatrove.utils.typeshelper import Languages

stats = SentenceStats(
    output_folder="output/stats/",
    short_sentence_max_chars_threshold=[10, 20, 50],
    long_sentence_max_chars_threshold=[100, 200],
    language=Languages.german,
)
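Since every threshold yields its own ratio, the custom configuration above should produce seven per-document metrics. The helper below simply spells out those names from the `short_sentence_ratio_{chars}` and `long_sentence_ratio_{chars}` patterns; expected_stat_keys is a hypothetical function written for this example, not a datatrove API.

```python
def expected_stat_keys(short_thresholds, long_thresholds):
    """List the metric names implied by the given thresholds (hypothetical helper)."""
    keys = ["n_sentences", "avg_sentence_length"]
    keys += [f"short_sentence_ratio_{chars}" for chars in short_thresholds]
    keys += [f"long_sentence_ratio_{chars}" for chars in long_thresholds]
    return keys


keys = expected_stat_keys([10, 20, 50], [100, 200])
```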
