Implementation:Huggingface Datatrove DocStats

Overview

DocStats is a pipeline step that computes character-composition metrics (length plus whitespace, digit, uppercase, ellipsis, and punctuation ratios) for each document in a DocumentsPipeline. It extends BaseStats and writes per-rank JSON output files containing the aggregated statistics.

Signature

class DocStats(BaseStats):
    def __init__(
        self,
        output_folder: DataFolderLike,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:

Import

from datatrove.pipeline.stats import DocStats

Parameters

Parameter	Type	Default	Description
`output_folder`	`DataFolderLike`	(required)	Path or data folder where per-rank statistics JSON files will be written.
`groups_to_compute`	`list[GROUP]`	`["summary", "histogram", "fqdn", "suffix"]`	List of grouping strategies for aggregating statistics.
`histogram_round_digits`	`int`	`3`	Number of decimal digits to round histogram bin values to, controlling histogram granularity.
`top_k_config`	`TopKConfig`	`DEFAULT_TOP_K_CONFIG`	Configuration for top-k truncation of high-cardinality groups (fqdn, suffix). Default retains top 100,000 keys.
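
To illustrate how `histogram_round_digits` controls granularity, the following is a minimal sketch that buckets stat values by rounding. It assumes a histogram key is simply the rounded value rendered as a string; `histogram_key` is a hypothetical helper written for this illustration, not a datatrove function.

```python
from collections import Counter


def histogram_key(value: float, round_digits: int = 3) -> str:
    """Hypothetical helper: bucket a stat value by rounding it."""
    return str(round(value, round_digits))


# Example ratio values from four imaginary documents.
ratios = [0.1234, 0.1236, 0.1291, 0.4567]

# Three digits keeps all four values in separate bins.
fine = Counter(histogram_key(r, 3) for r in ratios)

# Two digits merges 0.1234 and 0.1236 into the same "0.12" bin.
coarse = Counter(histogram_key(r, 2) for r in ratios)
```

Fewer digits give fewer, wider bins and therefore smaller output files, at the cost of resolution.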

Available Stats

The following statistics are extracted per document via the extract_stats method:

Stat Name	Description
`length`	Total number of characters in the document: `len(doc.text)`.
`white_space_ratio`	Fraction of characters that are whitespace: characters where `c.isspace()` is true.
`non_alpha_digit_ratio`	Fraction of characters that are neither alphabetic nor numeric: characters where both `c.isalpha()` and `c.isdigit()` are false.
`digit_ratio`	Fraction of characters that are digits: characters where `c.isdigit()` is true.
`uppercase_ratio`	Fraction of characters that are uppercase: characters where `c.isupper()` is true.
`elipsis_ratio`	Fraction of characters belonging to ellipsis patterns (`...` or the Unicode ellipsis character `…`), matched via a compiled regex. The stat name keeps the `elipsis` spelling of the `ELIPSIS` constant.
`punctuation_ratio`	Fraction of characters belonging to punctuation marks listed in `PUNCTUATION`, matched via a compiled regex.
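
The character-class ratios in the table can be sketched with the standard library alone. This is an illustration of the definitions above, not the library's code; `char_ratios` is a hypothetical name, and the sketch assumes non-empty text (an empty string would divide by zero).

```python
def char_ratios(text: str) -> dict[str, float]:
    """Hypothetical helper mirroring the per-character stats in the table."""
    n = len(text)  # shared denominator; assumes n > 0
    return {
        "length": n,
        "white_space_ratio": sum(c.isspace() for c in text) / n,
        # Neither alphabetic nor a digit: spaces, punctuation, symbols.
        "non_alpha_digit_ratio": sum(
            not c.isalpha() and not c.isdigit() for c in text
        ) / n,
        "digit_ratio": sum(c.isdigit() for c in text) / n,
        "uppercase_ratio": sum(c.isupper() for c in text) / n,
    }


stats = char_ratios("Hello World 42!")
```

For the 15-character sample, two characters are whitespace, two are digits, two are uppercase, and three (two spaces and `!`) are neither alphabetic nor numeric.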

I/O

Input: DocumentsPipeline -- a stream of Document objects. Each document must have a text attribute and metadata containing a url field (required for fqdn/suffix grouping).

Output: Per-rank JSON files written to output_folder/{group}/{stat_name}/{rank:05d}.json. Each JSON file contains a MetricStatsDict mapping keys to MetricStats objects. Documents are yielded downstream with extracted stats added to doc.metadata.
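
The output layout can be made concrete with a small sketch that builds the relative path for one group, stat, and rank. `stat_file_path` is a hypothetical helper for illustration; only the path template comes from the description above.

```python
def stat_file_path(group: str, stat_name: str, rank: int) -> str:
    """Hypothetical helper: relative path of one per-rank stats file."""
    # The rank is zero-padded to five digits, e.g. 7 -> "00007".
    return f"{group}/{stat_name}/{rank:05d}.json"


path = stat_file_path("summary", "length", 7)

# One file per (group, stat) pair for a single rank.
groups = ["summary", "histogram"]
stat_names = ["length", "digit_ratio"]
all_paths = [stat_file_path(g, s, 0) for g in groups for s in stat_names]
```

Each rank writes its own file, so results from parallel workers do not collide and can be merged afterwards.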

Key Implementation Details

DocStats has the simplest constructor of the stats classes, with no domain-specific configuration parameters beyond those inherited from BaseStats.
Ellipsis matching uses a precompiled regex that matches both ... (three periods) and the Unicode ellipsis character. The ratio is computed as the total number of characters consumed by all matches divided by document length.
Punctuation matching uses a precompiled regex built from the PUNCTUATION characters imported from datatrove.utils.text. Like the ellipsis ratio, it counts the total characters consumed by matches.
All ratios use len(doc.text) as the denominator, making them directly comparable across documents of different lengths.
The ELIPSIS constant is defined at module level as ["...", "\u2026"], matching both ASCII and Unicode ellipsis representations.
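
The regex-based ratios described above can be sketched as follows. The `ELIPSIS` list matches the constant described here; the way the alternation pattern is assembled, the `matched_char_ratio` helper, and the use of `string.punctuation` as a stand-in for datatrove's `PUNCTUATION` are assumptions made for illustration.

```python
import re
import string

ELIPSIS = ["...", "\u2026"]

# Alternation of escaped literals; non-capturing groups make findall()
# return the full matched string for each hit.
elipsis_regex = re.compile("|".join(f"(?:{re.escape(e)})" for e in ELIPSIS))

# Stand-in only: ASCII punctuation, not datatrove's PUNCTUATION.
punc_regex = re.compile(
    "|".join(f"(?:{re.escape(p)})" for p in string.punctuation)
)


def matched_char_ratio(pattern: re.Pattern, text: str) -> float:
    """Characters consumed by all matches, divided by text length."""
    return sum(len(m) for m in pattern.findall(text)) / len(text)


text = "Wait... what\u2026"
elipsis_ratio = matched_char_ratio(elipsis_regex, text)
punctuation_ratio = matched_char_ratio(punc_regex, text)
```

Because the ratio counts matched characters rather than matches, the three-period form contributes three characters while the single Unicode ellipsis contributes one.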

Example Usage

from datatrove.pipeline.stats import DocStats

# Basic usage with default settings
doc_stats = DocStats(
    output_folder="s3://my-bucket/stats/doc_stats",
)

# Only compute summary and histogram groups
doc_stats = DocStats(
    output_folder="/data/stats/doc_stats",
    groups_to_compute=["summary", "histogram"],
    histogram_round_digits=2,
)
