Implementation:Huggingface Datatrove DocStats

From Leeroopedia

Overview

DocStats is a pipeline step that computes character-composition metrics for each document in a DocumentsPipeline. It extends BaseStats and writes per-rank JSON files containing the aggregated statistics.

Signature

class DocStats(BaseStats):
    def __init__(
        self,
        output_folder: DataFolderLike,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:

Import

from datatrove.pipeline.stats import DocStats

Parameters

Parameter | Type | Default | Description
--- | --- | --- | ---
output_folder | DataFolderLike | (required) | Path or data folder where per-rank statistics JSON files will be written.
groups_to_compute | list[GROUP] | ["summary", "histogram", "fqdn", "suffix"] | List of grouping strategies for aggregating statistics.
histogram_round_digits | int | 3 | Number of decimal digits to round histogram bin values to, controlling histogram granularity.
top_k_config | TopKConfig | DEFAULT_TOP_K_CONFIG | Configuration for top-k truncation of high-cardinality groups (fqdn, suffix). Default retains top 100,000 keys.
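
The effect of histogram_round_digits can be illustrated with a small standalone sketch. It assumes that a histogram key is simply the stat value rounded to the configured number of digits; the real grouping logic lives in BaseStats and is not shown on this page, so histogram_key below is a hypothetical helper, not part of datatrove.

```python
from collections import Counter


def histogram_key(value: float, round_digits: int = 3) -> str:
    """Hypothetical helper: bucket a stat value by rounding it.

    Assumption: histogram bins are keyed by the rounded value.
    """
    return str(round(value, round_digits))


# Four example ratio values, as might be produced for four documents.
ratios = [0.1234, 0.1236, 0.1241, 0.2]

# Three digits keep nearby values in separate bins ...
bins_3 = Counter(histogram_key(r, 3) for r in ratios)
# ... while two digits merge them into coarser bins.
bins_2 = Counter(histogram_key(r, 2) for r in ratios)

print(bins_3)
print(bins_2)
```

Fewer digits give fewer, wider bins and smaller output files; more digits preserve detail at the cost of more keys.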

Available Stats

The following statistics are extracted per document via the extract_stats method:

Stat Name | Description
--- | ---
length | Total number of characters in the document: len(doc.text).
white_space_ratio | Fraction of characters for which c.isspace() is true.
non_alpha_digit_ratio | Fraction of characters that are neither alphabetic nor numeric, that is, for which both c.isalpha() and c.isdigit() are false.
digit_ratio | Fraction of characters for which c.isdigit() is true.
uppercase_ratio | Fraction of characters for which c.isupper() is true.
elipsis_ratio | Fraction of characters that belong to an ellipsis (three periods, or the Unicode ellipsis character), matched via a compiled regex. The stat name is spelled with a single "l".
punctuation_ratio | Fraction of characters that belong to punctuation marks from the PUNCTUATION list, matched via a compiled regex.
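
As a rough illustration of the character-based stats in the table above, the following standalone sketch recomputes them for a plain string. The function name char_ratios and the empty-text guard are assumptions made for this example; they are not taken from the datatrove source.

```python
def char_ratios(text: str) -> dict[str, float]:
    """Hypothetical helper mirroring the character-based stats listed above."""
    n = len(text)
    if n == 0:
        # Assumption: this guard is added for the sketch only, to avoid
        # dividing by zero on an empty document.
        raise ValueError("text must be non-empty")
    return {
        "length": n,
        "white_space_ratio": sum(c.isspace() for c in text) / n,
        "non_alpha_digit_ratio": sum(
            not c.isalpha() and not c.isdigit() for c in text
        ) / n,
        "digit_ratio": sum(c.isdigit() for c in text) / n,
        "uppercase_ratio": sum(c.isupper() for c in text) / n,
    }


stats = char_ratios("Hi 42!")
print(stats)
```

For "Hi 42!" the six characters contain one space, two digits, one uppercase letter, and two characters (the space and "!") that are neither alphabetic nor numeric.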

I/O

Input: DocumentsPipeline -- a stream of Document objects. Each document must have a text attribute and metadata containing a url field (required for fqdn/suffix grouping).

Output: Per-rank JSON files written to output_folder/{group}/{stat_name}/{rank:05d}.json. Each JSON file contains a MetricStatsDict mapping keys to MetricStats objects. Documents are yielded downstream with extracted stats added to doc.metadata.
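
The output layout described above can be sketched as a small path-building function. The name stat_output_path is hypothetical; the sketch only reproduces the {group}/{stat_name}/{rank:05d}.json pattern relative to output_folder and does not call any datatrove API.

```python
def stat_output_path(group: str, stat_name: str, rank: int) -> str:
    """Hypothetical helper: relative path of one per-rank stats file."""
    # {rank:05d} zero-pads the rank to five digits, e.g. 7 -> "00007".
    return f"{group}/{stat_name}/{rank:05d}.json"


print(stat_output_path("summary", "length", 7))
print(stat_output_path("histogram", "digit_ratio", 123))
```

Because each rank writes its own file, parallel workers never contend for the same output path.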

Key Implementation Details

  • DocStats has the simplest constructor of the stats classes, with no domain-specific configuration parameters beyond those inherited from BaseStats.
  • Ellipsis matching uses a precompiled regex that matches both ... (three periods) and the Unicode ellipsis character. The ratio is computed as the total number of characters consumed by all matches divided by document length.
  • Punctuation matching uses a precompiled regex built from the PUNCTUATION list imported from datatrove.utils.text. Like the ellipsis ratio, it counts total characters consumed by matches.
  • All ratios use len(doc.text) as the denominator, making them directly comparable across documents of different lengths.
  • The ELIPSIS constant is defined at module level as ["...", "\u2026"], matching both ASCII and Unicode ellipsis representations.
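
The regex-based ratios described in the list above can be sketched as follows. This is a minimal standalone approximation: ELIPSIS matches the constant quoted above, but PUNCTUATION_SAMPLE is a hypothetical stand-in for the real PUNCTUATION list in datatrove.utils.text, and the helper names are invented for this example.

```python
import re

ELIPSIS = ["...", "\u2026"]
# Assumption: a tiny stand-in for the real PUNCTUATION list.
PUNCTUATION_SAMPLE = ["!", "?", ",", "."]


def compile_alternation(patterns: list[str]) -> re.Pattern:
    """Build one regex matching any of the literal patterns."""
    return re.compile("|".join(f"(?:{re.escape(p)})" for p in patterns))


def match_char_ratio(regex: re.Pattern, text: str) -> float:
    """Characters consumed by all matches, divided by len(text).

    The text is assumed to be non-empty.
    """
    return sum(len(m) for m in regex.findall(text)) / len(text)


elipsis_regex = compile_alternation(ELIPSIS)
punc_regex = compile_alternation(PUNCTUATION_SAMPLE)

# "..." consumes 3 characters and the Unicode ellipsis 1, out of 13.
elipsis_ratio = match_char_ratio(elipsis_regex, "Wait... what\u2026")
# "," and "!" consume 2 of the 8 characters.
punctuation_ratio = match_char_ratio(punc_regex, "Hi, you!")

print(elipsis_ratio, punctuation_ratio)
```

Counting consumed characters rather than the number of matches means a three-period ellipsis weighs three times as much as the single Unicode character, which keeps the ratio on the same per-character scale as the other stats.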

Example Usage

from datatrove.pipeline.stats import DocStats

# Basic usage with default settings
doc_stats = DocStats(
    output_folder="s3://my-bucket/stats/doc_stats",
)

# Only compute summary and histogram groups
doc_stats = DocStats(
    output_folder="/data/stats/doc_stats",
    groups_to_compute=["summary", "histogram"],
    histogram_round_digits=2,
)
