Implementation:Huggingface Datatrove BaseStats

From Leeroopedia
Knowledge Sources
Domains: Data Quality, Statistics
Last Updated: 2026-02-14 17:00 GMT

Overview

BaseStats is the abstract base class for all statistics computation pipeline steps in Datatrove, providing the framework for extracting, grouping, and persisting document-level metrics.

Description

BaseStats extends PipelineStep and defines the core architecture for computing dataset statistics. It introduces an abstract method extract_stats that subclasses must implement to compute specific metrics from a Document. The base class then handles all the infrastructure for grouping, aggregation, and serialization of those metrics.

Statistics are organized into four group types: summary (aggregate totals), histogram (value distributions with configurable rounding), fqdn (grouped by the fully qualified domain name extracted from each document's URL), and suffix (grouped by the URL's TLD suffix). For each group, the class maintains a dictionary of MetricStatsDict objects that accumulate running statistics. In addition to a plain document count per bin, the histogram group tracks counts weighted by character count and, optionally, by token count.
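To make the four groupings concrete, the following is a minimal sketch of how a single metric value could be mapped to a bucket key per group. It is an illustration under stated assumptions, not Datatrove's implementation: the function name group_key is hypothetical, the literal key "summary" is assumed, and the suffix logic is a naive stand-in that uses only the standard library instead of tldextract.

```python
from urllib.parse import urlparse


def group_key(group: str, value: float, url: str, round_digits: int = 3) -> str:
    """Return the bucket key a metric value would fall under (illustrative only)."""
    if group == "summary":
        # One shared bucket: every document contributes to the same aggregate.
        # The literal key name is an assumption made for this sketch.
        return "summary"
    if group == "histogram":
        # Rounding merges nearby values so the number of bins stays manageable.
        return str(round(value, round_digits))
    host = urlparse(url).hostname or ""
    if group == "fqdn":
        return host
    if group == "suffix":
        # Naive stand-in: take the last dot-separated label. A public-suffix-aware
        # extractor such as tldextract would handle multi-label suffixes like "co.uk".
        return host.rsplit(".", 1)[-1]
    raise ValueError(f"unknown group: {group}")


url = "https://blog.example.com/post/1"
keys = {g: group_key(g, 0.123456, url) for g in ("summary", "histogram", "fqdn", "suffix")}
print(keys)
```

With the default of three rounding digits, the value 0.123456 lands in the histogram bin "0.123", while the same document is filed under "blog.example.com" for fqdn and "com" for suffix.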

The class also supports a TopKConfig mechanism for memory management. When computing statistics for high-cardinality groups like fqdn or suffix, only the top K keys (by document count) are retained, which prevents unbounded memory growth. Results are serialized to JSON files organized as {group}/{stat_name}/{rank:05d}.json, enabling distributed computation where each worker writes to its own rank file.
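The top-K truncation described above can be sketched with the standard library. This is a simplified illustration, assuming plain integer document counts per key. The helper keep_top_k and the sample domains are hypothetical and are not part of Datatrove's API.

```python
import heapq
from collections import Counter

# Hypothetical per-domain document counts accumulated by one worker.
doc_counts = Counter({"a.com": 50, "b.org": 7, "c.net": 120, "d.io": 1})


def keep_top_k(counts: dict[str, int], k: int) -> dict[str, int]:
    """Keep only the k keys with the highest document count."""
    top_keys = heapq.nlargest(k, counts, key=lambda key: counts[key])
    return {key: counts[key] for key in top_keys}


truncated = keep_top_k(doc_counts, k=2)
print(truncated)
```

Selecting keys with heapq.nlargest avoids sorting the full key set, which matters when a group such as fqdn holds millions of distinct keys.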

Usage

Do not instantiate BaseStats directly. Instead, subclass it and implement the extract_stats method to define the specific metrics to compute. Use it as the foundation for any custom statistics computation step in a Datatrove pipeline.

Code Reference

Source Location

Signature

class BaseStats(PipelineStep):
    type = "📊 - STATS"
    name = "👑 Summary stats"
    _requires_dependencies = ["tldextract"]

    def __init__(
        self,
        output_folder: DataFolderLike,
        groups_to_compute: list[GROUP] | None = None,
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    @abstractmethod
    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

    def get_kv(
        self, doc: Document, value: STAT_TYPE, group_name: GROUP
    ) -> tuple[str, STAT_TYPE | dict[str, STAT_TYPE]]:
        ...

    def run(
        self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1
    ) -> DocumentsPipeline:
        ...

Import

from datatrove.pipeline.stats.base import BaseStats

I/O Contract

Inputs

Name | Type | Required | Description
output_folder | DataFolderLike | Yes | Folder where computed statistics JSON files will be saved
groups_to_compute | list[GROUP] or None | No | List of grouping strategies: "summary", "histogram", "fqdn", "suffix" (default: all)
histogram_round_digits | int | No | Number of decimal digits for rounding histogram bin values (default: 3)
top_k_config | TopKConfig | No | Configuration controlling top-K truncation for high-cardinality groups (default: top 100,000 for fqdn and suffix)

Outputs

Name | Type | Description
documents | DocumentsPipeline | Yields documents with computed statistics added to their metadata
JSON files | Files on disk | Statistics saved as {group}/{stat_name}/{rank:05d}.json in the output folder
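As a rough illustration of the output layout, the sketch below enumerates the relative paths that the {group}/{stat_name}/{rank:05d}.json pattern would produce for a small run. The helper stat_paths is hypothetical, and the example assumes every listed group emits the same stat names, which may not hold for every group.

```python
def stat_paths(groups: list[str], stat_names: list[str], world_size: int) -> list[str]:
    """List the relative JSON paths written across all ranks (illustrative only)."""
    return [
        # {rank:05d} zero-pads the worker rank to five digits, e.g. 3 -> "00003".
        f"{group}/{stat}/{rank:05d}.json"
        for group in groups
        for stat in stat_names
        for rank in range(world_size)
    ]


paths = stat_paths(["summary", "fqdn"], ["word_count"], world_size=2)
for path in paths:
    print(path)
```

Because each rank writes to a distinct file name, workers never contend for the same file, and a later step can combine the per-rank files for each group and stat.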

Usage Examples

Subclassing BaseStats

from datatrove.data import Document
from datatrove.pipeline.stats.base import BaseStats


class WordCountStats(BaseStats):
    name = "Word count stats"

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        return {
            "word_count": len(doc.text.split()),
        }

Using in a Pipeline

# WordCountStats is the BaseStats subclass defined in the previous example.
# Any BaseStats subclass can be configured this way and added to a pipeline.
stats_step = WordCountStats(
    output_folder="s3://my-bucket/stats/",
    groups_to_compute=["summary", "histogram"],
)
