Implementation:Huggingface Datatrove BaseStats

Knowledge Sources	Huggingface_Datatrove
Domains	Data Quality, Statistics
Last Updated	2026-02-14 17:00 GMT

Overview

BaseStats is the abstract base class for all statistics computation pipeline steps in Datatrove, providing the framework for extracting, grouping, and persisting document-level metrics.

Description

BaseStats extends PipelineStep and defines the core architecture for computing dataset statistics. It introduces an abstract method extract_stats that subclasses must implement to compute specific metrics from a Document. The base class then handles all the infrastructure for grouping, aggregation, and serialization of those metrics.

Statistics are organized into four group types: summary (aggregate totals), histogram (value distributions with configurable rounding), fqdn (grouped by the fully qualified domain name extracted from the document URL), and suffix (grouped by the URL's TLD suffix). For each group, the class maintains a dictionary of MetricStatsDict objects that accumulate running statistics. In the histogram group, each bin records the document count and additionally tracks counts weighted by character count and, when a token count is available, by token count.
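The grouping can be illustrated without Datatrove. The sketch below is a simplified stand-in, not the library's code: the group_key helper, the plain-dict document, and the weight names are hypothetical, and the suffix split is a naive placeholder for public-suffix-aware extraction.

```python
from urllib.parse import urlparse


def group_key(doc: dict, value: float, group: str, round_digits: int = 3):
    """Return (key, value) for one stat under one group (hypothetical helper)."""
    if group == "summary":
        # A single bucket holds the aggregate for the whole dataset.
        return "summary", value
    if group == "histogram":
        # Rounding collapses nearby values into the same bin; each bin counts
        # documents and also sums characters (and tokens, when known).
        weights = {"": 1, "chars": len(doc["text"])}
        if "token_count" in doc["metadata"]:
            weights["tokens"] = doc["metadata"]["token_count"]
        return str(round(value, round_digits)), weights
    host = urlparse(doc["metadata"]["url"]).hostname or ""
    if group == "fqdn":
        return host, value
    if group == "suffix":
        # Naive stand-in: the last label only. A public-suffix-aware parser
        # would return "co.uk" for "example.co.uk".
        return host.rsplit(".", 1)[-1], value
    raise ValueError(f"unknown group: {group}")


doc = {"text": "hello world", "metadata": {"url": "https://blog.example.com/post"}}
summary_kv = group_key(doc, 0.12345, "summary")
histogram_kv = group_key(doc, 0.12345, "histogram")
fqdn_kv = group_key(doc, 0.12345, "fqdn")
suffix_kv = group_key(doc, 0.12345, "suffix")
```

With histogram_round_digits set to 3, a value of 0.12345 lands in the "0.123" bin, so a smaller digit count yields fewer, coarser bins.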

The class also supports a TopKConfig mechanism for memory management. When computing statistics for high-cardinality groups like fqdn or suffix, only the top K keys (by document count) are retained, which prevents unbounded memory growth. Results are serialized to JSON files organized as {group}/{stat_name}/{rank:05d}.json, enabling distributed computation where each worker writes to its own rank file.
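A minimal sketch of the top-K truncation and the per-rank file naming, using only the standard library. The domain names, counts, and the word_count stat are made-up illustration data, and plain dicts stand in for MetricStatsDict.

```python
import heapq

# Hypothetical per-domain document counts accumulated by one worker.
doc_counts = {"a.com": 50, "b.org": 7, "c.net": 120, "d.io": 1}
# Hypothetical per-domain totals for one stat, keyed the same way.
word_totals = {"a.com": 9000, "b.org": 300, "c.net": 40000, "d.io": 12}

top_k = 2
# Keep the K keys with the most documents; everything else is dropped.
kept = heapq.nlargest(top_k, doc_counts, key=lambda k: doc_counts[k])
truncated = {k: word_totals[k] for k in kept}

# Each worker writes under its own zero-padded rank, so files never collide.
rank = 3
path = f"fqdn/word_count/{rank:05d}.json"
```

Ranking by document count rather than by the stat's value means the same set of keys is kept for every stat in a group, so the per-stat files stay comparable.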

Usage

Do not instantiate BaseStats directly. Instead, subclass it and implement the extract_stats method to define the specific metrics to compute. Use it as the foundation for any custom statistics computation step in a Datatrove pipeline.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/stats/base.py
Lines: 1-133

Signature

class BaseStats(PipelineStep):
    type = "📊 - STATS"
    name = "👑 Summary stats"
    _requires_dependencies = ["tldextract"]

    def __init__(
        self,
        output_folder: DataFolderLike,
        groups_to_compute: list[GROUP] | None = None,
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    @abstractmethod
    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

    def get_kv(
        self, doc: Document, value: STAT_TYPE, group_name: GROUP
    ) -> tuple[str, STAT_TYPE | dict[str, STAT_TYPE]]:
        ...

    def run(
        self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1
    ) -> DocumentsPipeline:
        ...

Import

from datatrove.pipeline.stats.base import BaseStats

I/O Contract

Inputs

Name	Type	Required	Description
output_folder	DataFolderLike	Yes	Folder where computed statistics JSON files will be saved
groups_to_compute	list[GROUP] or None	No	List of grouping strategies: "summary", "histogram", "fqdn", "suffix" (default: all)
histogram_round_digits	int	No	Number of decimal digits for rounding histogram bin values (default: 3)
top_k_config	TopKConfig	No	Configuration controlling top-K truncation for high-cardinality groups (default: top 100,000 for fqdn and suffix)

Outputs

Name	Type	Description
documents	DocumentsPipeline	Yields documents with computed statistics added to their metadata
JSON files	Files on disk	Statistics saved as {group}/{stat_name}/{rank:05d}.json in the output folder
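The two outputs can be pictured as a generator that accumulates totals while passing documents through. This is a sketch under simplifying assumptions, not the library's run method: run_stats is a hypothetical name, documents are plain dicts, and only a summary-style total is kept.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def run_stats(
    docs: Iterable[dict],
    extract_stats: Callable[[dict], dict[str, float]],
    totals: dict[str, float],
) -> Iterator[dict]:
    """Accumulate summary totals and pass each document through (sketch)."""
    for doc in docs:
        stats = extract_stats(doc)
        for name, value in stats.items():
            totals[name] += value
        # The computed stats travel with the document to later steps.
        doc["metadata"].update(stats)
        yield doc


totals: dict[str, float] = defaultdict(float)
docs = [{"text": "a b c", "metadata": {}}, {"text": "d e", "metadata": {}}]
out = list(run_stats(docs, lambda d: {"word_count": len(d["text"].split())}, totals))
```

Because the function is a generator, the totals in this sketch are complete only once the caller has consumed every document.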

Usage Examples

Subclassing BaseStats

from datatrove.data import Document
from datatrove.pipeline.stats.base import BaseStats


class WordCountStats(BaseStats):
    name = "Word count stats"

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        return {
            "word_count": len(doc.text.split()),
        }

Using in a Pipeline

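from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader

# WordCountStats is the BaseStats subclass defined in the previous example.
stats_step = WordCountStats(
    output_folder="s3://my-bucket/stats/",
    groups_to_compute=["summary", "histogram"],
)

# Placeholder input path; any reader that yields Documents works here.
executor = LocalPipelineExecutor(
    pipeline=[JsonlReader("s3://my-bucket/input/"), stats_step],
)
executor.run()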

Related Pages

Principle:Huggingface_Datatrove_Statistics_Collection_Framework
