Implementation:Huggingface Datatrove BaseStats
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Statistics |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
BaseStats is the abstract base class for all statistics computation pipeline steps in Datatrove, providing the framework for extracting, grouping, and persisting document-level metrics.
Description
BaseStats extends PipelineStep and defines the core architecture for computing dataset statistics. It introduces an abstract method extract_stats that subclasses must implement to compute specific metrics from a Document. The base class then handles all the infrastructure for grouping, aggregation, and serialization of those metrics.
Statistics are organized into four group types: summary (aggregate totals), histogram (value distributions, with values rounded to a configurable number of digits to limit the number of bins), fqdn (grouped by the fully qualified domain name extracted from the document URL), and suffix (grouped by the URL's public suffix, such as com or co.uk). For each group, the class maintains a dictionary of MetricStatsDict objects that accumulate running statistics. In the histogram group, each bin accumulates the document count together with the documents' character counts and, when a token count is available in the document metadata, their token counts.
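The histogram bookkeeping can be sketched in plain Python. This is a simplified standalone illustration, not Datatrove's code: the function name histogram_update, the nested-dict layout, and the __chars / __tokens suffixes are assumptions made for the example.

```python
from collections import defaultdict


def histogram_update(bins, stat_name, value, text, token_count=None, round_digits=3):
    """Add one document's value to histogram bins keyed by the rounded value."""
    # Rounding collapses nearby values into the same bin.
    key = str(round(value, round_digits))
    bins[stat_name][key] += 1                      # plain document count
    bins[f"{stat_name}__chars"][key] += len(text)  # weighted by character count
    if token_count is not None:                    # token weighting is optional
        bins[f"{stat_name}__tokens"][key] += token_count
    return key


# Hypothetical stat name and documents, for illustration only.
bins = defaultdict(lambda: defaultdict(int))
key_a = histogram_update(bins, "avg_word_len", 4.3331, "some text")
key_b = histogram_update(bins, "avg_word_len", 4.3334, "more text here", token_count=3)
```

Both values round to the same three-digit key, so they share one bin; lowering round_digits would merge more values per bin at the cost of resolution.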
The class also supports a TopKConfig mechanism for memory management. For high-cardinality groups like fqdn or suffix, only the top K keys (ranked by document count) are retained, so the number of stored keys per group stays bounded. Results are serialized to JSON files organized as {group}/{stat_name}/{rank:05d}.json, so in a distributed run each worker writes to its own rank file.
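A minimal sketch of top-K truncation by document count, using only the standard library. The helper name keep_top_k and the sample counts are hypothetical, and the real class operates on MetricStatsDict objects rather than plain integers.

```python
import heapq


def keep_top_k(doc_counts: dict[str, int], k: int) -> dict[str, int]:
    """Keep only the k keys with the highest document counts."""
    top_keys = heapq.nlargest(k, doc_counts, key=doc_counts.get)
    return {key: doc_counts[key] for key in top_keys}


# Hypothetical per-domain document counts.
fqdn_counts = {
    "example.com": 120,
    "blog.example.org": 7,
    "docs.python.org": 45,
    "tiny.site": 1,
}
top = keep_top_k(fqdn_counts, k=2)
```

Using a heap-based selection avoids sorting every key when K is much smaller than the number of distinct domains.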
Usage
Do not instantiate BaseStats directly. Instead, subclass it and implement the extract_stats method to define the specific metrics to compute. Use it as the foundation for any custom statistics computation step in a Datatrove pipeline.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/stats/base.py
- Lines: 1-133
Signature
class BaseStats(PipelineStep):
    type = "📊 - STATS"
    name = "👑 Summary stats"
    _requires_dependencies = ["tldextract"]

    def __init__(
        self,
        output_folder: DataFolderLike,
        groups_to_compute: list[GROUP] | None = None,
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    @abstractmethod
    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

    def get_kv(
        self, doc: Document, value: STAT_TYPE, group_name: GROUP
    ) -> tuple[str, STAT_TYPE | dict[str, STAT_TYPE]]:
        ...

    def run(
        self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1
    ) -> DocumentsPipeline:
        ...
Import
from datatrove.pipeline.stats.base import BaseStats
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Folder where computed statistics JSON files will be saved |
| groups_to_compute | list[GROUP] or None | No | List of grouping strategies: "summary", "histogram", "fqdn", "suffix" (default: all) |
| histogram_round_digits | int | No | Number of decimal digits for rounding histogram bin values (default: 3) |
| top_k_config | TopKConfig | No | Configuration controlling top-K truncation for high-cardinality groups (default: top 100,000 for fqdn and suffix) |
Outputs
| Name | Type | Description |
|---|---|---|
| documents | DocumentsPipeline | Yields documents with computed statistics added to their metadata |
| JSON files | Files on disk | Statistics saved as {group}/{stat_name}/{rank:05d}.json in the output folder |
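The output layout can be illustrated with a short sketch that builds the relative path for each worker. The helper stat_file_path and the stat name word_count are hypothetical; only the {group}/{stat_name}/{rank:05d}.json pattern comes from the description above.

```python
def stat_file_path(group: str, stat_name: str, rank: int) -> str:
    """Build the relative path of one worker's JSON file for a given stat."""
    # The rank is zero-padded to five digits, so files sort in rank order.
    return f"{group}/{stat_name}/{rank:05d}.json"


# Paths written by a hypothetical three-worker run for one stat.
paths = [stat_file_path("summary", "word_count", rank) for rank in range(3)]
```

Because every rank writes a separate file under the same {group}/{stat_name} folder, a later merge step only needs to read all files in that folder.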
Usage Examples
Subclassing BaseStats
from datatrove.data import Document
from datatrove.pipeline.stats.base import BaseStats


class WordCountStats(BaseStats):
    name = "Word count stats"

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        # Each key becomes a stat name; each value must be numeric
        return {
            "word_count": len(doc.text.split()),
        }
Using in a Pipeline
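from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader

# WordCountStats is the BaseStats subclass defined in the previous example
stats_step = WordCountStats(
    output_folder="s3://my-bucket/stats/",
    groups_to_compute=["summary", "histogram"],
)

# Hypothetical input location; any reader that yields Documents can feed the stats step
executor = LocalPipelineExecutor(pipeline=[JsonlReader("s3://my-bucket/data/"), stats_step], tasks=1)
executor.run()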