Implementation:Huggingface Datatrove DocStats
Overview
DocStats is a pipeline step that computes document-level character composition metrics for each document in a DocumentsPipeline. It extends BaseStats and produces per-rank JSON output files containing aggregated document-level statistics.
Signature
class DocStats(BaseStats):
    def __init__(
        self,
        output_folder: DataFolderLike,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
Import
from datatrove.pipeline.stats import DocStats
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| output_folder | DataFolderLike | (required) | Path or data folder where per-rank statistics JSON files will be written. |
| groups_to_compute | list[GROUP] | ["summary", "histogram", "fqdn", "suffix"] | List of grouping strategies for aggregating statistics. |
| histogram_round_digits | int | 3 | Number of decimal digits to round histogram bin values to, controlling histogram granularity. |
| top_k_config | TopKConfig | DEFAULT_TOP_K_CONFIG | Configuration for top-k truncation of high-cardinality groups (fqdn, suffix). Default retains top 100,000 keys. |
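To illustrate how histogram_round_digits controls granularity, here is a minimal sketch. It assumes histogram keys are formed by rounding each stat value to the configured number of digits and counting occurrences per rounded key. The helper histogram_counts is hypothetical and not part of datatrove.

```python
from collections import Counter


def histogram_counts(values: list[float], round_digits: int = 3) -> Counter:
    """Bucket values by their rounded string form; fewer digits gives coarser bins."""
    # Assumption: each histogram key is the rounded value rendered as a string.
    return Counter(str(round(v, round_digits)) for v in values)


ratios = [0.1234, 0.1236, 0.1291, 0.5]
fine = histogram_counts(ratios, round_digits=3)    # four distinct bins
coarse = histogram_counts(ratios, round_digits=2)  # 0.1234 and 0.1236 share a bin
```

Lowering the digit count merges nearby values into the same bin, which reduces the number of keys written per stat at the cost of resolution.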
Available Stats
The following statistics are extracted per document via the extract_stats method:
| Stat Name | Description |
|---|---|
| length | Total number of characters in the document: len(doc.text). |
| white_space_ratio | Fraction of characters that are whitespace: characters where c.isspace() is true. |
| non_alpha_digit_ratio | Fraction of characters that are neither alphabetic nor numeric: characters where both c.isalpha() and c.isdigit() are false. |
| digit_ratio | Fraction of characters that are digits: characters where c.isdigit() is true. |
| uppercase_ratio | Fraction of characters that are uppercase: characters where c.isupper() is true. |
| elipsis_ratio | Fraction of characters belonging to ellipsis patterns (... or the Unicode ellipsis character), matched via compiled regex. |
| punctuation_ratio | Fraction of characters belonging to punctuation marks from the PUNCTUATION set, matched via compiled regex. |
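The character-class stats in the table can be sketched in plain Python as follows. This is an illustration of the definitions above, not the library's code. The function name char_ratios is hypothetical, and the sketch assumes a non-empty string, since an empty one would divide by zero.

```python
def char_ratios(text: str) -> dict[str, float]:
    """Compute the character-class stats for a non-empty string."""
    n = len(text)  # shared denominator for every ratio
    return {
        "length": n,
        "white_space_ratio": sum(c.isspace() for c in text) / n,
        # Neither alphabetic nor a digit: whitespace, punctuation, symbols.
        "non_alpha_digit_ratio": sum(not c.isalpha() and not c.isdigit() for c in text) / n,
        "digit_ratio": sum(c.isdigit() for c in text) / n,
        "uppercase_ratio": sum(c.isupper() for c in text) / n,
    }


# "Hello 42!" has 9 characters: 1 space, 2 digits, 1 uppercase letter, 1 punctuation mark.
stats = char_ratios("Hello 42!")
```

Note that whitespace counts toward non_alpha_digit_ratio, so that ratio is always at least as large as white_space_ratio.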
I/O
Input: DocumentsPipeline -- a stream of Document objects. Each document must have a text attribute and metadata containing a url field (required for fqdn/suffix grouping).
Output: Per-rank JSON files written to output_folder/{group}/{stat_name}/{rank:05d}.json. Each JSON file contains a MetricStatsDict mapping keys to MetricStats objects. Documents are yielded downstream with extracted stats added to doc.metadata.
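As a small illustration of the output layout described above, the sketch below builds the path of one per-rank file relative to output_folder. The helper stat_output_path is hypothetical and only mirrors the {group}/{stat_name}/{rank:05d}.json pattern; it does not call into datatrove.

```python
def stat_output_path(group: str, stat_name: str, rank: int) -> str:
    """Return the path of one per-rank stats file, relative to output_folder."""
    # The rank is zero-padded to five digits, so rank 3 becomes "00003".
    return f"{group}/{stat_name}/{rank:05d}.json"


# Files written by rank 3 for the "length" stat under two groups.
paths = [stat_output_path(g, "length", 3) for g in ("summary", "histogram")]
```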
Key Implementation Details
- DocStats has the simplest constructor of the stats classes, with no domain-specific configuration parameters beyond those inherited from BaseStats.
- Ellipsis matching uses a precompiled regex that matches both ... (three periods) and the Unicode ellipsis character. The ratio is computed as the total number of characters consumed by all matches divided by document length.
- Punctuation matching uses a precompiled regex built from the PUNCTUATION list imported from datatrove.utils.text. Like the ellipsis ratio, it counts total characters consumed by matches.
- All ratios use len(doc.text) as the denominator, making them directly comparable across documents of different lengths.
- The ELIPSIS constant is defined at module level as ["...", "\u2026"], matching both ASCII and Unicode ellipsis representations.
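The regex-based ratios described in this list can be sketched as below. This is a minimal illustration under stated assumptions: SAMPLE_PUNCTUATION is a hypothetical stand-in for the real PUNCTUATION list in datatrove.utils.text, which is larger, and the helper names build_alternation and matched_char_ratio are invented for this example.

```python
import re

ELIPSIS = ["...", "\u2026"]
# Hypothetical subset for illustration only; not the library's PUNCTUATION list.
SAMPLE_PUNCTUATION = [".", ",", "!", "?", ";", ":"]


def build_alternation(tokens: list[str]) -> re.Pattern:
    """Compile one regex that matches any of the literal tokens."""
    return re.compile("|".join(f"(?:{re.escape(t)})" for t in tokens))


def matched_char_ratio(pattern: re.Pattern, text: str) -> float:
    """Total characters consumed by all matches, divided by the text length."""
    return sum(len(m) for m in pattern.findall(text)) / len(text)


elipsis_regex = build_alternation(ELIPSIS)
punc_regex = build_alternation(SAMPLE_PUNCTUATION)

text = "Wait... what\u2026"  # 13 characters
elipsis_ratio = matched_char_ratio(elipsis_regex, text)   # "..." (3) + "\u2026" (1)
punctuation_ratio = matched_char_ratio(punc_regex, text)  # three "." matches
```

Because the ratio counts consumed characters rather than matches, a three-period ellipsis contributes three characters while the single Unicode ellipsis contributes one.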
Example Usage
from datatrove.pipeline.stats import DocStats

# Basic usage with default settings
doc_stats = DocStats(
    output_folder="s3://my-bucket/stats/doc_stats",
)

# Only compute summary and histogram groups
doc_stats = DocStats(
    output_folder="/data/stats/doc_stats",
    groups_to_compute=["summary", "histogram"],
    histogram_round_digits=2,
)