Workflow: HuggingFace Datatrove Summary Statistics
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Data_Analysis |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Two-stage pipeline for computing distributed dataset summary statistics (word, line, document, language, token, perplexity metrics) and merging per-shard results into consolidated metric files.
Description
This workflow profiles a text dataset by computing a comprehensive set of statistics at the word, line, paragraph, sentence, document, and token levels. It operates in two phases. In the distributed compute phase, each parallel task reads a shard of documents, optionally samples a fraction of them, and accumulates statistics into groupings (summary, FQDN, URL suffix, or histogram). In the merge phase, per-task statistics are consolidated into final metric JSON files. The statistics framework supports top-k tracking per grouping and outputs structured MetricStatsDict objects for downstream analysis.
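The per-group accumulation and top-k tracking described above can be sketched as a small accumulator. This is a minimal illustration loosely modeled on what a MetricStatsDict provides (mean, total, count, min, max per group); the class names here are assumptions, not datatrove's actual API.

```python
import heapq
from collections import defaultdict


class GroupStats:
    """Running statistics for one group: total, count, min, max, mean."""

    def __init__(self):
        self.total = 0.0
        self.n = 0
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, value):
        self.total += value
        self.n += 1
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def mean(self):
        return self.total / self.n if self.n else 0.0


class StatsDict:
    """Accumulator keyed by group (e.g. an FQDN), with top-k filtering."""

    def __init__(self):
        self.groups = defaultdict(GroupStats)

    def add(self, group, value):
        self.groups[group].update(value)

    def top_k(self, k):
        # Keep the k groups with the most accumulated documents.
        return dict(heapq.nlargest(k, self.groups.items(), key=lambda kv: kv[1].n))
```

A "summary" grouping is simply the degenerate case where every value lands in one global group.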
Usage
Execute this workflow when you need to profile a dataset before or after processing to understand its characteristics: text lengths, word distributions, language composition, quality indicators, or to compare datasets. Sampling is supported for large datasets where computing exact statistics would be too expensive.
Execution Steps
Step 1: Read Input Data
Load the dataset to profile from local storage, S3, or the HuggingFace Hub. Documents are distributed across parallel tasks for concurrent statistics collection. An optional sampling filter can reduce the number of documents processed for large datasets.
Key considerations:
- Use SamplerFilter to process a random subset of a large dataset
- The reader distributes files across tasks via shard-based splitting
- Input can be JSONL, Parquet, or any supported reader format
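The considerations above can be sketched in plain Python: a round-robin shard split plus a seeded random sampler. This mirrors the behavior of shard-based splitting and a SamplerFilter, but the exact strategy datatrove uses internally may differ.

```python
import random


def shard_files(files, rank, world_size):
    """Round-robin assignment of input files to one parallel task (rank)."""
    return files[rank::world_size]


def sample_docs(docs, rate, seed=0):
    """Keep roughly `rate` of the documents; a fixed seed makes the
    sampled subset reproducible across reruns of the same task."""
    rng = random.Random(seed)
    return [doc for doc in docs if rng.random() < rate]
```

With 10 input files and 4 tasks, rank 0 processes files 0, 4, and 8, so every file is read by exactly one task.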
Step 2: Compute Statistics
Each task iterates over its shard of documents and computes the configured statistics. Multiple stat blocks can run in the same pipeline to collect different metric categories simultaneously. Each stat block writes per-task JSON files to the output folder, organized by grouping type (summary, FQDN, suffix, histogram) and stat name.
Available stat modules:
- WordStats: word count, average word length, stop word ratio, type-token ratio
- LineStats: line count, average line length, duplicate line ratios, bullet point ratios
- DocStats: document length, whitespace ratio, digit ratio, uppercase ratio, punctuation ratio
- ParagraphStats: paragraph count, average length, short/long paragraph ratios
- SentenceStats: sentence count, average length, short/long sentence ratios
- TokenStats: token count using a HuggingFace tokenizer
- LangStats: FastText language identification scores
- PerplexityStats: KenLM perplexity scores
- ContaminationStats: contamination word frequency
Key considerations:
- Top-k configuration controls how many top groups are tracked per metric
- Groupings include: summary (global), fqdn (domain), suffix (URL path), histogram (value-based)
- Per-task JSON files follow the pattern: output_folder/{grouping}/{stat_name}/{rank}.json
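A compute stage combining steps 1 and 2 might be configured as below. This is a hedged sketch: the class and parameter names follow recent datatrove versions as I understand them and may differ in yours, and `my-dataset/`, `stats/`, and `logs/compute/` are placeholder paths.

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import SamplerFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.stats import DocStats, LineStats, TopKConfig, WordStats

# Bound how many fqdn/suffix groups each stat block tracks per task.
top_k = TopKConfig(top_k_groups=["fqdn", "suffix"], top_k=10_000)

compute = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("my-dataset/"),   # local path, s3://, or Hub data
        SamplerFilter(rate=0.1),      # optional: profile a 10% sample
        # Several stat blocks run in the same pass over each shard:
        WordStats(output_folder="stats/", top_k_config=top_k),
        LineStats(output_folder="stats/", top_k_config=top_k),
        DocStats(output_folder="stats/", top_k_config=top_k),
    ],
    tasks=4,
    logging_dir="logs/compute/",
)
# compute.run() launches the distributed compute phase; each of the 4
# tasks then writes stats/{grouping}/{stat_name}/{rank}.json files.
```

Swapping LocalPipelineExecutor for a cluster executor distributes the same pipeline across jobs without changing the stat blocks.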
Step 3: Merge Statistics
A merge step reads all per-task statistics files for each metric and combines them into a single consolidated metric.json file per stat per grouping. The merger supports top-k filtering to keep only the most significant groups in the merged output. This stage depends on the compute stage completing.
Key considerations:
- Top-k can differ between compute and merge phases (e.g., compute top 10000, merge top 8000)
- Output is organized as: output_folder/{grouping}/{stat_name}/metric.json
- MetricStatsDict provides mean, total, count, min, max for each group
- Optional removal of input per-task files after successful merge