Workflow: HuggingFace Datatrove Summary Statistics
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Data_Analysis |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Two-stage pipeline for computing distributed dataset summary statistics (word, line, document, language, token, perplexity metrics) and merging per-shard results into consolidated metric files.
Description
This workflow profiles a text dataset by computing a comprehensive set of statistics at the word, line, paragraph, sentence, document, and token levels. It operates in two phases. In the distributed compute phase, each parallel task reads a shard of documents, optionally samples a fraction of them, and accumulates statistics into groupings (summary, FQDN, URL suffix, or histogram). In the merge phase, per-task statistics are consolidated into final metric JSON files. The statistics framework supports top-k tracking per grouping and outputs structured MetricStatsDict objects for downstream analysis.
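The per-group accumulation and top-k tracking described above can be sketched as a small accumulator. This is a minimal illustration loosely modeled on what a MetricStatsDict provides (mean, total, count, min, max per group); the class names here are assumptions, not datatrove's actual API.

```python
import heapq
from collections import defaultdict


class GroupStats:
    """Running statistics for one group: total, count, min, max, mean."""

    def __init__(self):
        self.total = 0.0
        self.n = 0
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, value):
        self.total += value
        self.n += 1
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def mean(self):
        return self.total / self.n if self.n else 0.0


class StatsDict:
    """Accumulator keyed by group (e.g. an FQDN), with top-k filtering."""

    def __init__(self):
        self.groups = defaultdict(GroupStats)

    def add(self, group, value):
        self.groups[group].update(value)

    def top_k(self, k):
        # Keep the k groups with the most accumulated documents.
        return dict(heapq.nlargest(k, self.groups.items(), key=lambda kv: kv[1].n))
```

A "summary" grouping is simply the degenerate case where every value lands in one global group.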
Usage
Execute this workflow when you need to profile a dataset before or after processing to understand its characteristics: text lengths, word distributions, language composition, quality indicators, or to compare datasets. Sampling is supported for large datasets where computing exact statistics would be too expensive.
Execution Steps
Step 1: Read Input Data
Load the dataset to profile from local storage, S3, or the HuggingFace Hub. Documents are distributed across parallel tasks for concurrent statistics collection. An optional sampling filter can reduce the number of documents processed for large datasets.
Key considerations:
- Use SamplerFilter to process a random subset of a large dataset
- The reader distributes files across tasks via shard-based splitting
- Input can be JSONL, Parquet, or any supported reader format
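The considerations above can be sketched in plain Python: a round-robin shard split plus a seeded random sampler. This mirrors the behavior of shard-based splitting and a SamplerFilter, but the exact strategy datatrove uses internally may differ.

```python
import random


def shard_files(files, rank, world_size):
    """Round-robin assignment of input files to one parallel task (rank)."""
    return files[rank::world_size]


def sample_docs(docs, rate, seed=0):
    """Keep roughly `rate` of the documents; a fixed seed makes the
    sampled subset reproducible across reruns of the same task."""
    rng = random.Random(seed)
    return [doc for doc in docs if rng.random() < rate]
```

With 10 input files and 4 tasks, rank 0 processes files 0, 4, and 8, so every file is read by exactly one task.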
Step 2: Compute Statistics
Each task iterates over its shard of documents and computes the configured statistics. Multiple stat blocks can run in the same pipeline to collect different metric categories simultaneously. Each stat block writes per-task JSON files to the output folder, organized by grouping type (summary, FQDN, suffix, histogram) and stat name.
Available stat modules:
- WordStats: word count, average word length, stop word ratio, type-token ratio
- LineStats: line count, average line length, duplicate line ratios, bullet point ratios
- DocStats: document length, whitespace ratio, digit ratio, uppercase ratio, punctuation ratio
- ParagraphStats: paragraph count, average length, short/long paragraph ratios
- SentenceStats: sentence count, average length, short/long sentence ratios
- TokenStats: token count using a HuggingFace tokenizer
- LangStats: FastText language identification scores
- PerplexityStats: KenLM perplexity scores
- ContaminationStats: contamination word frequency
Key considerations:
- Top-k configuration controls how many top groups are tracked per metric
- Groupings include: summary (global), fqdn (domain), suffix (URL path), histogram (value-based)
- Per-task JSON files follow the pattern: output_folder/{grouping}/{stat_name}/{rank}.json
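A compute stage combining steps 1 and 2 might be configured as below. This is a hedged sketch: the class and parameter names follow recent datatrove versions as I understand them and may differ in yours, and `my-dataset/`, `stats/`, and `logs/compute/` are placeholder paths.

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import SamplerFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.stats import DocStats, LineStats, TopKConfig, WordStats

# Bound how many fqdn/suffix groups each stat block tracks per task.
top_k = TopKConfig(top_k_groups=["fqdn", "suffix"], top_k=10_000)

compute = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("my-dataset/"),   # local path, s3://, or Hub data
        SamplerFilter(rate=0.1),      # optional: profile a 10% sample
        # Several stat blocks run in the same pass over each shard:
        WordStats(output_folder="stats/", top_k_config=top_k),
        LineStats(output_folder="stats/", top_k_config=top_k),
        DocStats(output_folder="stats/", top_k_config=top_k),
    ],
    tasks=4,
    logging_dir="logs/compute/",
)
# compute.run() launches the distributed compute phase; each of the 4
# tasks then writes stats/{grouping}/{stat_name}/{rank}.json files.
```

Swapping LocalPipelineExecutor for a cluster executor distributes the same pipeline across jobs without changing the stat blocks.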
Step 3: Merge Statistics
A merge step reads all per-task statistics files for each metric and combines them into a single consolidated metric.json file per stat per grouping. The merger supports top-k filtering to keep only the most significant groups in the merged output. This stage depends on the compute stage completing.
Key considerations:
- Top-k can differ between compute and merge phases (e.g., compute top 10000, merge top 8000)
- Output is organized as: output_folder/{grouping}/{stat_name}/metric.json
- MetricStatsDict provides mean, total, count, min, max for each group
- Optional removal of input per-task files after successful merge