
Workflow:Huggingface Datatrove Summary Statistics

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP, Data_Analysis
Last Updated 2026-02-14 17:00 GMT

Overview

Two-stage pipeline for computing distributed dataset summary statistics (word, line, document, language, token, perplexity metrics) and merging per-shard results into consolidated metric files.

Description

This workflow profiles a text dataset by computing a comprehensive set of statistics at the word, line, paragraph, sentence, document, and token levels. It operates in two phases. In the distributed compute phase, each parallel task reads a shard of documents, optionally samples a fraction of them, and accumulates statistics into groupings (summary, FQDN, URL suffix, or histogram). In the merge phase, the per-task statistics are consolidated into final metric JSON files. The statistics framework supports top-k tracking per grouping and outputs structured MetricStatsDict objects for downstream analysis.

Usage

Execute this workflow when you need to profile a dataset before or after processing to understand its characteristics: text lengths, word distributions, language composition, quality indicators, or to compare datasets. Sampling is supported for large datasets where computing exact statistics would be too expensive.

Execution Steps

Step 1: Read Input Data

Load the dataset to profile from local storage, S3, or the HuggingFace Hub. Documents are distributed across parallel tasks for concurrent statistics collection. An optional sampling filter can reduce the number of documents processed for large datasets.

Key considerations:

  • Use SamplerFilter to process a random subset of a large dataset
  • The reader distributes files across tasks via shard-based splitting
  • Input can be JSONL, Parquet, or any supported reader format
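The sharding and sampling above can be sketched in plain Python. This is a simplified illustration, not DataTrove's actual reader code; `shard_files` and `sample_docs` are hypothetical helper names:

```python
import random

def shard_files(files, rank, world_size):
    """Round-robin assignment of input files to parallel tasks:
    task `rank` gets every `world_size`-th file."""
    return files[rank::world_size]

def sample_docs(docs, rate, seed=0):
    """Keep each document with probability `rate` (independent
    Bernoulli sampling, as a SamplerFilter-style step would do)."""
    rng = random.Random(seed)
    return [d for d in docs if rng.random() < rate]
```

With 4 tasks, task 1 would read files 1, 5, 9, … of the sorted input listing, so no two tasks ever touch the same file.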

Step 2: Compute Statistics

Each task iterates over its shard of documents and computes the configured statistics. Multiple stat blocks can run in the same pipeline to collect different metric categories simultaneously. Each stat block writes per-task JSON files to the output folder, organized by grouping type (summary, FQDN, suffix, histogram) and stat name.

Available stat modules:

  • WordStats: word count, average word length, stop word ratio, type-token ratio
  • LineStats: line count, average line length, duplicate line ratios, bullet point ratios
  • DocStats: document length, whitespace ratio, digit ratio, uppercase ratio, punctuation ratio
  • ParagraphStats: paragraph count, average length, short/long paragraph ratios
  • SentenceStats: sentence count, average length, short/long sentence ratios
  • TokenStats: token count using a HuggingFace tokenizer
  • LangStats: FastText language identification scores
  • PerplexityStats: KenLM perplexity scores
  • ContaminationStats: contamination word frequency
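To make the word-level metrics concrete, here is a minimal sketch of the kind of quantities a WordStats-style block computes. Whitespace tokenization and the tiny stop-word set are simplifications; the real module uses a proper word tokenizer and a language-specific stop-word list, and `word_stats` is a hypothetical name:

```python
def word_stats(text, stop_words=frozenset({"the", "a", "an", "and", "of", "to"})):
    """Compute a few word-level metrics for one document:
    word count, average word length, stop-word ratio, type-token ratio."""
    words = text.split()  # simplification: whitespace tokenization
    n = len(words)
    if n == 0:
        return {"n_words": 0, "avg_word_length": 0.0,
                "stop_word_ratio": 0.0, "type_token_ratio": 0.0}
    return {
        "n_words": n,
        "avg_word_length": sum(len(w) for w in words) / n,
        "stop_word_ratio": sum(w.lower() in stop_words for w in words) / n,
        "type_token_ratio": len({w.lower() for w in words}) / n,
    }

stats = word_stats("The quick brown fox jumps over the lazy dog")
```

Per-document values like these are then accumulated into the configured groupings rather than kept individually.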

Key considerations:

  • Top-k configuration controls how many top groups are tracked per metric
  • Groupings include: summary (global), fqdn (domain), suffix (URL path), histogram (value-based)
  • Per-task JSON files follow the pattern: output_folder/{grouping}/{stat_name}/{rank}.json
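The accumulation-by-grouping and the per-task file layout can be sketched as follows. These are hypothetical helpers with stdlib code only; the real stat blocks accumulate richer MetricStatsDict state than a running total and count:

```python
import json
from collections import defaultdict
from pathlib import Path

def accumulate(docs, grouping_key):
    """Accumulate a running (total, n) per group, e.g. per FQDN,
    or everything under a single 'summary' group."""
    groups = defaultdict(lambda: {"total": 0.0, "n": 0})
    for doc in docs:
        g = groups[grouping_key(doc)]
        g["total"] += doc["value"]
        g["n"] += 1
    return dict(groups)

def write_task_stats(output_folder, grouping, stat_name, rank, stats):
    """Write one task's results following the
    output_folder/{grouping}/{stat_name}/{rank}.json pattern."""
    path = Path(output_folder) / grouping / stat_name / f"{rank}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(stats))
    return str(path)
```

Because every task writes its own `{rank}.json`, the compute phase needs no coordination between tasks; the merge phase resolves everything afterwards.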

Step 3: Merge Statistics

A merge step reads all per-task statistics files for each metric and combines them into a single consolidated metric.json file per stat per grouping. The merger supports top-k filtering to keep only the most significant groups in the merged output. This stage depends on the compute stage completing.

Key considerations:

  • Top-k can differ between compute and merge phases (e.g., compute top 10000, merge top 8000)
  • Output is organized as: output_folder/{grouping}/{stat_name}/metric.json
  • MetricStatsDict provides mean, total, count, min, max for each group
  • Optional removal of input per-task files after successful merge
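The merge step above can be sketched in plain Python. This is a simplified stdlib illustration with a hypothetical function name, not DataTrove's StatsMerger, which merges the full MetricStatsDict state (mean, total, count, min, max) rather than just totals and counts:

```python
import json
from collections import defaultdict
from pathlib import Path

def merge_task_stats(stat_folder, top_k=None):
    """Merge every per-task {rank}.json in one {grouping}/{stat_name}/ folder
    into a single metric.json, optionally keeping only the top_k groups."""
    merged = defaultdict(lambda: {"total": 0.0, "n": 0})
    for rank_file in sorted(Path(stat_folder).glob("*.json")):
        if rank_file.name == "metric.json":
            continue  # skip the output of a previous merge
        for group, s in json.loads(rank_file.read_text()).items():
            merged[group]["total"] += s["total"]
            merged[group]["n"] += s["n"]
    if top_k is not None:  # keep only the most frequent groups
        kept = sorted(merged, key=lambda g: merged[g]["n"], reverse=True)[:top_k]
        merged = {g: merged[g] for g in kept}
    for s in merged.values():
        s["mean"] = s["total"] / s["n"]
    merged = dict(merged)
    (Path(stat_folder) / "metric.json").write_text(json.dumps(merged))
    return merged
```

Note that top-k filtering happens after summation, so a group kept in the merged output reflects contributions from every task, not just the tasks where it ranked highly.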

Execution Diagram

GitHub URL

Workflow Repository