Workflow:Huggingface Datatrove FineWeb Dataset Creation

From Leeroopedia
Revision as of 11:01, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Huggingface_Datatrove_FineWeb_Dataset_Creation.md)
Knowledge Sources
Domains: Data_Engineering, NLP, LLM_Training
Last Updated: 2026-02-14 17:00 GMT

Overview

Complete production pipeline for creating the FineWeb dataset from Common Crawl, combining text extraction, multi-stage quality filtering, MinHash deduplication, and PII removal.

Description

This workflow reproduces the full FineWeb dataset production pipeline as used by Hugging Face. It combines two major phases: (1) base processing of Common Crawl WARC files through URL filtering, text extraction, language filtering, and multi-layered quality heuristics (Gopher repetition, Gopher quality, C4 quality, FineWeb-specific quality), and (2) per-dump MinHash deduplication with PII scrubbing. The pipeline orchestrates multiple dependent Slurm jobs that process data at Common Crawl scale (8000+ tasks per dump), with dependencies between stages managed automatically.
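The dependency structure between stages can be sketched as a small graph. This is a minimal illustration using only the Python standard library, not the Slurm submission logic itself; the stage names are hypothetical labels chosen for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names. Each stage maps to the set of stages it waits for.
STAGE_DEPENDENCIES = {
    "base_processing": set(),
    "minhash_signatures": {"base_processing"},
    "minhash_buckets": {"minhash_signatures"},
    "minhash_cluster": {"minhash_buckets"},
    "dedup_filter_and_pii": {"minhash_cluster"},
}


def launch_order(dependencies):
    """Return stage names in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, stage in enumerate(launch_order(STAGE_DEPENDENCIES), start=1):
        print(position, stage)
```

On a real cluster each stage would be submitted as its own job and told to wait for its predecessor; the graph above only shows the ordering constraint.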

Usage

Execute this workflow when you need to create a high-quality English web text dataset from Common Crawl at production scale, replicating the FineWeb methodology. This is the most comprehensive data pipeline in the repository, combining all filtering and deduplication capabilities into a single end-to-end process.

Execution Steps

Step 1: Read Common Crawl WARC Files

Ingest the WARC files for a specific Common Crawl dump from the Common Crawl S3 bucket. Each WARC record contains the raw HTTP response, including the HTML, for one crawled web page. Files are distributed across 8000 parallel tasks for processing at scale.

Key considerations:

  • Randomize task start times within a 180-second window to avoid S3 request storms
  • Attach dump identifier as default metadata for downstream traceability
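The two considerations above, together with the distribution of files across tasks, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's reader: round-robin sharding, the seeding scheme, the function names, and the example dump identifier are all hypothetical.

```python
import random


def files_for_task(files, rank, world_size):
    """Give task `rank` every `world_size`-th file, starting at index `rank`.

    Assumption: round-robin sharding; the real reader may shard differently.
    """
    return files[rank::world_size]


def start_delay_seconds(rank, window=180, seed=0):
    """Pick a reproducible pseudo-random delay between 0 and `window` seconds."""
    return random.Random(seed * 1_000_003 + rank).uniform(0, window)


def with_default_metadata(record, dump):
    """Attach the dump identifier without overwriting existing metadata."""
    metadata = {"dump": dump, **record.get("metadata", {})}
    return {**record, "metadata": metadata}


if __name__ == "__main__":
    warc_files = [f"file-{i}.warc.gz" for i in range(10)]  # placeholder names
    print(files_for_task(warc_files, rank=1, world_size=4))
    print(round(start_delay_seconds(rank=1), 1), "seconds before first request")
    print(with_default_metadata({"text": "hello"}, dump="CC-MAIN-2023-50"))
```

Spreading the first request of each task over the window means thousands of tasks do not all hit the bucket in the same second.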

Step 2: URL and Language Filtering

Apply URL blocklist filtering to remove known bad domains, extract plain text from the HTML using Trafilatura in precision mode, and then classify the language of each document, keeping only English. Each filtering stage writes its excluded documents to a separate output folder organized by filter type and dump.

Key considerations:

  • Non-English documents are organized by language for potential separate processing
  • Each exclusion category (URL, language) gets its own output directory
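The routing of excluded documents can be illustrated with a short sketch. The folder names, the `route_document` helper, and the example domain are hypothetical, and the language label is assumed to have been produced already by a classifier after text extraction.

```python
from urllib.parse import urlsplit


def route_document(url, language, dump, blocked_domains):
    """Return the output folder for one document (hypothetical layout)."""
    host = (urlsplit(url).hostname or "").lower()
    if host in blocked_domains:
        # URL exclusions: one folder per dump.
        return f"removed/url/{dump}"
    if language != "en":
        # Language exclusions: grouped by detected language, then by dump.
        return f"removed/non_english/{language}/{dump}"
    return f"kept/{dump}"


if __name__ == "__main__":
    blocked = {"spam.example"}
    dump = "CC-MAIN-2023-50"  # example dump identifier
    print(route_document("https://spam.example/page", "en", dump, blocked))
    print(route_document("https://news.example/artikel", "de", dump, blocked))
    print(route_document("https://news.example/article", "en", dump, blocked))
```

Grouping non-English rejects by language keeps them usable as input for a later per-language pipeline.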

Step 3: Quality Filtering Pipeline

Apply four layers of quality heuristics in sequence: Gopher repetition filter (duplicate lines/n-grams), Gopher quality filter (word counts, lengths, symbols), C4 quality filter (bad words, JavaScript, short paragraphs), and FineWeb-specific quality filter (punctuation lines, character duplication, newline ratios). Documents must pass all four filter layers to survive.

Key considerations:

  • C4 filter runs with terminal punctuation check disabled
  • FineWeb quality filter adds heuristics beyond Gopher and C4
  • Each filter stage saves rejected documents for auditing
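A sequential filter chain with per-filter rejection logs can be sketched as follows. The four predicates are deliberately simplified stand-ins with made-up thresholds, not the actual Gopher, C4, or FineWeb rules; only the control flow (first failing filter rejects the document and records it) reflects the description above.

```python
def _non_empty_lines(text):
    return [line.strip() for line in text.splitlines() if line.strip()]


def duplicate_line_fraction(text):
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = _non_empty_lines(text)
    return 1 - len(set(lines)) / len(lines) if lines else 1.0


def punctuated_line_fraction(text):
    """Fraction of non-empty lines ending in terminal punctuation."""
    lines = _non_empty_lines(text)
    if not lines:
        return 0.0
    return sum(line.endswith((".", "!", "?", '"')) for line in lines) / len(lines)


# Toy stand-ins for the four layers; every threshold here is an assumption.
FILTERS = [
    ("gopher_repetition", lambda t: duplicate_line_fraction(t) <= 0.3),
    ("gopher_quality", lambda t: len(t.split()) >= 5),
    ("c4_quality", lambda t: "lorem ipsum" not in t.lower()),
    ("fineweb_quality", lambda t: punctuated_line_fraction(t) >= 0.5),
]


def run_filter_chain(documents, filters):
    """Return (kept, rejected), where rejected maps filter name to documents."""
    kept = []
    rejected = {name: [] for name, _ in filters}
    for doc in documents:
        for name, passes in filters:
            if not passes(doc):
                rejected[name].append(doc)  # saved for auditing
                break
        else:
            kept.append(doc)  # passed every layer
    return kept, rejected


if __name__ == "__main__":
    docs = ["Too short.", "A complete sentence with enough words in it."]
    kept, rejected = run_filter_chain(docs, FILTERS)
    print(len(kept), {name: len(items) for name, items in rejected.items()})
```

Because a document stops at the first filter it fails, each rejection folder shows exactly which layer removed it.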

Step 4: Write Base Processing Output

Serialize all surviving documents from the quality filtering chain to compressed JSONL files, organized by dump. This output becomes the input for the deduplication phase.
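Writing one compressed JSONL shard per task could look like the sketch below. It assumes gzip compression and a `<root>/<dump>/<rank>.jsonl.gz` layout; both the layout and the helper names are illustrative rather than taken from the pipeline.

```python
import gzip
import json
from pathlib import Path


def write_shard(documents, output_root, dump, rank):
    """Write documents as gzip-compressed JSONL, one JSON object per line."""
    path = Path(output_root) / dump / f"{rank:05d}.jsonl.gz"
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for doc in documents:
            handle.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return path


def read_shard(path):
    """Read a shard back into a list of dictionaries."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as root:
        docs = [{"id": "doc-0", "text": "Hello world.", "metadata": {"dump": "CC-MAIN-2023-50"}}]
        shard = write_shard(docs, root, dump="CC-MAIN-2023-50", rank=0)
        print(shard.name, read_shard(shard) == docs)
```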

Step 5: Compute MinHash Signatures

Generate MinHash LSH signatures for each document in the base processing output. The configuration uses SHA1 hashing at 64-bit precision with 14 buckets of 8 hashes each (112 hashes per document) over 5-gram shingles. Each of the 1000 tasks processes its own shard and writes signature files.

Key considerations:

  • SHA1 with 64-bit precision produces fewer hash collisions, and therefore fewer false-positive matches, than 32-bit hashes
  • Configuration: 14 buckets, 8 hashes per bucket, 5-gram shingles
  • This stage depends on completion of base processing
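The signature layout described above (14 buckets of 8 hashes over word 5-grams) can be sketched with the standard library. This is a toy version under stated assumptions: the family of hash functions (a seeded linear permutation modulo a Mersenne prime), the tokenization, and all function names are illustrative and are not the library's implementation.

```python
import hashlib
import random
import re
import struct

NUM_BUCKETS = 14        # from the configuration above
HASHES_PER_BUCKET = 8   # from the configuration above
N_GRAM = 5              # word 5-gram shingles
NUM_HASHES = NUM_BUCKETS * HASHES_PER_BUCKET  # 112 values per signature
MERSENNE_PRIME = (1 << 61) - 1  # assumption: modulus for the toy hash family

# Assumption: a fixed seed so that every task derives the same hash functions.
_rng = random.Random(0)
HASH_PARAMS = [
    (_rng.randrange(1, MERSENNE_PRIME), _rng.randrange(0, MERSENNE_PRIME))
    for _ in range(NUM_HASHES)
]


def shingles(text, n=N_GRAM):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def sha1_64(value):
    """Hash a string with SHA1 and keep the first 64 bits as an integer."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return struct.unpack("<Q", digest[:8])[0]


def signature(text):
    """Return 112 minimum values, one per toy hash function."""
    grams = shingles(text)
    if not grams:
        raise ValueError("cannot build a signature for empty text")
    base = [sha1_64(gram) for gram in grams]
    return [min((a * h + b) % MERSENNE_PRIME for h in base) for a, b in HASH_PARAMS]


def buckets(sig):
    """Split a signature into 14 tuples of 8 values; equal tuples collide."""
    return [tuple(sig[i:i + HASHES_PER_BUCKET]) for i in range(0, NUM_HASHES, HASHES_PER_BUCKET)]


def match_probability(similarity):
    """Chance that two documents with this Jaccard similarity share a bucket."""
    return 1 - (1 - similarity ** HASHES_PER_BUCKET) ** NUM_BUCKETS


if __name__ == "__main__":
    sig = signature("the quick brown fox jumps over the lazy dog near the river bank")
    print(len(sig), len(buckets(sig)))
    print(round(match_probability(0.75), 2))
```

With this configuration, two documents whose shingle sets have a Jaccard similarity of 0.75 share at least one bucket with probability 1 - (1 - 0.75^8)^14, which is about 0.77.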

Step 6: Bucket Matching and Clustering

Find matching document pairs within each LSH bucket, then merge all matches into a global union-find structure to form duplicate clusters. The bucket matching runs on 700 tasks (50 per bucket), followed by a single-task clustering step that requires high memory to hold the full duplicate graph.

Key considerations:

  • Bucket matching: 14 buckets x 50 workers = 700 tasks
  • Clustering runs on a single high-memory task (200 GB)
  • Stages are chained with Slurm dependency management
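Merging matched pairs into duplicate clusters is a classic union-find problem. The sketch below is a generic in-memory version with path compression; the document identifiers and the rule of keeping the smallest identifier in each cluster are assumptions made for this example.

```python
def cluster_duplicates(pairs):
    """Group document ids into clusters given pairwise duplicate matches.

    Returns a dict mapping each cluster root to the sorted list of its members.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Assumption: the smaller id becomes the cluster representative.
            parent[max(root_left, root_right)] = min(root_left, root_right)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return {root: sorted(members) for root, members in clusters.items()}


def documents_to_remove(clusters):
    """Keep one representative per cluster and mark the rest for removal."""
    return {doc for root, members in clusters.items() for doc in members if doc != root}


if __name__ == "__main__":
    matches = [(1, 2), (2, 3), (5, 6)]  # pairs found in any bucket
    clusters = cluster_duplicates(matches)
    print(clusters)
    print(sorted(documents_to_remove(clusters)))
```

The whole structure has to fit in the memory of one process, which is why this step is the one that needs a large single-task allocation.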

Step 7: Filter Duplicates and Remove PII

Re-read the original base processing output, remove all duplicate documents identified by the clustering stage, apply PII formatting (replacing email addresses and IP addresses with placeholders), and count tokens for statistics. Write the final deduplicated, PII-scrubbed output.

Key considerations:

  • The input reader must be configured exactly as in the signature stage (Step 5) so that both stages see the same documents in the same shards and order
  • Token counting provides before/after dedup metrics
  • PII formatter handles emails and IP addresses
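Placeholder substitution for emails and IP addresses can be sketched with regular expressions. The patterns below are simplified (IPv4 only) and the placeholder values are hypothetical; they are not the formatter's actual rules.

```python
import re

# Simplified patterns; real-world PII detection needs broader coverage.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(rf"\b(?:{_OCTET}\.){{3}}{_OCTET}\b")


def scrub_pii(text, email_placeholder="email@example.com", ip_placeholder="0.0.0.0"):
    """Replace email addresses and IPv4 addresses with fixed placeholders."""
    text = EMAIL_RE.sub(email_placeholder, text)
    return IPV4_RE.sub(ip_placeholder, text)


if __name__ == "__main__":
    print(scrub_pii("Mail jane.doe@corp.io from 192.168.1.20 now"))
```

Replacing matches with well-formed placeholders, rather than deleting them, keeps the surrounding sentence readable for downstream training.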
