Workflow:Huggingface Datatrove FineWeb Dataset Creation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, LLM_Training |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Complete production pipeline for creating the FineWeb dataset from Common Crawl, combining text extraction, multi-stage quality filtering, MinHash deduplication, and PII removal.
Description
This workflow reproduces the full FineWeb dataset production pipeline as used by Hugging Face. It combines two major phases: (1) base processing of Common Crawl WARC files through URL filtering, text extraction, language filtering, and multi-layered quality heuristics (Gopher repetition, Gopher quality, C4 quality, FineWeb-specific quality), and (2) per-dump MinHash deduplication with PII scrubbing. The pipeline orchestrates multiple dependent Slurm jobs that process data at Common Crawl scale (8000 tasks per dump for base processing alone), with automatic dependency management between stages.
Usage
Execute this workflow when you need to create a high-quality English web text dataset from Common Crawl at production scale, replicating the FineWeb methodology. This is the most comprehensive data pipeline in the repository, combining all filtering and deduplication capabilities into a single end-to-end process.
Execution Steps
Step 1: Read Common Crawl WARC Files
Ingest WARC archive files for a specific Common Crawl dump from the S3 bucket. Each WARC record contains the raw HTTP response including HTML from a crawled web page. Files are distributed across 8000 parallel tasks for processing at scale.
Key considerations:
- Randomize task start times within a 180-second window to avoid S3 request storms
- Attach dump identifier as default metadata for downstream traceability
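The points above can be illustrated with a minimal standard-library sketch. It is not the datatrove implementation: the strided file split, the per-rank seeding, the record layout, and every name and path below are assumptions made for illustration, and the task count is shrunk from 8000 to 8 so the result is easy to inspect.

```python
import random

DUMP = "CC-MAIN-EXAMPLE"       # hypothetical dump identifier
NUM_TASKS = 8                  # the workflow above uses 8000
START_WINDOW_SECONDS = 180


def shard_for_task(paths, rank, num_tasks):
    """Give task `rank` every num_tasks-th file from the sorted listing."""
    return sorted(paths)[rank::num_tasks]


def start_delay(rank, window=START_WINDOW_SECONDS):
    """Deterministic pseudo-random delay (seconds) inside the start window."""
    return random.Random(rank).uniform(0, window)


def make_record(path, dump=DUMP):
    """Attach the dump identifier as default metadata on a new record."""
    return {"source_file": path, "metadata": {"dump": dump}}


# Hypothetical file names standing in for one dump's WARC listing.
warc_paths = [f"segments/{i % 4}/warc/part-{i:05d}.warc.gz" for i in range(20)]

shards = [shard_for_task(warc_paths, rank, NUM_TASKS) for rank in range(NUM_TASKS)]
delays = [start_delay(rank) for rank in range(NUM_TASKS)]
records = [make_record(path) for path in shards[0]]
```

Because each task derives its shard from its rank alone, no coordination between tasks is needed, and spreading the start delays keeps the initial listing requests from arriving at the bucket all at once.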
Step 2: URL Filtering, Text Extraction, and Language Filtering
Apply URL blocklist filtering to remove known bad domains, extract plain text from the HTML using Trafilatura in its precision-favoring mode, then classify the language of each document and keep only English. Each filtering stage writes the documents it excludes to a separate output folder organized by filter type and dump.
Key considerations:
- Non-English documents are organized by language for potential separate processing
- Each exclusion category (URL, language) gets its own output directory
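As a rough illustration of the routing described above, the sketch below decides where a document ends up. The blocklist entries, the directory names, and the `language` field (assumed to be set by an upstream classifier) are all hypothetical; text extraction is left out entirely.

```python
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example", "ads.example"}  # hypothetical blocklist


def url_blocked(url):
    """True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def route(doc, dump):
    """Return the output location for one document.

    `doc["language"]` is assumed to come from an upstream language classifier.
    """
    if url_blocked(doc["url"]):
        return f"removed/url/{dump}"
    if doc["language"] != "en":
        # Non-English documents are grouped by language, then by dump.
        return f"non_english/{doc['language']}/{dump}"
    return "kept"
```

Grouping rejected documents by language first makes it straightforward to pick up a single language later without rescanning the whole dump.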
Step 3: Quality Filtering Pipeline
Apply four layers of quality heuristics in sequence: Gopher repetition filter (duplicate lines/n-grams), Gopher quality filter (word counts, lengths, symbols), C4 quality filter (bad words, JavaScript, short paragraphs), and FineWeb-specific quality filter (punctuation lines, character duplication, newline ratios). Documents must pass all four filter layers to survive.
Key considerations:
- C4 filter runs with terminal punctuation check disabled
- FineWeb quality filter adds heuristics beyond Gopher and C4
- Each filter stage saves rejected documents for auditing
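The "pass every layer, record the first failure" behavior can be sketched as follows. The four checks are toy stand-ins loosely inspired by the filter families named above; their thresholds are illustrative assumptions and are not the values used by the Gopher, C4, or FineWeb filters.

```python
def duplicate_line_fraction(text):
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1 - len(set(lines)) / len(lines)


# Toy checks with illustrative thresholds (assumptions, not the real values).
def repetition_ok(text):
    return duplicate_line_fraction(text) <= 0.3


def word_count_ok(text):
    return 5 <= len(text.split()) <= 100_000


def no_javascript_notice(text):
    return "javascript" not in text.lower()


def newline_ratio_ok(text):
    words = len(text.split())
    return words > 0 and text.count("\n") / words <= 0.3


FILTERS = [
    ("repetition", repetition_ok),
    ("word_count", word_count_ok),
    ("javascript", no_javascript_notice),
    ("newline_ratio", newline_ratio_ok),
]


def run_chain(docs, filters=FILTERS):
    """Keep documents that pass every filter; file the rest under the first filter they fail."""
    kept = []
    rejected = {name: [] for name, _ in filters}
    for doc in docs:
        for name, check in filters:
            if not check(doc):
                rejected[name].append(doc)
                break
        else:
            kept.append(doc)
    return kept, rejected


good = "The quick brown fox jumps over the lazy dog near the quiet river bank."
repeated = "buy now\nbuy now\nbuy now\nbuy now"
short = "too short"
js_notice = "Please enable JavaScript in your browser to view this page properly."

kept, rejected = run_chain([good, repeated, short, js_notice])
```

Filing each rejected document under the first filter it fails keeps the audit folders disjoint, so per-filter removal rates can be read directly from their sizes.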
Step 4: Write Base Processing Output
Serialize all surviving documents from the quality filtering chain to compressed JSONL files, organized by dump. This output becomes the input for the deduplication phase.
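A minimal sketch of this serialization step, using only the standard library, is shown below. The file naming scheme (`<dump>/<rank>.jsonl.gz`), the dump identifier, and the document fields are assumptions for illustration rather than the layout the pipeline necessarily uses.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(docs, output_root, dump, rank):
    """Write one task's documents as gzip-compressed JSON Lines under the dump folder."""
    out_path = Path(output_root) / dump / f"{rank:05d}.jsonl.gz"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(out_path, "wt", encoding="utf-8") as handle:
        for doc in docs:
            handle.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return out_path


def read_jsonl_gz(path):
    """Read the documents back, one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


docs = [
    {"id": "doc-0", "text": "First surviving document.", "metadata": {"dump": "CC-MAIN-EXAMPLE"}},
    {"id": "doc-1", "text": "Second surviving document.", "metadata": {"dump": "CC-MAIN-EXAMPLE"}},
]

with tempfile.TemporaryDirectory() as tmp:
    written = write_jsonl_gz(docs, tmp, "CC-MAIN-EXAMPLE", rank=0)
    relative_path = written.relative_to(tmp).as_posix()
    restored = read_jsonl_gz(written)
```

Writing one file per task rank gives the deduplication phase a stable, ordered set of shards to read.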
Step 5: Compute MinHash Signatures
Generate MinHash LSH signatures for each document in the base processing output. The stage uses SHA1 hashing at 64-bit precision over 5-gram shingles, with 14 buckets of 8 hashes each (112 hashes per signature). Each of the 1000 tasks processes its own shard and writes signature files.
Key considerations:
- SHA1 at 64-bit precision keeps hash collisions, and therefore false-positive duplicate matches, rare
- Configuration: 14 buckets, 8 hashes per bucket, 5-gram shingles
- This stage depends on completion of base processing
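To make the configuration concrete, here is a self-contained sketch of MinHash signatures with the same shape: 5-word shingles, SHA1 truncated to 64 bits, and 14 buckets of 8 hashes. It is an illustration of the general technique, not datatrove's code; the linear permutations modulo a Mersenne prime, the whitespace tokenization, and the fixed seed are assumptions.

```python
import hashlib
import random

NUM_BUCKETS = 14
HASHES_PER_BUCKET = 8
N_GRAM = 5
MERSENNE_PRIME = (1 << 61) - 1

# One (a, b) pair per hash function; a fixed seed keeps signatures reproducible.
_rng = random.Random(0)
PERMUTATIONS = [
    (_rng.randrange(1, MERSENNE_PRIME), _rng.randrange(0, MERSENNE_PRIME))
    for _ in range(NUM_BUCKETS * HASHES_PER_BUCKET)
]


def sha1_64(text):
    """SHA1 digest truncated to its first 64 bits, as an integer."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def shingles(text, n=N_GRAM):
    """Set of word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def signature(text):
    """Return 14 buckets of 8 min-hash values each, or None for empty text."""
    hashes = [sha1_64(s) for s in shingles(text)]
    if not hashes:
        return None
    mins = [min((a * h + b) % MERSENNE_PRIME for h in hashes) for a, b in PERMUTATIONS]
    return [
        tuple(mins[i * HASHES_PER_BUCKET:(i + 1) * HASHES_PER_BUCKET])
        for i in range(NUM_BUCKETS)
    ]


def shared_buckets(sig_a, sig_b):
    """Number of buckets in which two signatures agree on all 8 values."""
    return sum(a == b for a, b in zip(sig_a, sig_b))


text_a = "the committee met on tuesday to review the annual budget for the city library"
text_b = "fresh basil and ripe tomatoes make a simple summer salad with olive oil and salt"
sig_a, sig_b = signature(text_a), signature(text_b)
```

Two documents become duplicate candidates when at least one bucket matches in full, which is why the later matching stage can work one bucket at a time.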
Step 6: Bucket Matching and Clustering
Find matching document pairs within each LSH bucket, then merge all matches into a global union-find structure to form duplicate clusters. The bucket matching runs on 700 tasks (50 per bucket), followed by a single-task clustering step that requires high memory to hold the full duplicate graph.
Key considerations:
- Bucket matching: 14 buckets x 50 workers = 700 tasks
- Clustering runs on a single high-memory task (200 GB)
- Stages are chained with Slurm dependency management
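The clustering idea can be sketched with a small union-find over matched pairs. This is an assumption-laden illustration: document IDs are plain integers, the rank-to-bucket mapping is hypothetical, and the rule "keep the smallest ID in each cluster" is a choice made for this example rather than a statement about the actual implementation.

```python
NUM_BUCKETS = 14
WORKERS_PER_BUCKET = 50
TOTAL_MATCHING_TASKS = NUM_BUCKETS * WORKERS_PER_BUCKET  # 14 x 50 = 700


def bucket_and_worker(rank):
    """Hypothetical mapping from a task rank to (bucket index, worker index)."""
    return divmod(rank, WORKERS_PER_BUCKET)


class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # Keep the smaller ID as the cluster representative.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)


def duplicates_to_remove(pairs):
    """Merge matched pairs into clusters; return every ID except each cluster's representative."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    return sorted(doc for doc in list(uf.parent) if uf.find(doc) != doc)


# Matches as they might arrive from different buckets: (0,1), (1,2) and (2,7)
# chain into one cluster even though 0 and 7 never matched directly.
matched_pairs = [(0, 1), (1, 2), (5, 6), (2, 7)]
remove_ids = duplicates_to_remove(matched_pairs)
```

Because matches from different buckets can chain documents together transitively, the full set of pairs has to be merged in one place, which is why this step is a single task with a large memory allocation.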
Step 7: Filter Duplicates and Remove PII
Re-read the original base processing output, remove all duplicate documents identified by the clustering stage, apply PII formatting (replacing email addresses and IP addresses with placeholders), and count tokens for statistics. Write the final deduplicated, PII-scrubbed output.
Key considerations:
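- Input reader (and its task count) must match the signature stage exactly, so that the document positions recorded during deduplication refer to the same documents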
- Token counting provides before/after dedup metrics
- PII formatter handles emails and IP addresses