Workflow:Huggingface Datatrove Common Crawl Processing
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Web_Crawling |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
End-to-end process for extracting clean text from Common Crawl WARC archives using URL filtering, HTML text extraction, language detection, and quality filtering.
Description
This workflow processes raw Common Crawl web archive (WARC) files into high-quality English text data suitable for language model training. It reads WARC records containing raw HTML, applies URL-based filtering to remove blocked domains, extracts plain text from HTML using Trafilatura, identifies and keeps only English-language documents, and applies Gopher-derived quality and repetition heuristics to remove low-quality content. The output is clean, filtered JSONL data organized by dump identifier.
Usage
Execute this workflow when you have access to a Common Crawl dump (e.g., CC-MAIN-2023-50) and need to produce a clean English text corpus from it. This is typically the first stage of a larger data processing pipeline for LLM pretraining data, before deduplication and tokenization.
Execution Steps
Step 1: Read WARC Archives
Ingest raw WARC files from the Common Crawl S3 bucket for a specific dump. Each WARC record contains the HTTP response (including HTML) from a single web page crawl. The reader distributes files across parallel tasks using shard-based splitting so each worker processes a non-overlapping subset of WARC files.
Key considerations:
- Use a glob pattern that selects only WARC files, not the derived WAT (metadata) or WET (extracted text) files
- Attach dump metadata to each document for downstream traceability
- Randomize start times across workers to avoid S3 request storms
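The sharding and start-time ideas above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names, the `*/warc/*` pattern, the strided split, and the 180-second delay window are all assumptions made for the example.

```python
import fnmatch
import random


def select_warc(paths, pattern="*/warc/*"):
    """Keep only paths that sit under a warc/ folder (assumed layout)."""
    return [p for p in paths if fnmatch.fnmatch(p, pattern)]


def shard_for_rank(paths, rank, world_size):
    """Return the non-overlapping subset of files assigned to one worker."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Sort first so every worker derives the same ordering independently,
    # then take every world_size-th file starting at this worker's rank.
    return sorted(paths)[rank::world_size]


def start_delay_seconds(rank, max_delay=180.0, seed=0):
    """Per-rank pseudo-random delay between 0 and max_delay seconds."""
    # Seeding with the rank makes the delay reproducible for a given worker.
    return random.Random(seed * 1_000_003 + rank).uniform(0.0, max_delay)


def attach_dump(document, dump):
    """Return a copy of a document dict with the dump id in its metadata."""
    metadata = dict(document.get("metadata", {}), dump=dump)
    return {**document, "metadata": metadata}
```

Because the split depends only on the sorted file list, the rank, and the worker count, no coordination between workers is needed to guarantee that every file is processed exactly once.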
Step 2: URL Filtering
Remove documents whose source URLs match known blocklists. This step checks against lists of banned domains, banned URL substrings, and banned words/subwords in URLs. Documents that fail URL filtering are optionally written to an exclusion output for auditing.
Key considerations:
- Blocklists cover adult content, spam, and low-quality domains
- Exclusion writers preserve removed documents for review
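The three blocklist checks can be illustrated with a small standard-library sketch. The function names, the tokenisation rule, and the example blocklist entries are hypothetical and chosen only for illustration; the real blocklists and matching rules may differ.

```python
import re
from urllib.parse import urlparse


def url_block_reason(url, banned_domains, banned_substrings, banned_words):
    """Return why a URL is blocked ("domain", "substring", "word") or None."""
    lowered = url.lower()
    host = urlparse(lowered).hostname or ""
    # Domain check: test the host and each of its parent domains.
    parts = host.split(".")
    for i in range(len(parts)):
        if ".".join(parts[i:]) in banned_domains:
            return "domain"
    # Substring check: any banned fragment anywhere in the URL.
    if any(fragment in lowered for fragment in banned_substrings):
        return "substring"
    # Word check: split on non-alphanumeric characters, compare whole tokens.
    tokens = set(re.split(r"[^a-z0-9]+", lowered))
    if tokens & set(banned_words):
        return "word"
    return None


def split_by_url(docs, banned_domains, banned_substrings, banned_words):
    """Split documents into (kept, removed); removed ones carry the reason."""
    kept, removed = [], []
    for doc in docs:
        reason = url_block_reason(
            doc["url"], banned_domains, banned_substrings, banned_words
        )
        if reason is None:
            kept.append(doc)
        else:
            removed.append({**doc, "filter_reason": reason})
    return kept, removed
```

Returning the removed documents together with a reason mirrors the role of the exclusion output: the rejected set can be written elsewhere and audited without rerunning the filter.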
Step 3: HTML Text Extraction
Extract clean plain text from raw HTML using the Trafilatura library. The extractor strips boilerplate (navigation, headers, footers, ads) and retains the main content. Extraction runs in a sandboxed subprocess so that memory leaks or crashes in the extractor cannot take down the pipeline.
Key considerations:
- Precision-favouring mode (`favour_precision`) reduces false positives: it keeps less text, but of higher quality
- Process isolation protects against third-party library crashes
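The process-isolation idea can be sketched with the standard library alone. In this illustration a crude `html.parser`-based extractor stands in for Trafilatura, and `extract_isolated`, the skipped-tag list, and the 5-second timeout are assumptions made for the example, not the pipeline's actual sandbox.

```python
import subprocess
import sys

# Child program: reads HTML on stdin and prints the visible text.
# A deliberately crude stand-in for a real extractor.
_CHILD = r"""
import sys
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    SKIP = {"script", "style", "nav", "header", "footer"}
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())

parser = TextOnly()
parser.feed(sys.stdin.buffer.read().decode("utf-8", "replace"))
sys.stdout.buffer.write("\n".join(parser.chunks).encode("utf-8"))
"""


def extract_isolated(html, timeout=5.0):
    """Extract text in a child process; return None on crash or timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=html.encode("utf-8"),
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # the child was killed; the parent keeps running
    if result.returncode != 0:
        return None  # the child crashed; skip this document
    return result.stdout.decode("utf-8")
```

Starting one process per document is expensive, so a production sandbox would more plausibly keep a long-lived worker process and restart it only after a failure. The sketch shows only the failure-containment property: a hang or crash in the extractor costs one document, not the whole task.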
Step 4: Language Filtering
Classify each document by language using a FastText language identification model and keep only English documents. Non-English documents are routed to language-specific output folders for potential separate processing.
Key considerations:
- Documents below the confidence threshold are discarded
- Non-English documents are saved organized by detected language and dump
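The routing rules above can be written as a small decision function. This sketch assumes the classifier returns a fastText-style label such as `__label__en` with a confidence score. The function name, the 0.65 threshold, and the `language/dump/rank` path layout are illustrative assumptions, not values taken from the text above.

```python
def route_by_language(label, score, dump, rank, keep=("en",), threshold=0.65):
    """Decide what happens to a document after language identification.

    Returns ("keep", None), ("exclude", relative_path) or ("drop", None).
    """
    prefix = "__label__"
    language = label[len(prefix):] if label.startswith(prefix) else label
    if score < threshold:
        # Not confident enough in any language: discard the document.
        return ("drop", None)
    if language in keep:
        return ("keep", None)
    # Confidently another language: route to a per-language, per-dump folder.
    return ("exclude", f"{language}/{dump}/{rank:05d}.jsonl.gz")
```

For example, a confidently French document from rank 3 of dump `CC-MAIN-2023-50` would be routed to `fr/CC-MAIN-2023-50/00003.jsonl.gz` under whatever root the exclusion output uses.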
Step 5: Repetition Filtering
Apply Gopher repetition heuristics to remove documents with excessive duplicate content. This checks for repeated lines, repeated paragraphs, and repeated word n-grams (measured by the fraction of characters they cover), removing documents that exceed the thresholds in Table A1 of DeepMind's Gopher paper.
Key considerations:
- Checks duplicate line fractions and duplicate paragraph fractions
- Measures the fraction of characters covered by the most frequent word n-gram (n = 2 to 4) and by all duplicated word n-grams (n = 5 to 10)
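Two of these signals, the duplicate line fraction and the character share of the most frequent word n-gram, can be sketched as follows. The exact counting rules here are one reasonable reading rather than the reference implementation, and the default thresholds are believed to match Table A1 of the Gopher paper but should be verified before use.

```python
from collections import Counter


def duplicate_line_fraction(text):
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return 0.0
    seen, repeats = set(), 0
    for line in lines:
        if line in seen:
            repeats += 1
        else:
            seen.add(line)
    return repeats / len(lines)


def top_ngram_char_fraction(text, n):
    """Share of word characters covered by the most frequent word n-gram."""
    words = text.split()
    total_chars = sum(len(w) for w in words)
    if len(words) < n or total_chars == 0:
        return 0.0
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    top, count = counts.most_common(1)[0]
    if count < 2:
        return 0.0  # sketch-specific choice: a single occurrence is no repetition
    covered = sum(len(w) for w in top) * count
    # Overlapping occurrences are counted in full, so cap the result at 1.0.
    return min(1.0, covered / total_chars)


def is_repetitive(text, max_dup_line_frac=0.30, top_ngram_limits=None):
    """True if any checked signal exceeds its (assumed) threshold."""
    if top_ngram_limits is None:
        top_ngram_limits = {2: 0.20, 3: 0.18, 4: 0.16}  # verify against Table A1
    if duplicate_line_fraction(text) > max_dup_line_frac:
        return True
    return any(
        top_ngram_char_fraction(text, n) > limit
        for n, limit in top_ngram_limits.items()
    )
```

The paragraph-level checks and the duplicated n-gram checks for larger n follow the same pattern with a different unit of comparison.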
Step 6: Quality Filtering
Apply Gopher quality heuristics to filter documents based on structural indicators of quality. This checks word count bounds, average word length, symbol-to-word ratios, presence of stop words, and other quality signals derived from the Gopher paper.
Key considerations:
- Documents with too few or too many words are removed
- Low stop-word presence indicates non-natural-language content
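A subset of these checks can be sketched as a single function that returns the first reason a document fails. The function name and reason strings are hypothetical, and the default bounds and stop-word list are believed to follow the Gopher paper but should be checked against it before use.

```python
import string

# Stop words believed to be the ones listed in the Gopher paper (assumption).
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def quality_reject_reason(
    text,
    min_words=50,
    max_words=100_000,
    min_avg_len=3.0,
    max_avg_len=10.0,
    max_symbol_ratio=0.1,
    min_stop_words=2,
):
    """Return the name of the first failed check, or None if the text passes."""
    words = text.split()
    n = len(words)
    if n < max(min_words, 1):
        return "too_few_words"
    if n > max_words:
        return "too_many_words"
    avg_len = sum(len(w) for w in words) / n
    if not min_avg_len <= avg_len <= max_avg_len:
        return "avg_word_length"
    # Symbol-to-word ratios for hash signs and ellipses.
    if text.count("#") / n > max_symbol_ratio:
        return "hash_ratio"
    if (text.count("...") + text.count("\u2026")) / n > max_symbol_ratio:
        return "ellipsis_ratio"
    # Count distinct stop words, ignoring case and surrounding punctuation.
    normalised = {w.lower().strip(string.punctuation) for w in words}
    if len(STOP_WORDS & normalised) < min_stop_words:
        return "too_few_stop_words"
    return None
```

Returning a reason rather than a bare boolean makes it easy to tally which heuristic removes the most documents when tuning thresholds.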
Step 7: Write Filtered Output
Serialize the surviving documents to compressed JSONL files organized by dump identifier. Each task writes its output to a separate file using its rank as the filename to avoid write conflicts.
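The output layout can be sketched with the standard library. The function name, the zero-padded five-digit rank, and the `.jsonl.gz` suffix are assumptions for illustration; the point is that the path depends only on the dump identifier and the task rank, so no two tasks write to the same file.

```python
import gzip
import json
import os


def write_shard(docs, output_root, dump, rank):
    """Write documents to <output_root>/<dump>/<rank>.jsonl.gz, return the path."""
    folder = os.path.join(output_root, dump)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"{rank:05d}.jsonl.gz")
    # One JSON object per line, gzip-compressed, UTF-8 encoded.
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for doc in docs:
            handle.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return path


def read_shard(path):
    """Read a shard back into a list of dicts (useful for spot checks)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```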