Workflow:Huggingface Datatrove Common Crawl Processing

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP, Web_Crawling
Last Updated: 2026-02-14 17:00 GMT

Overview

End-to-end process for extracting clean text from Common Crawl WARC archives using URL filtering, HTML text extraction, language detection, and quality filtering.

Description

This workflow processes raw Common Crawl web archive (WARC) files into high-quality English text data suitable for language model training. It reads WARC records containing raw HTML, applies URL-based filtering to remove blocked domains, extracts plain text from HTML using Trafilatura, identifies and keeps only English-language documents, and applies Gopher-derived quality and repetition heuristics to remove low-quality content. The output is clean, filtered JSONL data organized by dump identifier.

Usage

Execute this workflow when you have access to a Common Crawl dump (e.g., CC-MAIN-2023-50) and need to produce a clean English text corpus from it. This is typically the first stage of a larger data processing pipeline for LLM pretraining data, before deduplication and tokenization.

Execution Steps

Step 1: Read WARC Archives

Ingest raw WARC files from the Common Crawl S3 bucket for a specific dump. Each WARC record contains the HTTP response (including HTML) from a single web page crawl. The reader distributes files across parallel tasks using shard-based splitting so each worker processes a non-overlapping subset of WARC files.

Key considerations:

  • Use the glob pattern to select only WARC files (not WAT or WET)
  • Attach dump metadata to each document for downstream traceability
  • Randomize start times across workers to avoid S3 request storms
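
The shard-based split described above can be sketched in a few lines. This is a minimal illustration, not Datatrove's reader: the function name `shard_warc_files`, the `*/warc/*` pattern, and the example paths are assumptions chosen for the sketch.

```python
from fnmatch import fnmatchcase


def shard_warc_files(paths, rank, world_size, pattern="*/warc/*"):
    """Return the non-overlapping subset of WARC paths for one worker.

    `pattern` is an assumed glob that keeps only paths under a `warc/`
    directory, which skips the WAT and WET variants.
    """
    # Sort first so every worker sees the same ordering of files.
    selected = sorted(p for p in paths if fnmatchcase(p, pattern))
    # Worker `rank` takes every `world_size`-th file, starting at `rank`.
    return selected[rank::world_size]


def attach_dump(record, dump):
    """Return a copy of a record dict with the dump name in its metadata."""
    metadata = dict(record.get("metadata", {}), dump=dump)
    return {**record, "metadata": metadata}


# Hypothetical paths, for illustration only.
example_paths = [
    "seg/1/warc/a.warc.gz",
    "seg/1/wat/a.warc.wat.gz",
    "seg/1/warc/b.warc.gz",
    "seg/2/warc/c.warc.gz",
    "seg/2/wet/c.warc.wet.gz",
]
```

With two workers, rank 0 receives the `a` and `c` WARC files and rank 1 receives `b`. The WAT and WET files are never assigned.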

Step 2: URL Filtering

Remove documents whose source URLs match known blocklists. This step checks against lists of banned domains, banned URL substrings, and banned words/subwords in URLs. Documents that fail URL filtering are optionally written to an exclusion output for auditing.

Key considerations:

  • Blocklists cover adult content, spam, and low-quality domains
  • Exclusion writers preserve removed documents for review
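
A rough sketch of the three checks (domain, URL substring, word in URL) follows. The blocklist contents and the function name `url_block_reason` are hypothetical; real blocklists are far larger and the library's matching rules may differ.

```python
import re
from urllib.parse import urlparse

# Tiny hypothetical blocklists, for illustration only.
BANNED_DOMAINS = {"spam.example"}
BANNED_SUBSTRINGS = {"free-casino"}
BANNED_WORDS = {"xxx"}


def url_block_reason(url):
    """Return why a URL is blocked ("domain", "substring", "word") or None."""
    lowered = url.lower()
    host = urlparse(lowered).hostname or ""
    # A banned domain also blocks all of its subdomains.
    if any(host == d or host.endswith("." + d) for d in BANNED_DOMAINS):
        return "domain"
    if any(s in lowered for s in BANNED_SUBSTRINGS):
        return "substring"
    # Split the URL on non-alphanumeric characters to get whole words.
    words = set(re.split(r"[^a-z0-9]+", lowered))
    if words & BANNED_WORDS:
        return "word"
    return None
```

Returning the reason rather than a bare boolean makes it easy to record why a document was dropped when it is written to an exclusion output.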

Step 3: HTML Text Extraction

Extract clean plain text from raw HTML using the Trafilatura library. The extractor strips boilerplate (navigation, headers, footers, ads) and retains the main content. The extraction runs in a sandboxed subprocess to prevent memory leaks from crashing the pipeline.

Key considerations:

  • Enabling the precision-favouring mode reduces false positives: less text is kept, but what remains is of higher quality
  • Process isolation protects against third-party library crashes
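
Process isolation can be illustrated with the standard library alone. In this sketch the child process runs a crude regex tag stripper as a stand-in for the real extractor (it is not Trafilatura), and `extract_isolated` and the timeout value are assumptions made for the example.

```python
import subprocess
import sys

# Code run in the child process. The regex tag stripper is only a
# placeholder for a real HTML-to-text extractor.
CHILD_SCRIPT = r"""
import re, sys
html = sys.stdin.read()
text = re.sub(r"<[^>]+>", " ", html)
sys.stdout.write(" ".join(text.split()))
"""


def extract_isolated(html, timeout=5.0):
    """Extract text in a separate interpreter; return None on any failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_SCRIPT],
            input=html,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # A hung extraction is abandoned instead of stalling the worker.
        return None
    if result.returncode != 0:
        # A crash in the child does not propagate to the parent.
        return None
    return result.stdout
```

Starting one interpreter per document is slow. A production pipeline would more plausibly keep a long-lived worker process and restart it only after a crash or timeout.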

Step 4: Language Filtering

Classify each document by language using a FastText language identification model and keep only English documents. Non-English documents are routed to language-specific output folders for potential separate processing.

Key considerations:

  • Documents below the confidence threshold are discarded
  • Non-English documents are saved organized by detected language and dump
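
The routing rule can be sketched as a small decision function. The classifier itself is omitted, so the function takes an already predicted language and score. The name `route_document`, the 0.65 default threshold, and the `language/dump` folder layout are assumptions for illustration.

```python
def route_document(language, score, dump, keep_language="en", threshold=0.65):
    """Decide what happens to a document after language identification.

    Returns (action, folder) where action is "keep", "exclude", or "drop".
    `folder` is set only for excluded documents.
    """
    if score < threshold:
        # The prediction is not trusted, so the document is discarded.
        return ("drop", None)
    if language == keep_language:
        return ("keep", None)
    # Confident non-English documents are grouped by language, then dump.
    return ("exclude", f"{language}/{dump}")
```

For example, a French page with score 0.9 from dump `CC-MAIN-2023-50` would be routed to the folder `fr/CC-MAIN-2023-50` under these assumptions.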

Step 5: Repetition Filtering

Apply Gopher repetition heuristics to remove documents with excessive duplicate content. This step checks for repeated lines, repeated paragraphs, and repeated word n-grams, and removes documents that exceed the thresholds in Table A1 of DeepMind's Gopher paper.

Key considerations:

  • Checks duplicate line fractions and duplicate paragraph fractions
  • Measures the fraction of characters covered by the most frequent word n-gram (n = 2 to 4) and by all duplicated word n-grams (n = 5 to 10)
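
Two of these measurements can be sketched as follows. The helpers are simplified illustrations rather than the library's implementation: characters are counted inside words only (whitespace is ignored), and overlapping occurrences of an n-gram are counted more than once.

```python
from collections import Counter


def duplicate_line_fraction(text):
    """Fraction of non-empty lines that repeat an earlier line."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    if not lines:
        return 0.0
    seen = set()
    duplicates = 0
    for line in lines:
        if line in seen:
            duplicates += 1
        else:
            seen.add(line)
    return duplicates / len(lines)


def top_ngram_char_fraction(text, n):
    """Share of word characters covered by the most frequent word n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    top, count = Counter(ngrams).most_common(1)[0]
    top_chars = sum(len(w) for w in top) * count
    total_chars = sum(len(w) for w in words)
    return top_chars / total_chars
```

A document would be removed when any such value exceeds its threshold. The ratios are only meaningful on reasonably long documents, since any short text scores high on the n-gram measure.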

Step 6: Quality Filtering

Apply Gopher quality heuristics to filter documents based on structural indicators of quality. This checks word count bounds, average word length, symbol-to-word ratios, presence of stop words, and other quality signals derived from the Gopher paper.

Key considerations:

  • Documents too short or too long are removed
  • Low stop-word presence indicates non-natural-language content
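
A minimal sketch of these checks is shown below. The default bounds and the stop-word list reflect the values commonly quoted from the Gopher paper, but they should be treated as assumptions here, as should the function name `passes_quality`. Only the `#` symbol ratio is checked, for brevity.

```python
# Assumed stop-word list and default bounds; verify against the paper
# or the library configuration before relying on them.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}


def passes_quality(
    text,
    min_words=50,
    max_words=100_000,
    min_avg_len=3.0,
    max_avg_len=10.0,
    max_hash_ratio=0.1,
    min_stop_words=2,
):
    """Return True if the text passes all of the simplified checks."""
    words = text.split()
    n = len(words)
    if n < min_words or n > max_words:
        return False
    avg_len = sum(len(w) for w in words) / n
    if not (min_avg_len <= avg_len <= max_avg_len):
        return False
    # Symbol-to-word ratio, using "#" as the only symbol in this sketch.
    if text.count("#") / n > max_hash_ratio:
        return False
    # Count distinct stop words, not total occurrences.
    found = {w.lower() for w in words} & STOP_WORDS
    return len(found) >= min_stop_words
```

Whitespace splitting is a simplification. A real pipeline would normally use a proper word tokenizer, which changes the counts slightly.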

Step 7: Write Filtered Output

Serialize the surviving documents to compressed JSONL files organized by dump identifier. Each task writes its output to a separate file using its rank as the filename to avoid write conflicts.
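
The per-rank naming scheme can be sketched with the standard library. The zero-padded `00003.jsonl.gz` filename format, the function names, and the document fields are assumptions for this example.

```python
import gzip
import json
import os
import tempfile


def write_shard(docs, base_dir, dump, rank):
    """Write documents to <base_dir>/<dump>/<rank>.jsonl.gz and return the path."""
    out_dir = os.path.join(base_dir, dump)
    os.makedirs(out_dir, exist_ok=True)
    # One file per rank, so parallel tasks never write to the same file.
    path = os.path.join(out_dir, f"{rank:05d}.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for doc in docs:
            handle.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return path


def read_shard(path):
    """Read a gzip-compressed JSONL file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shard = write_shard([{"text": "hello"}], tmp, "CC-MAIN-2023-50", 3)
        print(os.path.basename(shard), read_shard(shard))
```

Because the filename depends only on the rank, rerunning a failed task overwrites its own partial output and leaves other tasks' files untouched.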
