Principle:Huggingface Datatrove C4 Quality Filtering
| Sources | Domains | Last Updated |
|---|---|---|
| C4 / T5 (Raffel et al. 2020) | Data_Quality, NLP | 2026-02-14 00:00 GMT |
Overview
Line-level and document-level quality filtering rules from the C4 (Colossal Clean Crawled Corpus) paper, designed to clean web-crawled text by removing boilerplate lines while preserving high-quality content.
Description
C4 quality filtering operates at both line and document levels, applying a series of heuristic rules:
Line-level filters (lines failing these checks are removed, but the document is retained with remaining lines):
- Terminal punctuation: Remove lines that do not end in a terminal punctuation mark (., ?, !, ", or '). Lines ending with an ellipsis (...) are also removed.
- Minimum words per line: Remove lines containing fewer than 3 words.
- Maximum word length: Remove lines where any single word exceeds 1000 characters.
- JavaScript mentions: Remove lines containing the word "javascript" (case-insensitive).
- Policy phrases: Remove lines containing cookie/privacy policy phrases such as "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", or "use cookies".
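The line-level checks above can be sketched as a single function that reports why a line would be dropped. This is a minimal illustration and not Datatrove's actual code: the function name, the reason labels, the order of the checks, and the substring match for "javascript" are assumptions, while the thresholds and phrases come from the list above.

```python
from typing import Optional

END_PUNCTUATION = (".", "?", "!", '"', "'")
ELLIPSIS = "..."
POLICY_SUBSTRINGS = (
    "terms of use",
    "privacy policy",
    "cookie policy",
    "uses cookies",
    "use of cookies",
    "use cookies",
)


def line_rejection_reason(
    line: str, min_words: int = 3, max_word_length: int = 1000
) -> Optional[str]:
    """Return a label for the first failed check, or None if the line is kept.

    Hypothetical helper: the labels and the check order are illustrative.
    """
    line = line.strip()
    words = line.split()
    # Maximum word length: any single over-long word drops the line.
    if any(len(word) > max_word_length for word in words):
        return "word_too_long"
    # Terminal punctuation: must end in one of the marks, but not in "...".
    if not line.endswith(END_PUNCTUATION) or line.endswith(ELLIPSIS):
        return "no_terminal_punctuation"
    # Minimum words per line (whitespace-separated tokens, an assumption).
    if len(words) < min_words:
        return "too_few_words"
    lowered = line.lower()
    # JavaScript mentions, matched case-insensitively as a substring.
    if "javascript" in lowered:
        return "javascript"
    # Cookie and privacy policy phrases.
    if any(phrase in lowered for phrase in POLICY_SUBSTRINGS):
        return "policy"
    return None
```

For example, `line_rejection_reason("Home | About | Contact")` returns `"no_terminal_punctuation"`, while an ordinary sentence such as `"The cat sat on the mat."` returns `None` and is kept.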
Document-level filters (these cause the entire document to be removed):
- Minimum sentence count: Documents with fewer than 5 sentences remaining after line filtering are removed entirely.
- Lorem ipsum: Documents containing "lorem ipsum" are removed entirely.
- Curly brackets: Documents containing a curly bracket ({) are removed entirely, as this typically indicates code or template content.
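A rough sketch of the document-level checks follows. The sentence splitter here is a deliberately naive regular expression, assumed for illustration only; a production pipeline would typically use a proper sentence tokenizer. The function names and reason labels are hypothetical, and the input is assumed to be the text that remains after line filtering.

```python
import re
from typing import Optional

# Naive assumption: a sentence ends at ., ? or ! followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.?!])\s+")


def count_sentences(text: str) -> int:
    """Count non-empty chunks produced by the naive boundary split."""
    return len([s for s in SENTENCE_BOUNDARY.split(text.strip()) if s])


def document_rejection_reason(text: str, min_sentences: int = 5) -> Optional[str]:
    """Return a label for the first failed document check, or None to keep it."""
    if "lorem ipsum" in text.lower():
        return "lorem_ipsum"
    if "{" in text:
        return "curly_bracket"
    if count_sentences(text) < min_sentences:
        return "too_few_sentences"
    return None
```

Unlike the line-level checks, any non-`None` result here means the whole document is discarded rather than trimmed.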
Text modification:
- Citation removal: Wikipedia-style citations (e.g., [1], [edit], [citation needed]) are stripped from the text.
- Line reassembly: After line-level filtering, the document text is replaced with only the lines that passed all checks.
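The two text modifications can be illustrated together. In this sketch the citation pattern is an assumption that covers only the three forms named above, and `keep` is a hypothetical predicate standing in for the full set of line-level checks. Stripping citations before testing each line is also an assumed ordering.

```python
import re
from typing import Callable, Iterable

# Assumed pattern: numeric markers such as [1], plus [edit] and [citation needed].
CITATION_PATTERN = re.compile(r"\[\d+\]|\[edit\]|\[citation needed\]")


def strip_citations(line: str) -> str:
    """Remove Wikipedia-style citation markers from a single line."""
    return CITATION_PATTERN.sub("", line)


def reassemble(lines: Iterable[str], keep: Callable[[str], bool]) -> str:
    """Strip citations from each line, then join only the lines that pass `keep`."""
    cleaned = (strip_citations(line).strip() for line in lines)
    return "\n".join(line for line in cleaned if keep(line))


# Usage with a stand-in predicate that only checks for a final period.
page = ["History[edit]", "The town was founded in 1850.[2]", "Main menu"]
text = reassemble(page, keep=lambda line: line.endswith("."))
```

Here the heading and the menu entry fail the stand-in predicate, so `text` contains only the sentence about the town, with its citation marker removed.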
Usage
C4 filtering is used as a supplementary quality filter in production pipelines (e.g., FineWeb), alongside the Gopher filters. It is particularly effective at removing navigation boilerplate and non-content lines while preserving the usable portions of a page. Document-level-only filters, by contrast, must accept or reject each document as a whole.
Theoretical Basis
The heuristic rules originate from the T5/C4 paper (Raffel et al., 2020). The key design insight is line-level filtering: rather than accepting or rejecting entire documents, C4 filters individual lines and reconstructs the document from surviving lines. This approach:
- Preserves good content: A page with a few boilerplate lines (navigation, cookie notices) can still contribute its substantive content to the corpus.
- Removes navigation and boilerplate: Lines lacking terminal punctuation, containing JavaScript references, or mentioning cookie policies are typically non-content elements such as menus, script warnings, and consent banners.
- Handles mixed-quality pages: Many web pages contain a mix of high-quality article text and low-quality navigation/footer text. Line-level filtering separates these effectively.
The document-level rules (lorem ipsum, curly brackets, minimum sentences) handle cases where the entire page is low quality and no amount of line filtering can salvage it.
The reference implementation is based on the TensorFlow Datasets C4 utilities: [1].