Implementation: Huggingface Datatrove C4QualityFilter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove, C4 / T5 (Raffel et al. 2020) | Data_Quality, NLP | 2026-02-14 00:00 GMT |
Overview
Filter implementation that applies the C4 paper's heuristic rules at both line and document levels, removing boilerplate lines while preserving usable content, and rejecting entire documents that contain code-like content (curly brackets) or placeholder text (lorem ipsum).
Description
C4QualityFilter extends BaseFilter and implements a two-phase filtering approach. First, it iterates over all lines (split by paragraph or by sentence), applying line-level checks and keeping only the lines that pass. Then it applies document-level checks to the filtered result. The document text is modified in place -- after filtering, doc.text contains only the surviving lines joined together. This is unlike most other filters, which only accept or reject documents without modifying their content.
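The two-phase flow can be sketched in plain Python. This is an illustrative re-implementation of a few of the heuristics (minimum words per line, terminal punctuation, the javascript check, and the document-level rejections), not the datatrove code; the helper name `c4_quality_sketch` is hypothetical, and unlike the real filter it returns the filtered text rather than mutating a Document object:

```python
# Hypothetical sketch of C4-style two-phase filtering; not the datatrove
# implementation, and covering only a subset of its checks.
END_PUNCTUATION = (".", "?", "!", '"', "'")

def c4_quality_sketch(text, min_num_sentences=5, min_words_per_line=3):
    kept_lines = []
    # Phase 1: line-level checks -- keep only the lines that pass.
    for line in text.split("\n"):
        line = line.strip()
        if len(line.split()) < min_words_per_line:
            continue  # drop short boilerplate lines ("Home", "About us", ...)
        if not line.endswith(END_PUNCTUATION):
            continue  # drop lines without terminal punctuation
        if "javascript" in line.lower():
            continue  # drop "please enable javascript" style notices
        kept_lines.append(line)

    # Phase 2: document-level checks on the filtered result.
    new_text = "\n".join(kept_lines)
    if "lorem ipsum" in new_text.lower():
        return False, "lorem_ipsum"   # placeholder text: reject whole doc
    if "{" in new_text:
        return False, "curly_bracket"  # code-like content: reject whole doc
    if len(kept_lines) < min_num_sentences:
        return False, "too_few_sentences"
    return True, new_text
```

A page mixing article text with navigation links and a javascript notice keeps only its article sentences, while a document containing a curly bracket is rejected outright with a reason string.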
Usage
Used as supplementary quality filtering in production pipelines alongside Gopher filters. Particularly effective for cleaning web crawl data where pages mix article text with navigation, cookie notices, and JavaScript-related content.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/filters/c4_filters.py (L27-136)
Signature:
class C4QualityFilter(BaseFilter):
    def __init__(
        self,
        exclusion_writer: DiskWriter = None,
        split_paragraph: bool = True,
        remove_citations: bool = True,
        filter_no_terminal_punct: bool = True,
        min_num_sentences: int = 5,
        min_words_per_line: int = 3,
        max_word_length: int = 1000,
        filter_lorem_ipsum: bool = True,
        filter_javascript: bool = True,
        filter_curly_bracket: bool = True,
        filter_policy: bool = True,
        language: str = Languages.english,
    ):
Import:
from datatrove.pipeline.filters import C4QualityFilter
Key Constants:
CITATION_REGEX = re.compile(r"\[\d*]|\[edit]|\[citation needed]")
END_PUNCTUATION = (".", "?", "!", '"', "'")
ELLIPSIS = "..."
POLICY_SUBSTRINGS = [
"terms of use", "privacy policy", "cookie policy",
"uses cookies", "use of cookies", "use cookies",
]
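A quick look at how two of these constants behave, runnable with only the standard library (the sample sentence is invented for illustration):

```python
import re

# Constants as listed above.
CITATION_REGEX = re.compile(r"\[\d*]|\[edit]|\[citation needed]")
END_PUNCTUATION = (".", "?", "!", '"', "'")

line = "Python was created by Guido van Rossum[1][citation needed] in 1991."

# remove_citations=True applies this substitution to each line.
cleaned = CITATION_REGEX.sub("", line)
print(cleaned)  # Wikipedia-style citation markers are stripped

# filter_no_terminal_punct=True keeps lines that end in one of these characters.
print(cleaned.endswith(END_PUNCTUATION))  # True
```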
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| exclusion_writer | DiskWriter or None | No | Writer to save excluded documents (default None) |
| split_paragraph | bool | No | Split by newline (True) or by sentence tokenizer (False) (default True) |
| remove_citations | bool | No | Remove Wikipedia-style citations like [1], [edit] (default True) |
| filter_no_terminal_punct | bool | No | Remove lines without terminal punctuation (default True) |
| min_num_sentences | int | No | Min sentences after filtering; set to -1 to disable (default 5) |
| min_words_per_line | int | No | Min words per line; lines below this are removed (default 3) |
| max_word_length | int | No | Max chars per word; lines with longer words are removed; -1 to disable (default 1000) |
| filter_lorem_ipsum | bool | No | Remove entire document if "lorem ipsum" found (default True) |
| filter_javascript | bool | No | Remove lines containing "javascript" (default True) |
| filter_curly_bracket | bool | No | Remove entire document if "{" found (default True) |
| filter_policy | bool | No | Remove lines with cookie/policy phrases (default True) |
| language | str | No | Language for sentence tokenization (default Languages.english) |
Filter Input: Document with plain text in doc.text
Filter Output: bool or tuple[bool, str] -- returns True if the document passes (and doc.text is modified to contain only surviving lines), or (False, reason_string) if the document is rejected entirely. Reason strings include:
| Reason String | Cause |
|---|---|
| "lorem_ipsum" | Document contains "lorem ipsum" |
| "curly_bracket" | Document contains "{" |
| "too_few_sentences" | Fewer than min_num_sentences after line filtering |
Important: This filter modifies doc.text in place. After the filter runs, the document text contains only the lines that passed all line-level checks, joined by newlines (if split_paragraph=True) or spaces (if split_paragraph=False).
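The rejoining behavior can be illustrated without the library (a toy example showing only the separator choice; the real filter performs this join on doc.text):

```python
surviving = ["First kept sentence.", "Second kept sentence."]

# split_paragraph=True: units were split on newlines, so rejoin with "\n".
paragraph_mode = "\n".join(surviving)

# split_paragraph=False: units are sentences, so rejoin with a space.
sentence_mode = " ".join(surviving)

print(paragraph_mode)
print(sentence_mode)  # First kept sentence. Second kept sentence.
```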
Usage Examples
Example 1 -- Default C4 filtering:
from datatrove.pipeline.filters import C4QualityFilter
c4_filter = C4QualityFilter()
Example 2 -- Relaxed settings for shorter documents:
from datatrove.pipeline.filters import C4QualityFilter
c4_filter = C4QualityFilter(
    min_num_sentences=2,         # allow shorter docs
    min_words_per_line=2,        # allow shorter lines
    filter_curly_bracket=False,  # allow code content
)
Example 3 -- Sentence-level splitting instead of paragraph:
from datatrove.pipeline.filters import C4QualityFilter
c4_filter = C4QualityFilter(
    split_paragraph=False,  # use NLTK sentence tokenizer instead of newline split
)
Example 4 -- In a FineWeb-style pipeline:
from datatrove.pipeline.filters import (
    C4QualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
)
from datatrove.pipeline.readers import JsonlReader

pipeline = [
    JsonlReader("s3://my-bucket/input/"),
    GopherRepetitionFilter(),
    GopherQualityFilter(),
    C4QualityFilter(),  # applied after Gopher filters
]