Implementation:Huggingface Datatrove C4QualityFilter

From Leeroopedia
Sources: Huggingface Datatrove; C4 / T5 (Raffel et al. 2020)
Domains: Data_Quality, NLP
Last Updated: 2026-02-14 00:00 GMT

Overview

A filter implementation that applies the C4 paper's heuristic rules at both the line and document level: boilerplate lines are removed while usable content is preserved, and entire documents are rejected when they contain code or placeholder text.

Description

C4QualityFilter extends BaseFilter and implements a two-phase filtering approach. First, it iterates over all lines (split by newline or by sentence tokenizer, depending on split_paragraph), applying line-level checks and keeping only the lines that pass. It then applies document-level checks to the filtered result. The document text is modified in place: after filtering, doc.text contains only the surviving lines joined together. This is unlike most other filters, which only accept or reject documents without modifying their content.
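The two-phase flow can be sketched in plain Python. This is a simplified approximation for illustration, not datatrove's actual code: it implements only a subset of the line-level rules, treats each kept line as one sentence, and hard-codes the constants documented further down.

```python
# Simplified sketch of C4-style two-phase filtering (illustrative only,
# not the datatrove implementation).
POLICY_SUBSTRINGS = ["terms of use", "privacy policy", "cookie policy",
                     "uses cookies", "use of cookies", "use cookies"]
END_PUNCTUATION = (".", "?", "!", '"', "'")

def c4_style_filter(text, min_num_sentences=5, min_words_per_line=3):
    # Phase 1: line-level checks -- keep only lines that pass every rule.
    kept = []
    for line in text.split("\n"):
        line = line.strip()
        if len(line.split()) < min_words_per_line:
            continue  # too few words
        if not line.endswith(END_PUNCTUATION):
            continue  # no terminal punctuation
        lower = line.lower()
        if "javascript" in lower:
            continue  # javascript mention
        if any(p in lower for p in POLICY_SUBSTRINGS):
            continue  # cookie/policy boilerplate
        kept.append(line)

    # Phase 2: document-level checks on the filtered result.
    new_text = "\n".join(kept)
    lower = new_text.lower()
    if "lorem ipsum" in lower:
        return False, "lorem_ipsum", new_text
    if "{" in new_text:
        return False, "curly_bracket", new_text
    if len(kept) < min_num_sentences:  # crude: counts kept lines as sentences
        return False, "too_few_sentences", new_text
    return True, None, new_text
```

Note how the document-level checks run on the already-filtered text: a document whose only sentences were boilerplate ends up rejected as "too_few_sentences" even though the original text was long.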

Usage

Used as supplementary quality filtering in production pipelines alongside Gopher filters. Particularly effective for cleaning web crawl data where pages mix article text with navigation, cookie notices, and JavaScript-related content.

Code Reference

Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/filters/c4_filters.py (L27-136)

Signature:

class C4QualityFilter(BaseFilter):
    def __init__(
        self,
        exclusion_writer: DiskWriter = None,
        split_paragraph: bool = True,
        remove_citations: bool = True,
        filter_no_terminal_punct: bool = True,
        min_num_sentences: int = 5,
        min_words_per_line: int = 3,
        max_word_length: int = 1000,
        filter_lorem_ipsum: bool = True,
        filter_javascript: bool = True,
        filter_curly_bracket: bool = True,
        filter_policy: bool = True,
        language: str = Languages.english,
    ):

Import:

from datatrove.pipeline.filters import C4QualityFilter

Key Constants:

CITATION_REGEX = re.compile(r"\[\d*]|\[edit]|\[citation needed]")
END_PUNCTUATION = (".", "?", "!", '"', "'")
ELLIPSIS = "..."
POLICY_SUBSTRINGS = [
    "terms of use", "privacy policy", "cookie policy",
    "uses cookies", "use of cookies", "use cookies",
]

I/O Contract

Inputs:

All parameters are optional.

exclusion_writer (DiskWriter or None, default None) -- Writer to save excluded documents
split_paragraph (bool, default True) -- Split by newline (True) or by sentence tokenizer (False)
remove_citations (bool, default True) -- Remove Wikipedia-style citations like [1], [edit]
filter_no_terminal_punct (bool, default True) -- Remove lines without terminal punctuation
min_num_sentences (int, default 5) -- Min sentences after filtering; set to -1 to disable
min_words_per_line (int, default 3) -- Min words per line; lines below this are removed
max_word_length (int, default 1000) -- Max chars per word; lines with longer words are removed; -1 to disable
filter_lorem_ipsum (bool, default True) -- Remove entire document if "lorem ipsum" found
filter_javascript (bool, default True) -- Remove lines containing "javascript"
filter_curly_bracket (bool, default True) -- Remove entire document if "{" found
filter_policy (bool, default True) -- Remove lines with cookie/policy phrases
language (str, default Languages.english) -- Language for sentence tokenization

Filter Input: Document with plain text in doc.text

Filter Output: bool or tuple[bool, str] -- returns True if the document passes (and doc.text is modified to contain only surviving lines), or (False, reason_string) if the document is rejected entirely. Reason strings include:

"lorem_ipsum" -- document contains "lorem ipsum"
"curly_bracket" -- document contains "{"
"too_few_sentences" -- fewer than min_num_sentences sentences remain after line filtering

Important: This filter modifies doc.text in place. After the filter runs, the document text contains only the lines that passed all line-level checks, joined by newlines (if split_paragraph=True) or spaces (if split_paragraph=False).
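The in-place rewrite semantics can be illustrated with a stand-in document class (not datatrove's Document; the surviving lines are made up for demonstration):

```python
# Toy illustration of the in-place rewrite: surviving lines replace the
# original text, joined by "\n" (split_paragraph=True) or " " (False).
class Doc:
    def __init__(self, text):
        self.text = text

surviving = ["First kept sentence.", "Second kept sentence."]

doc = Doc("...original text with boilerplate lines...")
split_paragraph = True
doc.text = ("\n" if split_paragraph else " ").join(surviving)
print(repr(doc.text))  # 'First kept sentence.\nSecond kept sentence.'
```

Downstream pipeline steps therefore see the cleaned text, not the original, which matters if later filters compute statistics such as document length.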

Usage Examples

Example 1 -- Default C4 filtering:

from datatrove.pipeline.filters import C4QualityFilter

c4_filter = C4QualityFilter()

Example 2 -- Relaxed settings for shorter documents:

from datatrove.pipeline.filters import C4QualityFilter

c4_filter = C4QualityFilter(
    min_num_sentences=2,         # allow shorter docs
    min_words_per_line=2,        # allow shorter lines
    filter_curly_bracket=False,  # allow code content
)

Example 3 -- Sentence-level splitting instead of paragraph:

from datatrove.pipeline.filters import C4QualityFilter

c4_filter = C4QualityFilter(
    split_paragraph=False,  # use NLTK sentence tokenizer instead of newline split
)
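The difference between the two splitting modes can be shown on a small sample. The real filter uses an NLTK-based sentence tokenizer when split_paragraph=False; the regex below is only a crude stand-in for demonstration:

```python
import re

text = "First sentence. Second sentence.\nThird on a new line."

by_paragraph = text.split("\n")                 # split_paragraph=True
by_sentence = re.split(r"(?<=[.!?])\s+", text)  # split_paragraph=False (approx.)

print(by_paragraph)  # ['First sentence. Second sentence.', 'Third on a new line.']
print(by_sentence)   # ['First sentence.', 'Second sentence.', 'Third on a new line.']
```

Sentence-level splitting applies the line-level checks (such as min_words_per_line and terminal punctuation) to each sentence individually, so it filters more aggressively on long paragraphs that mix good and bad sentences.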

Example 4 -- In a FineWeb-style pipeline:

from datatrove.pipeline.filters import (
    C4QualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
)
from datatrove.pipeline.readers import JsonlReader

pipeline = [
    JsonlReader("s3://my-bucket/input/"),
    GopherRepetitionFilter(),
    GopherQualityFilter(),
    C4QualityFilter(),  # applied after Gopher filters
]
