Implementation:Huggingface Datatrove C4QualityFilter

From Leeroopedia
Sources: Huggingface Datatrove; C4 / T5 (Raffel et al. 2020)
Domains: Data_Quality, NLP
Last Updated: 2026-02-14 00:00 GMT

Overview

A filter implementation that applies the C4 paper's heuristic rules at both the line and document level: boilerplate lines are removed while usable content is preserved, and entire documents are rejected when they contain code or placeholder text.

Description

C4QualityFilter extends BaseFilter and implements a two-phase filtering approach. First, it iterates over all lines (split by newline or by sentence tokenizer, depending on split_paragraph), applying line-level checks and keeping only the lines that pass. It then applies document-level checks to the filtered result. The document text is modified in place: after filtering, doc.text contains only the surviving lines joined together. This is unlike most other filters, which only accept or reject documents without modifying their content.
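The two-phase flow can be sketched in plain Python. This is a simplified approximation for illustration, not datatrove's actual code: it implements only a subset of the line-level rules, treats each kept line as one sentence, and hard-codes the constants documented further down.

```python
# Simplified sketch of C4-style two-phase filtering (illustrative only,
# not the datatrove implementation).
POLICY_SUBSTRINGS = ["terms of use", "privacy policy", "cookie policy",
                     "uses cookies", "use of cookies", "use cookies"]
END_PUNCTUATION = (".", "?", "!", '"', "'")

def c4_style_filter(text, min_num_sentences=5, min_words_per_line=3):
    # Phase 1: line-level checks -- keep only lines that pass every rule.
    kept = []
    for line in text.split("\n"):
        line = line.strip()
        if len(line.split()) < min_words_per_line:
            continue  # too few words
        if not line.endswith(END_PUNCTUATION):
            continue  # no terminal punctuation
        lower = line.lower()
        if "javascript" in lower:
            continue  # javascript mention
        if any(p in lower for p in POLICY_SUBSTRINGS):
            continue  # cookie/policy boilerplate
        kept.append(line)

    # Phase 2: document-level checks on the filtered result.
    new_text = "\n".join(kept)
    lower = new_text.lower()
    if "lorem ipsum" in lower:
        return False, "lorem_ipsum", new_text
    if "{" in new_text:
        return False, "curly_bracket", new_text
    if len(kept) < min_num_sentences:  # crude: counts kept lines as sentences
        return False, "too_few_sentences", new_text
    return True, None, new_text
```

Note how the document-level checks run on the already-filtered text: a document whose only sentences were boilerplate ends up rejected as "too_few_sentences" even though the original text was long.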

Usage

Used as supplementary quality filtering in production pipelines alongside Gopher filters. Particularly effective for cleaning web crawl data where pages mix article text with navigation, cookie notices, and JavaScript-related content.

Code Reference

Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/filters/c4_filters.py (L27-136)

Signature:

class C4QualityFilter(BaseFilter):
    def __init__(
        self,
        exclusion_writer: DiskWriter = None,
        split_paragraph: bool = True,
        remove_citations: bool = True,
        filter_no_terminal_punct: bool = True,
        min_num_sentences: int = 5,
        min_words_per_line: int = 3,
        max_word_length: int = 1000,
        filter_lorem_ipsum: bool = True,
        filter_javascript: bool = True,
        filter_curly_bracket: bool = True,
        filter_policy: bool = True,
        language: str = Languages.english,
    ):

Import:

from datatrove.pipeline.filters import C4QualityFilter

Key Constants:

CITATION_REGEX = re.compile(r"\[\d*]|\[edit]|\[citation needed]")
END_PUNCTUATION = (".", "?", "!", '"', "'")
ELLIPSIS = "..."
POLICY_SUBSTRINGS = [
    "terms of use", "privacy policy", "cookie policy",
    "uses cookies", "use of cookies", "use cookies",
]

I/O Contract

Inputs:

All parameters are optional.

exclusion_writer (DiskWriter or None, default None) -- Writer to save excluded documents
split_paragraph (bool, default True) -- Split by newline (True) or by sentence tokenizer (False)
remove_citations (bool, default True) -- Remove Wikipedia-style citations like [1], [edit]
filter_no_terminal_punct (bool, default True) -- Remove lines without terminal punctuation
min_num_sentences (int, default 5) -- Min sentences after filtering; set to -1 to disable
min_words_per_line (int, default 3) -- Min words per line; lines below this are removed
max_word_length (int, default 1000) -- Max chars per word; lines with longer words are removed; -1 to disable
filter_lorem_ipsum (bool, default True) -- Remove entire document if "lorem ipsum" found
filter_javascript (bool, default True) -- Remove lines containing "javascript"
filter_curly_bracket (bool, default True) -- Remove entire document if "{" found
filter_policy (bool, default True) -- Remove lines with cookie/policy phrases
language (str, default Languages.english) -- Language for sentence tokenization

Filter Input: Document with plain text in doc.text

Filter Output: bool or tuple[bool, str] -- returns True if the document passes (and doc.text is modified to contain only surviving lines), or (False, reason_string) if the document is rejected entirely. Reason strings include:

"lorem_ipsum" -- document contains "lorem ipsum"
"curly_bracket" -- document contains "{"
"too_few_sentences" -- fewer than min_num_sentences sentences remain after line filtering

Important: This filter modifies doc.text in place. After the filter runs, the document text contains only the lines that passed all line-level checks, joined by newlines (if split_paragraph=True) or spaces (if split_paragraph=False).
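The in-place rewrite semantics can be illustrated with a stand-in document class (not datatrove's Document; the surviving lines are made up for demonstration):

```python
# Toy illustration of the in-place rewrite: surviving lines replace the
# original text, joined by "\n" (split_paragraph=True) or " " (False).
class Doc:
    def __init__(self, text):
        self.text = text

surviving = ["First kept sentence.", "Second kept sentence."]

doc = Doc("...original text with boilerplate lines...")
split_paragraph = True
doc.text = ("\n" if split_paragraph else " ").join(surviving)
print(repr(doc.text))  # 'First kept sentence.\nSecond kept sentence.'
```

Downstream pipeline steps therefore see the cleaned text, not the original, which matters if later filters compute statistics such as document length.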

Usage Examples

Example 1 -- Default C4 filtering:

from datatrove.pipeline.filters import C4QualityFilter

c4_filter = C4QualityFilter()

Example 2 -- Relaxed settings for shorter documents:

from datatrove.pipeline.filters import C4QualityFilter

c4_filter = C4QualityFilter(
    min_num_sentences=2,         # allow shorter docs
    min_words_per_line=2,        # allow shorter lines
    filter_curly_bracket=False,  # allow code content
)

Example 3 -- Sentence-level splitting instead of paragraph:

from datatrove.pipeline.filters import C4QualityFilter

c4_filter = C4QualityFilter(
    split_paragraph=False,  # use NLTK sentence tokenizer instead of newline split
)
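The difference between the two splitting modes can be shown on a small sample. The real filter uses an NLTK-based sentence tokenizer when split_paragraph=False; the regex below is only a crude stand-in for demonstration:

```python
import re

text = "First sentence. Second sentence.\nThird on a new line."

by_paragraph = text.split("\n")                 # split_paragraph=True
by_sentence = re.split(r"(?<=[.!?])\s+", text)  # split_paragraph=False (approx.)

print(by_paragraph)  # ['First sentence. Second sentence.', 'Third on a new line.']
print(by_sentence)   # ['First sentence.', 'Second sentence.', 'Third on a new line.']
```

Sentence-level splitting applies the line-level checks (such as min_words_per_line and terminal punctuation) to each sentence individually, so it filters more aggressively on long paragraphs that mix good and bad sentences.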

Example 4 -- In a FineWeb-style pipeline:

from datatrove.pipeline.filters import (
    C4QualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
)
from datatrove.pipeline.readers import JsonlReader

pipeline = [
    JsonlReader("s3://my-bucket/input/"),
    GopherRepetitionFilter(),
    GopherQualityFilter(),
    C4QualityFilter(),  # applied after Gopher filters
]
