Implementation:Huggingface Datatrove FineWebQualityFilter

Sources: Huggingface Datatrove, FineWeb Dataset Blog
Domains: Data_Quality, NLP
Last Updated: 2026-02-14 00:00 GMT

Overview

Filter implementation that applies four FineWeb-specific quality heuristics. It rejects documents in which too few lines end with terminal punctuation, too many lines are short, too many characters sit in duplicated lines, or the newline-to-word ratio is too high.

Description

FineWebQualityFilter extends BaseFilter and applies four empirically derived checks to catch low-quality content that passes through the Gopher and C4 filters. The filter splits the text into non-empty lines and evaluates the metrics one after another, rejecting the document at the first check that fails and reporting that check as the reason. For the character duplication check it reuses the find_duplicates function from the Gopher repetition filter module. The filter does not modify document text; it only accepts or rejects documents.
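The four checks described above can be sketched in plain Python. This is an illustrative re-implementation, not the datatrove source: the names `fineweb_style_check`, `count_duplicate_chars`, and `STOP_CHARS` are hypothetical, whitespace splitting stands in for the library's language-aware word tokenizer, and `STOP_CHARS` is a small assumed stand-in for TERMINAL_PUNCTUATION. The order of the checks, the reason strings, and the default thresholds follow the description and signature on this page.

```python
# Illustrative sketch of the four FineWeb-style checks; NOT the datatrove source.
# Assumptions: whitespace tokenization and a small punctuation tuple replace
# datatrove's word tokenizer and TERMINAL_PUNCTUATION.
STOP_CHARS = (".", "!", "?", '"', "'")  # hypothetical stand-in


def count_duplicate_chars(lines):
    """Sum the lengths of lines that repeat an earlier line exactly."""
    seen = set()
    dup_chars = 0
    for line in lines:
        if line in seen:
            dup_chars += len(line)
        else:
            seen.add(line)
    return dup_chars


def fineweb_style_check(
    text,
    line_punct_thr=0.12,
    line_punct_exclude_zero=False,
    short_line_thr=0.67,
    short_line_length=30,
    char_duplicates_ratio=0.01,
    new_line_ratio=0.3,
):
    """Return True, or (False, reason) for the first check that fails."""
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return False, "empty"

    # 1. Fraction of lines ending with terminal punctuation.
    punct = sum(line.endswith(STOP_CHARS) for line in lines) / len(lines)
    if punct < line_punct_thr and not (punct == 0 and line_punct_exclude_zero):
        return False, "line_punct_ratio"

    # 2. Fraction of short lines.
    short = sum(len(line) <= short_line_length for line in lines) / len(lines)
    if short > short_line_thr:
        return False, "short_line_ratio"

    # 3. Fraction of characters that belong to repeated lines.
    dup = count_duplicate_chars(lines) / len(text.replace("\n", ""))
    if dup > char_duplicates_ratio:
        return False, "char_dup_ratio"

    # 4. Newlines per word (whitespace split is an assumption).
    words = text.split()
    if text.count("\n") / len(words) > new_line_ratio:
        return False, "list_ratio"

    return True
```

Under these assumptions, a navigation-menu fragment such as `"Home\nAbout\nContact"` is rejected with `(False, "line_punct_ratio")`, because none of its three lines ends with terminal punctuation.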

Usage

Applied as the final quality filter stage in FineWeb-style production pipelines, after Gopher repetition filtering, Gopher quality filtering, and C4 quality filtering.

Code Reference

Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/filters/fineweb_quality_filter.py (L8-56)

Signature:

class FineWebQualityFilter(BaseFilter):
    def __init__(
        self,
        exclusion_writer: DiskWriter | None = None,
        line_punct_thr: float = 0.12,
        line_punct_exclude_zero: bool = False,
        stop_chars: tuple[str, ...] | None = None,
        short_line_thr: float = 0.67,
        short_line_length: int = 30,
        char_duplicates_ratio: float = 0.01,
        new_line_ratio: float = 0.3,
        language: str = Languages.english,
    ):

Import:

from datatrove.pipeline.filters import FineWebQualityFilter

Dependencies:

# Reuses duplicate detection from the Gopher repetition filter module
from datatrove.pipeline.filters.gopher_repetition_filter import find_duplicates
from datatrove.utils.text import TERMINAL_PUNCTUATION, split_into_words

I/O Contract

Inputs:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| exclusion_writer | DiskWriter or None | No | Writer that saves excluded documents (default None) |
| line_punct_thr | float | No | Minimum fraction of non-empty lines that must end with terminal punctuation (default 0.12) |
| line_punct_exclude_zero | bool | No | If True, documents in which no line ends with terminal punctuation are exempt from the punctuation check (default False) |
| stop_chars | tuple[str, ...] or None | No | Custom terminal punctuation characters (default None, which falls back to TERMINAL_PUNCTUATION) |
| short_line_thr | float | No | Maximum fraction of non-empty lines that may be short (default 0.67) |
| short_line_length | int | No | Length in characters at or below which a line counts as short (default 30) |
| char_duplicates_ratio | float | No | Maximum fraction of characters that may belong to duplicated lines (default 0.01) |
| new_line_ratio | float | No | Maximum ratio of newline characters to total words (default 0.3) |
| language | str | No | Language used for word tokenization (default Languages.english) |

Filter Input: Document with plain text in doc.text

Filter Output: bool or tuple[bool, str] -- returns True if the document passes all FineWeb quality checks, or (False, reason_string) if any check fails. Reason strings include:

| Reason String | Cause |
| --- | --- |
| "empty" | Document has no non-empty lines |
| "line_punct_ratio" | Terminal punctuation ratio is below line_punct_thr |
| "short_line_ratio" | Fraction of short lines exceeds short_line_thr |
| "char_dup_ratio" | Character duplication ratio exceeds char_duplicates_ratio |
| "list_ratio" | Newline-to-word ratio exceeds new_line_ratio |
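Because the return value is either a bare `True` or a `(False, reason_string)` tuple, code that calls `filter()` directly (for example in a notebook or a unit test) may want to normalize the two shapes before inspecting them. The following is a minimal sketch; `normalize_outcome` and `tally_rejections` are hypothetical helper names, not part of datatrove.

```python
from collections import Counter


def normalize_outcome(result):
    """Turn `True` or `(False, reason)` into a uniform `(passed, reason)` pair."""
    if isinstance(result, tuple):
        passed, reason = result
        return bool(passed), reason
    return bool(result), None


def tally_rejections(results):
    """Count how often each reason string occurs among rejected documents."""
    counts = Counter()
    for result in results:
        passed, reason = normalize_outcome(result)
        if not passed:
            # A bare False carries no reason, so group it under a placeholder.
            counts[reason or "unspecified"] += 1
    return counts
```

A tally like this can help when tuning thresholds, since it shows which of the checks accounts for most rejections on a sample.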

Usage Examples

Example 1 -- Default FineWeb heuristics:

from datatrove.pipeline.filters import FineWebQualityFilter

fineweb_filter = FineWebQualityFilter()

Example 2 -- Stricter thresholds for higher quality:

from datatrove.pipeline.filters import FineWebQualityFilter

fineweb_filter = FineWebQualityFilter(
    line_punct_thr=0.20,         # require more lines with punctuation
    short_line_thr=0.50,         # allow fewer short lines
    char_duplicates_ratio=0.005, # stricter duplication limit
    new_line_ratio=0.2,          # stricter list detection
)

Example 3 -- Exempt documents in which no line ends with terminal punctuation (e.g., poetry):

from datatrove.pipeline.filters import FineWebQualityFilter

fineweb_filter = FineWebQualityFilter(
    line_punct_exclude_zero=True,  # skip the punctuation check when no line ends with punctuation
)
)
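
The exemption is narrow: going by the parameter description, it applies only when the punctuation ratio is exactly zero, so a document with a small but non-zero ratio would still be rejected. A minimal sketch of that decision, assuming the two parameters combine as the I/O Contract describes (`fails_punct_check` is a hypothetical name, not a datatrove function):

```python
def fails_punct_check(ratio, line_punct_thr=0.12, line_punct_exclude_zero=False):
    """Return True when the terminal-punctuation check would reject a document."""
    # The exemption only covers documents with no punctuated lines at all.
    exempt = ratio == 0 and line_punct_exclude_zero
    return ratio < line_punct_thr and not exempt
```

With the flag enabled, a ratio of 0.0 passes this check while a ratio of 0.05 still fails it. Passing this check does not guarantee acceptance, since the remaining checks still apply.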

Example 4 -- Full FineWeb-style pipeline:

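from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import (
    C4QualityFilter,
    FineWebQualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

# Bucket paths are placeholders; replace them with real locations.
pipeline = [
    JsonlReader("s3://my-bucket/input/"),
    GopherRepetitionFilter(),
    GopherQualityFilter(),
    C4QualityFilter(),
    FineWebQualityFilter(),  # final quality gate
    JsonlWriter("s3://my-bucket/output/"),
]

# The list does nothing until an executor runs it. LocalPipelineExecutor is one
# option; tasks=4 is an illustrative value.
executor = LocalPipelineExecutor(pipeline=pipeline, tasks=4)
executor.run()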
