Implementation:Huggingface Datatrove FineWebQualityFilter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove, FineWeb Dataset Blog | Data_Quality, NLP | 2026-02-14 00:00 GMT |
Overview
Filter implementation that applies FineWeb-specific quality heuristics to remove documents in which too few lines end with terminal punctuation, too many lines are short, too many characters belong to duplicated lines, or the newline-to-word ratio is too high.
Description
FineWebQualityFilter extends BaseFilter and applies four empirically derived checks to catch low-quality content that passes through the Gopher and C4 filters. The filter splits the text into non-empty lines and evaluates the metrics in sequence, rejecting the document at the first check that fails. It reuses the find_duplicates function from the Gopher repetition filter module for the character duplication check. The filter does not modify document text; it only accepts or rejects documents.
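The sequence of checks described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: `fineweb_quality_check`, `duplicate_line_chars`, and `ASSUMED_STOP_CHARS` are hypothetical names, words are split on whitespace rather than with the language-specific tokenizer, and the exact comparison operators in datatrove may differ.

```python
# Stand-in punctuation set (assumption); the library defaults to TERMINAL_PUNCTUATION.
ASSUMED_STOP_CHARS = (".", "!", "?", '"', "'")


def duplicate_line_chars(lines):
    """Count the characters in lines that repeat an earlier line."""
    seen = set()
    dup_chars = 0
    for line in lines:
        if line in seen:
            dup_chars += len(line)
        else:
            seen.add(line)
    return dup_chars


def fineweb_quality_check(
    text,
    line_punct_thr=0.12,
    line_punct_exclude_zero=False,
    stop_chars=ASSUMED_STOP_CHARS,
    short_line_thr=0.67,
    short_line_length=30,
    char_duplicates_ratio=0.01,
    new_line_ratio=0.3,
):
    """Return True, or (False, reason) for the first check that fails."""
    lines = [line for line in text.split("\n") if line.strip()]
    if not lines:
        return False, "empty"

    # 1. Fraction of lines ending with terminal punctuation.
    punct_ratio = sum(line.endswith(stop_chars) for line in lines) / len(lines)
    exempt = line_punct_exclude_zero and punct_ratio == 0
    if punct_ratio < line_punct_thr and not exempt:
        return False, "line_punct_ratio"

    # 2. Fraction of short lines.
    short_ratio = sum(len(line) <= short_line_length for line in lines) / len(lines)
    if short_ratio > short_line_thr:
        return False, "short_line_ratio"

    # 3. Fraction of characters that sit in duplicated lines.
    dup_ratio = duplicate_line_chars(lines) / len(text.replace("\n", ""))
    if dup_ratio > char_duplicates_ratio:
        return False, "char_dup_ratio"

    # 4. Newlines per word (whitespace split is an assumption of this sketch).
    words = text.split()
    if text.count("\n") / len(words) > new_line_ratio:
        return False, "list_ratio"

    return True
```

Because each check returns as soon as it fails, a document that violates several heuristics is reported only under the first one in this order.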
Usage
Applied as the final quality filter stage in FineWeb-style production pipelines, after Gopher repetition filtering, Gopher quality filtering, and C4 quality filtering.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/filters/fineweb_quality_filter.py (L8-56)
Signature:
class FineWebQualityFilter(BaseFilter):
    def __init__(
        self,
        exclusion_writer: DiskWriter | None = None,
        line_punct_thr: float = 0.12,
        line_punct_exclude_zero: bool = False,
        stop_chars: tuple[str, ...] | None = None,
        short_line_thr: float = 0.67,
        short_line_length: int = 30,
        char_duplicates_ratio: float = 0.01,
        new_line_ratio: float = 0.3,
        language: str = Languages.english,
    ):
Import:
from datatrove.pipeline.filters import FineWebQualityFilter
Dependencies:
# Reuses duplicate detection from the Gopher repetition filter module
from datatrove.pipeline.filters.gopher_repetition_filter import find_duplicates
from datatrove.utils.text import TERMINAL_PUNCTUATION, split_into_words
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| exclusion_writer | DiskWriter or None | No | Writer to save excluded documents (default None) |
| line_punct_thr | float | No | Min fraction of non-empty lines ending with terminal punctuation (default 0.12) |
| line_punct_exclude_zero | bool | No | If True, documents in which no line ends with terminal punctuation are exempt from the line_punct_thr check (default False) |
| stop_chars | tuple[str, ...] or None | No | Custom terminal punctuation characters (default: TERMINAL_PUNCTUATION) |
| short_line_thr | float | No | Max fraction of lines that are short (default 0.67) |
| short_line_length | int | No | Character threshold defining a "short" line (default 30) |
| char_duplicates_ratio | float | No | Max fraction of characters in duplicate lines (default 0.01) |
| new_line_ratio | float | No | Max ratio of newline characters to total words (default 0.3) |
| language | str | No | Language for word tokenization (default Languages.english) |
Filter Input: Document with plain text in doc.text
Filter Output: bool or tuple[bool, str]. Returns True if the document passes all FineWeb quality checks, or (False, reason_string) for the check that fails. Reason strings include:
| Reason String | Cause |
|---|---|
| "empty" | Document has no non-empty lines |
| "line_punct_ratio" | Terminal punctuation ratio below line_punct_thr |
| "short_line_ratio" | Fraction of short lines exceeds short_line_thr |
| "char_dup_ratio" | Character duplication ratio exceeds char_duplicates_ratio |
| "list_ratio" | Newline-to-word ratio exceeds new_line_ratio |
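When calling the filter method directly, the mixed return type (a bare bool or a (False, reason) tuple) needs a little care. The helper below is a small sketch for normalizing such results and tallying rejection reasons; `normalize_result` and `tally_rejections` are hypothetical names introduced for illustration and are not part of datatrove.

```python
from collections import Counter


def normalize_result(result):
    """Turn True / False / (False, reason) into a (passed, reason) pair."""
    if isinstance(result, tuple):
        passed, reason = result
        return bool(passed), reason
    return bool(result), None


def tally_rejections(results):
    """Count how often each reason string appears among rejected documents."""
    counts = Counter()
    for result in results:
        passed, reason = normalize_result(result)
        if not passed:
            counts[reason or "unspecified"] += 1
    return counts
```

A tally like this gives a quick view of which heuristic removes the most documents, which helps when tuning thresholds.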
Usage Examples
Example 1 -- Default FineWeb heuristics:
from datatrove.pipeline.filters import FineWebQualityFilter
fineweb_filter = FineWebQualityFilter()
Example 2 -- Stricter thresholds for higher quality:
from datatrove.pipeline.filters import FineWebQualityFilter
fineweb_filter = FineWebQualityFilter(
    line_punct_thr=0.20,          # require more lines with punctuation
    short_line_thr=0.50,          # allow fewer short lines
    char_duplicates_ratio=0.005,  # stricter duplication limit
    new_line_ratio=0.2,           # stricter list detection
)
Example 3 -- Allow documents with zero punctuation (e.g., poetry):
from datatrove.pipeline.filters import FineWebQualityFilter
fineweb_filter = FineWebQualityFilter(
line_punct_exclude_zero=True, # don't penalize docs with 0% punctuation
)
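The exemption applies only when the punctuation ratio is exactly zero; a document with a small but non-zero ratio is still rejected. The sketch below illustrates that condition as read from the parameter descriptions. `fails_line_punct_check` is a hypothetical helper, and the assumption is that the library's comparison matches this logic.

```python
def fails_line_punct_check(punct_ratio, line_punct_thr=0.12, line_punct_exclude_zero=False):
    """Return True if a document with this punctuation ratio would be rejected."""
    # Only a ratio of exactly zero is exempt, and only when the flag is set.
    exempt = line_punct_exclude_zero and punct_ratio == 0
    return punct_ratio < line_punct_thr and not exempt


# Unpunctuated poem: rejected by default, kept with the flag enabled.
print(fails_line_punct_check(0.0))
print(fails_line_punct_check(0.0, line_punct_exclude_zero=True))
# 5% of lines punctuated: below the threshold and not exempt, so still rejected.
print(fails_line_punct_check(0.05, line_punct_exclude_zero=True))
```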
Example 4 -- Full FineWeb-style pipeline:
from datatrove.pipeline.filters import (
    C4QualityFilter,
    FineWebQualityFilter,
    GopherQualityFilter,
    GopherRepetitionFilter,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter
pipeline = [
    JsonlReader("s3://my-bucket/input/"),
    GopherRepetitionFilter(),
    GopherQualityFilter(),
    C4QualityFilter(),
    FineWebQualityFilter(),  # final quality gate
    JsonlWriter("s3://my-bucket/output/"),
]