Implementation:Huggingface Datatrove RegexFilter

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, Text Filtering, Pattern Matching
Last Updated	2026-02-14 17:00 GMT

Overview

RegexFilter is a document filter that drops documents whose text contains at least one match for a given regular expression pattern.

Description

RegexFilter extends BaseFilter to provide pattern-based document filtering using Python's re module. The filter takes a regular expression string at initialization, compiles it into a regex pattern object for efficiency, and then searches each document's text for any match. If the regex finds at least one match in the document text, the document is dropped (the filter returns False); if no match is found, the document is kept (the filter returns True).

This inverted logic (a match means drop) suits blocklist-style patterns that describe unwanted content, such as spam signatures, boilerplate text, or inappropriate content markers. The compiled regex is stored in the self.regex attribute, so the pattern is compiled once and reused for every document in the pipeline.

The filter is intentionally minimal at 29 lines, relying entirely on BaseFilter for pipeline integration, statistics tracking, and optional exclusion writing. It uses re.search rather than re.match, so the pattern can match anywhere in the document text, not just at the beginning.
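The match-means-drop rule described above can be sketched with the standard library alone. This is an illustrative stand-in, not the Datatrove source: Doc and SketchRegexFilter are hypothetical names, and everything BaseFilter provides (pipeline integration, statistics, exclusion writing) is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a Datatrove Document; only `text` is modelled."""
    text: str


class SketchRegexFilter:
    """Illustrative re-creation of the match-means-drop rule (not the library class)."""

    def __init__(self, regex_exp: str):
        # Compile once at construction so every document reuses the same pattern object.
        self.regex = re.compile(regex_exp)

    def filter(self, doc: Doc) -> bool:
        # search() scans the whole text; True means "keep", False means "drop".
        return self.regex.search(doc.text) is None


sketch = SketchRegexFilter(r"lorem ipsum")
keep_clean = sketch.filter(Doc("A normal paragraph."))
keep_spam = sketch.filter(Doc("Header text, then lorem ipsum filler."))

# match() only tries position 0, so it would miss the mid-text occurrence.
anchored_hit = re.match(r"lorem ipsum", "Header text, then lorem ipsum filler.")
```

Here keep_clean is True (no match, document kept) and keep_spam is False (match found, document dropped), while anchored_hit is None because the pattern does not occur at the start of the text.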

Usage

Use RegexFilter when you need to remove documents containing specific text patterns, such as boilerplate notices, spam indicators, or unwanted content markers. For keeping documents that match a pattern (rather than dropping them), you would need to invert the logic in a custom filter.
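As a rough illustration of the inverted case, the sketch below keeps documents that contain a match instead of dropping them. It uses only the standard library; Doc and KeepOnMatchFilter are hypothetical names introduced here, and in a real pipeline the same filter body would presumably live in a custom subclass of BaseFilter.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a Datatrove Document; only `text` is modelled."""
    text: str


class KeepOnMatchFilter:
    """Hypothetical inverse of RegexFilter: a match means keep."""

    def __init__(self, regex_exp: str):
        # Compile once, as RegexFilter is described as doing.
        self.regex = re.compile(regex_exp)

    def filter(self, doc: Doc) -> bool:
        # True (keep) only when the pattern occurs somewhere in the text.
        return self.regex.search(doc.text) is not None


docs = [Doc("def main(): pass"), Doc("plain prose only")]
keeper = KeepOnMatchFilter(r"\bdef\s+\w+\(")

# Only the document containing a function definition survives.
kept = [d.text for d in docs if keeper.filter(d)]
```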

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/filters/regex_filter.py
Lines: 1-29

Signature

class RegexFilter(BaseFilter):
    name = "🕵 Regex"

    def __init__(
        self,
        regex_exp: str,
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...

Import

from datatrove.pipeline.filters.regex_filter import RegexFilter

I/O Contract

Inputs

Name	Type	Required	Description
regex_exp	str	Yes	Regular expression pattern string; documents whose text contains a match for this pattern are dropped
exclusion_writer	DiskWriter	No	Optional writer for saving dropped documents

Outputs

Name	Type	Description
data	DocumentsPipeline (generator)	Yields documents whose text contains no match for the regex pattern

Usage Examples

Basic Usage

from datatrove.pipeline.filters.regex_filter import RegexFilter

# Drop documents containing email addresses
email_filter = RegexFilter(regex_exp=r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Drop documents containing "copyright" followed by a four-digit year (case-insensitive via the inline (?i) flag)
copyright_filter = RegexFilter(regex_exp=r"(?i)copyright\s+\d{4}")
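Before wiring a pattern into the filter, it can help to check what it actually matches. The sketch below runs the two patterns from the example through Python's re module directly; the sample strings are made up for illustration, and a True entry marks a text that the corresponding filter would drop.

```python
import re

# Same pattern strings as in the Basic Usage example.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
copyright_pattern = re.compile(r"(?i)copyright\s+\d{4}")

# Made-up sample texts, for illustration only.
samples = [
    "Contact me at jane.doe@example.com for details.",
    "COPYRIGHT 2021 Example Corp. All rights reserved.",
    "Copyright notice pending.",
    "An ordinary sentence.",
]

# True means the pattern occurs somewhere in the text.
email_hits = [bool(email_pattern.search(s)) for s in samples]
copyright_hits = [bool(copyright_pattern.search(s)) for s in samples]
```

Only the first sample matches the email pattern and only the second matches the copyright pattern. The third sample contains the word "Copyright" but no four-digit year, so it does not match.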
