Implementation:Huggingface Datatrove RegexFilter

From Leeroopedia
Knowledge Sources
Domains: Data Processing, Text Filtering, Pattern Matching
Last Updated: 2026-02-14 17:00 GMT

Overview

RegexFilter is a document filter that drops documents whose text contains at least one match for a given regular expression pattern.

Description

RegexFilter extends BaseFilter to provide pattern-based document filtering using Python's re module. The filter takes a regular expression string at initialization, compiles it into a regex pattern object for efficiency, and then searches each document's text for any match. If the regex finds at least one match in the document text, the document is dropped (the filter returns False); if no match is found, the document is kept (the filter returns True).
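The drop-on-match behaviour described above can be sketched with the standard re module alone. This is a minimal illustration under stated assumptions, not the library's source: make_drop_on_match is a hypothetical helper, and plain strings stand in for documents.

```python
import re


def make_drop_on_match(regex_exp: str):
    """Hypothetical helper: build a predicate that returns False (drop) on any match."""
    pattern = re.compile(regex_exp)  # compiled once, reused for every text

    def keep(text: str) -> bool:
        # True means keep, False means drop, mirroring the filter's return value
        return pattern.search(text) is None

    return keep


keep = make_drop_on_match(r"lorem ipsum")
results = [keep("Some real article text."), keep("Header lorem ipsum footer")]
```

Here the first text is kept and the second is dropped, because only the second contains the pattern.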

This match-means-drop logic suits blacklist-style patterns such as spam signatures, boilerplate text, inappropriate-content markers, or other unwanted text sequences. The compiled pattern is stored in the self.regex attribute, so it is compiled once and reused for every document in the pipeline.

The filter is intentionally minimal at 29 lines, relying entirely on BaseFilter for pipeline integration, statistics tracking, and optional exclusion writing. It uses re.search rather than re.match, so the pattern can match anywhere in the document text, not just at the beginning.
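The difference between searching and anchored matching can be seen with a short standard-library example. The pattern and sample text are made up for illustration.

```python
import re

pattern = re.compile(r"unsubscribe")
text = "Click here to unsubscribe from this list."

# match() only tries the pattern at the start of the string
anchored = pattern.match(text)

# search() scans the whole string for the first occurrence
anywhere = pattern.search(text)
```

With match, this text would slip through because the pattern does not occur at position 0. With search, the occurrence in the middle of the sentence is found.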

Usage

Use RegexFilter when you need to remove documents containing specific text patterns, such as boilerplate notices, spam indicators, or unwanted content markers. For keeping documents that match a pattern (rather than dropping them), you would need to invert the logic in a custom filter.
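One way to picture the inverted, keep-on-match logic is the sketch below. It is an assumption-laden illustration rather than datatrove code: KeepOnMatchSketch is a hypothetical class that does not inherit from BaseFilter, and SimpleNamespace objects stand in for documents with a text attribute.

```python
import re
from types import SimpleNamespace


class KeepOnMatchSketch:
    """Hypothetical stand-in, not a datatrove class: keeps documents that match."""

    def __init__(self, regex_exp: str):
        self.regex = re.compile(regex_exp)  # compile once and reuse

    def filter(self, doc) -> bool:
        # True keeps the document; a match means keep, the reverse of RegexFilter
        return self.regex.search(doc.text) is not None


# SimpleNamespace stands in for a document; only .text is used here
docs = [SimpleNamespace(text="License: MIT"), SimpleNamespace(text="No license info")]
keeper = KeepOnMatchSketch(r"License:\s+\w+")
kept = [d.text for d in docs if keeper.filter(d)]
```

In a real pipeline the same filter method would presumably live in a BaseFilter subclass so that statistics tracking and exclusion writing keep working.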

Code Reference

Source Location

Signature

class RegexFilter(BaseFilter):
    name = "🕵 Regex"

    def __init__(
        self,
        regex_exp: str,
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...

Import

from datatrove.pipeline.filters.regex_filter import RegexFilter

I/O Contract

Inputs

Name | Type | Required | Description
regex_exp | str | Yes | Regular expression pattern string; documents matching this pattern are dropped
exclusion_writer | DiskWriter | No | Optional writer for saving dropped documents

Outputs

Name | Type | Description
data | DocumentsPipeline (generator) | Yields documents whose text does not match the regex pattern

Usage Examples

Basic Usage

from datatrove.pipeline.filters.regex_filter import RegexFilter

# Drop documents containing email addresses
email_filter = RegexFilter(regex_exp=r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Drop documents containing "copyright <year>" notices (case-insensitive via the inline (?i) flag)
copyright_filter = RegexFilter(regex_exp=r"(?i)copyright\s+\d{4}")
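Before wiring patterns like these into a pipeline, it can help to check what they match using plain re. The sketch below reuses the two patterns from the example above; the sample texts are invented for illustration.

```python
import re

email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
copyright_re = re.compile(r"(?i)copyright\s+\d{4}")

# Made-up sample texts keyed by a short label
samples = {
    "contact": "Write to jane.doe@example.com for details.",
    "notice": "COPYRIGHT 2021 Example Corp. All rights reserved.",
    "clean": "A plain paragraph with no contact details.",
}

# Labels of the texts each pattern would cause to be dropped
dropped_by_email = [k for k, t in samples.items() if email_re.search(t)]
dropped_by_copyright = [k for k, t in samples.items() if copyright_re.search(t)]
```

The "clean" sample matches neither pattern, so a filter built from either one would keep it.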
