Implementation:Huggingface Datatrove RegexFilter
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Text Filtering, Pattern Matching |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
RegexFilter is a document filter that drops documents whose text contains at least one match for a given regular expression pattern.
Description
RegexFilter extends BaseFilter to provide pattern-based document filtering using Python's re module. The filter takes a regular expression string at initialization, compiles it into a regex pattern object for efficiency, and then searches each document's text for any match. If the regex finds at least one match in the document text, the document is dropped (the filter returns False); if no match is found, the document is kept (the filter returns True).
This match-means-drop logic makes the filter suited to blocklist-style patterns that describe unwanted content: spam signatures, boilerplate text, inappropriate content markers, or other undesirable text sequences. The compiled regex is stored as the self.regex attribute, so the pattern is compiled only once and reused across all documents in the pipeline.
The filter is intentionally minimal at 29 lines, relying entirely on BaseFilter for pipeline integration, statistics tracking, and optional exclusion writing. It uses re.search rather than re.match, so the pattern can match anywhere in the document text, not just at the beginning.
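The keep/drop decision described above can be sketched with the standard library alone. The helper name `keeps_document` and the sample strings are hypothetical and not part of Datatrove; the sketch only mirrors the described behavior (compile once, call `search`, drop on a match) and contrasts it with `match`.

```python
import re

# Compile once, as the filter is described as doing at initialization.
pattern = re.compile(r"lorem ipsum")


def keeps_document(text: str) -> bool:
    """Hypothetical helper: True means keep, False means drop."""
    # search() scans the whole text, so a match anywhere triggers a drop.
    return pattern.search(text) is None


mid_text = "Intro paragraph. lorem ipsum dolor sit amet."
clean_text = "A document without placeholder text."

# match() only tests the start of the string, so it misses this occurrence,
# while search() finds it.
found_by_match = pattern.match(mid_text) is not None
found_by_search = pattern.search(mid_text) is not None

print(keeps_document(mid_text), keeps_document(clean_text))
print(found_by_match, found_by_search)
```

Using `search` is what lets a blocklist pattern catch unwanted text in the middle or at the end of a document, not only at its start.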
Usage
Use RegexFilter when you need to remove documents containing specific text patterns, such as boilerplate notices, spam indicators, or unwanted content markers. To keep documents that match a pattern rather than drop them, invert the logic in a custom filter.
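A keep-on-match variant might look like the following minimal sketch. `KeepIfMatchFilter` and `StubDocument` are hypothetical names used so the example runs without Datatrove installed; in a real pipeline the class would presumably subclass BaseFilter and pass `exclusion_writer` to it, as RegexFilter's signature suggests.

```python
import re
from dataclasses import dataclass


@dataclass
class StubDocument:
    """Stand-in for datatrove's Document; only the text field is modelled."""
    text: str


class KeepIfMatchFilter:
    """Hypothetical inverse of RegexFilter: keep on match, drop otherwise."""

    def __init__(self, regex_exp: str):
        # Compile once and reuse for every document.
        self.regex = re.compile(regex_exp)

    def filter(self, doc: StubDocument) -> bool:
        # True (keep) only when the pattern occurs somewhere in the text.
        return self.regex.search(doc.text) is not None


keep_licensed = KeepIfMatchFilter(r"(?i)\bMIT license\b")
docs = [
    StubDocument("Released under the MIT License."),
    StubDocument("No licensing information."),
]
kept = [d.text for d in docs if keep_licensed.filter(d)]
print(kept)
```

The only change from the described RegexFilter behavior is the direction of the boolean returned by `filter`.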
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/regex_filter.py
- Lines: 1-29
Signature
class RegexFilter(BaseFilter):
    name = "🕵 Regex"

    def __init__(
        self,
        regex_exp: str,
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, doc: Document) -> bool:
        ...
Import
from datatrove.pipeline.filters.regex_filter import RegexFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| regex_exp | str | Yes | Regular expression pattern string; documents whose text contains a match are dropped |
| exclusion_writer | DiskWriter | No | Optional writer for saving dropped documents |
Outputs
| Name | Type | Description |
|---|---|---|
| data | DocumentsPipeline (generator) | Yields documents whose text contains no match for the regex pattern |
Usage Examples
Basic Usage
from datatrove.pipeline.filters.regex_filter import RegexFilter
# Drop documents containing email addresses
email_filter = RegexFilter(regex_exp=r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Drop documents containing "copyright" notices (case-insensitive via the inline (?i) flag)
copyright_filter = RegexFilter(regex_exp=r"(?i)copyright\s+\d{4}")
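Before wiring patterns like these into a pipeline, it can help to check them against a few sample strings using `re` directly. The sketch below assumes the filter's decision is equivalent to `re.search` on the document text, as described earlier; the helper `would_drop` and the sample texts are illustrative only.

```python
import re

# The same two patterns used in the examples above.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
copyright_pattern = re.compile(r"(?i)copyright\s+\d{4}")

# Illustrative sample texts, not real data.
samples = {
    "contact": "Write to jane.doe@example.com for details.",
    "notice": "COPYRIGHT 2021 Example Corp. All rights reserved.",
    "plain": "A short paragraph with no addresses or legal text.",
}


def would_drop(pattern: re.Pattern, text: str) -> bool:
    """Hypothetical helper: True when a match-means-drop filter would drop the text."""
    return pattern.search(text) is not None


email_drops = {name: would_drop(email_pattern, text) for name, text in samples.items()}
copyright_drops = {name: would_drop(copyright_pattern, text) for name, text in samples.items()}

print(email_drops)
print(copyright_drops)
```

A quick check like this makes over-broad patterns visible early, since a pattern that matches too much would silently drop documents that should be kept.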