Implementation:Datajuicer Data juicer RemoveWordsWithIncorrectSubstringsMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for removing words that contain incorrect substrings.
Description
RemoveWordsWithIncorrectSubstringsMapper is a mapper operator that removes words containing specified incorrect substrings from text samples. By default, it targets URL-related patterns: "http", "www", ".com", "href", and "//". It operates in one of two modes: tokenized mode uses a SentencePiece tokenizer to split the text into tokens, while non-tokenized mode splits on whitespace, tab, and newline boundaries. In both modes, a word is dropped if, after its special characters are stripped, it contains any of the specified substrings; the remaining words are then reassembled into the text. The operator processes samples in batches.
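The non-tokenized mode described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: the helper names `should_keep_word` and `remove_words` are hypothetical, and `SPECIAL_CHARS` is an assumed stand-in (ASCII punctuation plus whitespace) for whatever special-character set Data-Juicer strips.

```python
import string
from typing import List, Optional

# Default substrings as listed in the description.
DEFAULT_SUBSTRINGS = ["http", "www", ".com", "href", "//"]

# Assumption: a stand-in for the library's special-character set.
SPECIAL_CHARS = string.punctuation + string.whitespace


def should_keep_word(word: str, substrings: List[str]) -> bool:
    """Keep a word only if, once edge special characters are stripped,
    it contains none of the incorrect substrings."""
    stripped = word.strip(SPECIAL_CHARS)
    return all(sub not in stripped for sub in substrings)


def remove_words(text: str, substrings: Optional[List[str]] = None) -> str:
    """Split on newline, tab, and space; filter words; reassemble."""
    subs = DEFAULT_SUBSTRINGS if substrings is None else substrings
    lines = []
    for line in text.split("\n"):
        cells = []
        for cell in line.split("\t"):
            # Empty strings from repeated spaces are skipped.
            words = [w for w in cell.split(" ") if w and should_keep_word(w, subs)]
            cells.append(" ".join(words))
        lines.append("\t".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(remove_words("Visit https://example.org/page for more info"))
```

Splitting in three nested levels lets the sketch put tabs and newlines back where they were, so only the offending words disappear from the text.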
Usage
Use when cleaning web-scraped data where embedded URLs, HTML artifacts, and link fragments pollute the natural language content.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/remove_words_with_incorrect_substrings_mapper.py
Signature
@OPERATORS.register_module("remove_words_with_incorrect_substrings_mapper")
class RemoveWordsWithIncorrectSubstringsMapper(Mapper):
    def __init__(
        self, lang: str = "en", tokenization: bool = False, substrings: Optional[List[str]] = None, *args, **kwargs
    ):
Import
from data_juicer.ops.mapper.remove_words_with_incorrect_substrings_mapper import RemoveWordsWithIncorrectSubstringsMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lang | str | No | Language of the sample text (default: "en") |
| tokenization | bool | No | Whether to use a SentencePiece model to tokenize documents (default: False) |
| substrings | Optional[List[str]] | No | List of incorrect substrings to filter (default: None, which selects ["http", "www", ".com", "href", "//"]) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with words containing incorrect substrings removed |
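The batched contract in the tables above can be illustrated with a small sketch. It assumes that a batch is a column-oriented dict of lists and that the text lives under a `"text"` key; both are assumptions made for illustration, and `apply_batched` and `drop_url_words` are hypothetical helpers, not Data-Juicer functions.

```python
from typing import Callable, Dict, List


def apply_batched(
    samples: Dict[str, List],
    transform: Callable[[str], str],
    text_key: str = "text",  # assumed key name
) -> Dict[str, List]:
    """Apply a per-text transform to every entry of the text column,
    leaving all other columns untouched."""
    out = dict(samples)  # shallow copy so the input dict is not mutated
    out[text_key] = [transform(text) for text in samples[text_key]]
    return out


def drop_url_words(text: str) -> str:
    """Toy transform: drop space-separated words that contain 'http'."""
    return " ".join(w for w in text.split(" ") if "http" not in w)


if __name__ == "__main__":
    batch = {"text": ["read http://a.b now", "clean text"], "id": [1, 2]}
    print(apply_batched(batch, drop_url_words))
```

The output has the same keys and the same number of rows as the input; only the contents of the text column change.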
Usage Examples
process:
  - remove_words_with_incorrect_substrings_mapper:
      lang: 'en'
      tokenization: false
      substrings:
        - 'http'
        - 'www'
        - '.com'
        - 'href'
        - '//'