Implementation:Datajuicer Data juicer RemoveWordsWithIncorrectSubstringsMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for removing words that contain incorrect substrings.

Description

RemoveWordsWithIncorrectSubstringsMapper is a mapper operator that removes words containing specified incorrect substrings from text samples. By default, it targets URL-related patterns: "http", "www", ".com", "href", and "//". It operates in one of two modes: with tokenization enabled, a SentencePiece tokenizer splits the text into tokens; with tokenization disabled (the default), the text is split on whitespace, tab, and newline boundaries. In both modes, each word is first stripped of special characters and is dropped if it then contains any of the specified substrings; the remaining words are reassembled into text. The operator runs in batched mode.
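
The non-tokenized mode described above can be approximated in plain Python. This is a minimal sketch, not the operator's actual code: the helper names `keep_word` and `remove_words` are hypothetical, and `string.punctuation` is assumed as a stand-in for the operator's own special-character set, which this page does not list.

```python
import string
from typing import List, Optional

# Default substrings as documented for the operator.
DEFAULT_SUBSTRINGS = ["http", "www", ".com", "href", "//"]


def keep_word(word: str, substrings: List[str]) -> bool:
    """Return True if the word contains none of the substrings.

    Leading and trailing punctuation is stripped first. ASSUMPTION:
    string.punctuation approximates the operator's special characters.
    """
    core = word.strip(string.punctuation)
    return all(s not in core for s in substrings)


def remove_words(text: str, substrings: Optional[List[str]] = None) -> str:
    """Split on newline, tab, and space; filter words; rejoin in the same layout."""
    subs = DEFAULT_SUBSTRINGS if substrings is None else substrings
    lines = []
    for line in text.split("\n"):
        cells = []
        for cell in line.split("\t"):
            words = [w for w in cell.split(" ") if keep_word(w, subs)]
            cells.append(" ".join(words))
        lines.append("\t".join(cells))
    return "\n".join(lines)


cleaned = remove_words("Visit http://example.com for more info")
layout = remove_words("see www.site.org\tclick href=x\nplain line")
```

Splitting in three nested levels and rejoining with the same separators keeps the original line and tab layout intact, so only the offending words disappear.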

Usage

Use when cleaning web-scraped data where embedded URLs, HTML artifacts, and link fragments pollute the natural language content.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/remove_words_with_incorrect_substrings_mapper.py

Signature

@OPERATORS.register_module("remove_words_with_incorrect_substrings_mapper")
class RemoveWordsWithIncorrectSubstringsMapper(Mapper):
    def __init__(
        self, lang: str = "en", tokenization: bool = False, substrings: Optional[List[str]] = None, *args, **kwargs
    ):

Import

from data_juicer.ops.mapper.remove_words_with_incorrect_substrings_mapper import RemoveWordsWithIncorrectSubstringsMapper

I/O Contract

Inputs

Name	Type	Required	Description
lang	str	No	Language of the sample text (default: "en")
tokenization	bool	No	Whether to use a SentencePiece model to tokenize documents (default: False)
substrings	Optional[List[str]]	No	List of incorrect substrings to filter (default: ["http", "www", ".com", "href", "//"])

Outputs

Name	Type	Description
samples	Dict	Transformed samples with words containing incorrect substrings removed
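
As an illustration of the batched contract, the sketch below applies a cleaning function to a column-oriented batch. It assumes the text lives under a `"text"` key and that a batch maps each field name to a list of values; `apply_batched` and `drop_flagged_words` are hypothetical helpers written for this example, not part of Data-Juicer.

```python
from typing import Callable, Dict, List


def apply_batched(
    samples: Dict[str, List],
    clean: Callable[[str], str],
    text_key: str = "text",  # ASSUMPTION: the text column is named "text"
) -> Dict[str, List]:
    """Apply `clean` to every text in the batch; other columns are untouched."""
    samples[text_key] = [clean(text) for text in samples[text_key]]
    return samples


def drop_flagged_words(text: str) -> str:
    """Toy cleaner: split on single spaces and drop words with a flagged substring."""
    flagged = ["http", "www", ".com", "href", "//"]
    return " ".join(w for w in text.split(" ") if not any(f in w for f in flagged))


batch = {"text": ["see http://a.b now", "clean text"], "id": [1, 2]}
out = apply_batched(batch, drop_flagged_words)
```

The output has the same keys and the same number of entries per column as the input; only the strings in the text column change.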

Usage Examples

process:
  - remove_words_with_incorrect_substrings_mapper:
      lang: 'en'
      tokenization: false
      substrings:
        - 'http'
        - 'www'
        - '.com'
        - 'href'
        - '//'

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
