Implementation:Datajuicer Data juicer RemoveWordsWithIncorrectSubstringsMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for removing words that contain incorrect substrings.

Description

RemoveWordsWithIncorrectSubstringsMapper is a mapper operator that removes words containing specified incorrect substrings from text samples. By default, it targets URL-related patterns: "http", "www", ".com", "href", and "//". It operates in one of two modes: tokenized mode uses a SentencePiece tokenizer to split text into tokens, while non-tokenized mode splits on whitespace, tab, and newline boundaries. In both modes, each word is stripped of special characters and then dropped if it contains any of the specified substrings; the remaining words are reassembled into the text. The operator runs in batched mode.
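The non-tokenized mode described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: the helper names `keep_word` and `remove_words` are hypothetical, and `string.punctuation` stands in for the special-character set that Data-Juicer strips, which is an assumption.

```python
import string

# Default substrings as listed in this page's I/O contract.
DEFAULT_SUBSTRINGS = ["http", "www", ".com", "href", "//"]


def keep_word(word, substrings):
    # Assumption: ASCII punctuation approximates the operator's own
    # special-character set, which is not reproduced here.
    core = word.strip(string.punctuation)
    return all(s not in core for s in substrings)


def remove_words(text, substrings=None):
    # Hypothetical helper: split on newline, then tab, then space,
    # filter each word, and rejoin with the same separators.
    substrings = DEFAULT_SUBSTRINGS if substrings is None else substrings
    lines = []
    for line in text.split("\n"):
        cells = []
        for cell in line.split("\t"):
            words = [w for w in cell.split(" ") if keep_word(w, substrings)]
            cells.append(" ".join(words))
        lines.append("\t".join(cells))
    return "\n".join(lines)


cleaned = remove_words("Visit http://example.com for details\nsee www.site.org\tnow")
```

Splitting and rejoining on the same three separators keeps the line and tab layout of the input intact, so only the offending words disappear.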

Usage

Use this operator when cleaning web-scraped data in which embedded URLs, HTML artifacts, and link fragments pollute the natural-language content.

Code Reference

Source Location

  • Repository: Datajuicer_Data_juicer
  • File: data_juicer/ops/mapper/remove_words_with_incorrect_substrings_mapper.py

Signature

@OPERATORS.register_module("remove_words_with_incorrect_substrings_mapper")
class RemoveWordsWithIncorrectSubstringsMapper(Mapper):
    def __init__(
        self, lang: str = "en", tokenization: bool = False, substrings: Optional[List[str]] = None, *args, **kwargs
    ):

Import

from data_juicer.ops.mapper.remove_words_with_incorrect_substrings_mapper import RemoveWordsWithIncorrectSubstringsMapper

I/O Contract

Inputs

Name | Type | Required | Description
lang | str | No | Language of the sample text (default: "en")
tokenization | bool | No | Whether to use a SentencePiece model to tokenize documents (default: False)
substrings | Optional[List[str]] | No | List of incorrect substrings to filter (default: ["http", "www", ".com", "href", "//"])

Outputs

Name | Type | Description
samples | Dict | Transformed samples with words containing incorrect substrings removed
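To illustrate the batched input and output shape, the sketch below models a batch as a column-oriented dictionary mapping each field name to a list of values. The function name `process_batched_sketch`, the `"text"` key, and the simplified space-only splitting are assumptions made for illustration; the real operator also strips special characters and handles tabs and newlines.

```python
from typing import Dict, List


def process_batched_sketch(
    samples: Dict[str, list],
    substrings: List[str],
    text_key: str = "text",  # assumed field name for the text column
) -> Dict[str, list]:
    cleaned = []
    for text in samples[text_key]:
        # Simplified filter: drop any space-separated word containing a substring.
        kept = [w for w in text.split(" ") if not any(s in w for s in substrings)]
        cleaned.append(" ".join(kept))
    # Other columns pass through unchanged; only the text column is rewritten.
    return {**samples, text_key: cleaned}


batch = {"text": ["see www.a.org now", "plain text"], "id": [1, 2]}
result = process_batched_sketch(batch, ["http", "www", ".com", "href", "//"])
```

Each list in the dictionary keeps its length, so row alignment between the text column and any other columns is preserved.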

Usage Examples

process:
  - remove_words_with_incorrect_substrings_mapper:
      lang: 'en'
      tokenization: false
      substrings:
        - 'http'
        - 'www'
        - '.com'
        - 'href'
        - '//'
