Implementation:Datajuicer Data juicer RemoveSpecificCharsMapper

Knowledge Sources
Domains: Data_Processing, Mapping
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for removing specific characters from text samples.

Description

RemoveSpecificCharsMapper is a mapper operator that removes a configurable set of characters from text samples. The characters can be given as a single string or as a list of strings. The default set is ◆●■►▼▲▴∆▻▷❖♡□, which covers decorative symbols such as bullets, triangular arrows, and a heart suit. The operator builds a regex character class from the specified characters and applies re.sub with the DOTALL flag, removing every matching character in a batched operation.
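The character-class approach described above can be illustrated with a minimal standalone sketch. This is not the Data-Juicer source: the helper names `build_pattern` and `remove_chars_batched`, the `DEFAULT_CHARS` constant, the `"text"` key, and the use of `re.escape` are all assumptions made for illustration. The DOTALL flag is passed only to mirror the description; it changes how `.` matches and has no effect on a plain character class.

```python
import re
from typing import Dict, List, Optional, Union

# Default set copied from the documented signature.
DEFAULT_CHARS = "◆●■►▼▲▴∆▻▷❖♡□"


def build_pattern(
    chars_to_remove: Union[str, List[str]] = DEFAULT_CHARS,
) -> Optional[re.Pattern]:
    """Hypothetical helper: compile a character class, or return None if empty."""
    if not chars_to_remove:
        return None
    # Escaping (an assumption of this sketch) keeps characters such as
    # ']' or '^' literal inside the class.
    escaped = "".join(re.escape(ch) for ch in chars_to_remove)
    return re.compile("[" + escaped + "]", flags=re.DOTALL)


def remove_chars_batched(
    samples: Dict[str, List[str]],
    pattern: Optional[re.Pattern],
    text_key: str = "text",  # assumed column name for the text field
) -> Dict[str, List[str]]:
    """Hypothetical helper: strip matching characters from every text in a batch."""
    if pattern is None:
        return samples
    samples[text_key] = [pattern.sub("", text) for text in samples[text_key]]
    return samples


if __name__ == "__main__":
    batch = {"text": ["◆ Item one ► details", "plain text"]}
    print(remove_chars_batched(batch, build_pattern()))
```

Only the listed symbols are removed; surrounding whitespace is left untouched, so a follow-up whitespace normalization step may be useful after this one.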

Usage

Use this operator to strip decorative or non-semantic characters, which are common in web-scraped content, from text training data.

Code Reference

Source Location

Signature

@OPERATORS.register_module("remove_specific_chars_mapper")
class RemoveSpecificCharsMapper(Mapper):
    def __init__(self, chars_to_remove: Union[str, List[str]] = "◆●■►▼▲▴∆▻▷❖♡□", *args, **kwargs):

Import

from data_juicer.ops.mapper.remove_specific_chars_mapper import RemoveSpecificCharsMapper

I/O Contract

Inputs

Name | Type | Required | Description
chars_to_remove | Union[str, List[str]] | No | Characters to remove from text (default: decorative symbols)

Outputs

Name | Type | Description
samples | Dict | Transformed samples with specified characters removed

Usage Examples

process:
  - remove_specific_chars_mapper:
      chars_to_remove: '◆●■►▼▲▴∆▻▷❖♡□'
