Implementation:Datajuicer Data juicer RemoveSpecificCharsMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for removing specific characters from text samples.
Description
RemoveSpecificCharsMapper is a mapper operator that removes a configurable set of characters from text samples. The characters to remove can be given as a single string or as a list of strings. By default, it removes the decorative symbols ◆●■►▼▲▴∆▻▷❖♡□ (bullets, arrows, and similar glyphs). It builds a regex character class from the specified characters and, as a batched operation, applies re.sub with the re.DOTALL flag to delete every matching character from each text in the batch.
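The mechanism described above can be sketched in plain Python without importing Data-Juicer. This is a minimal illustration, not the operator's actual source: the helper names `build_pattern` and `remove_chars` are hypothetical, and the use of `re.escape` is an assumption added here so that characters such as `]` or `-` are treated literally inside the character class.

```python
import re
from typing import List, Optional, Pattern, Union

# Default set copied from the operator's signature.
DEFAULT_CHARS = "◆●■►▼▲▴∆▻▷❖♡□"


def build_pattern(
    chars_to_remove: Union[str, List[str]] = DEFAULT_CHARS,
) -> Optional[Pattern[str]]:
    """Build a regex character class from a string or a list of strings.

    Hypothetical helper: a list is flattened, so every individual
    character in it becomes a member of the class.
    """
    chars = "".join(chars_to_remove)
    if not chars:
        return None  # nothing to remove
    # re.escape is an assumption of this sketch, not confirmed by the source.
    return re.compile("[" + re.escape(chars) + "]", flags=re.DOTALL)


def remove_chars(text: str, pattern: Optional[Pattern[str]]) -> str:
    """Delete every character matched by the pattern."""
    if pattern is None:
        return text
    return pattern.sub("", text)


if __name__ == "__main__":
    print(remove_chars("◆ Item ● one ► go", build_pattern()))
```

Note that only the listed characters are deleted; surrounding whitespace is left in place, so a follow-up whitespace normalization step may be useful.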
Usage
Use this operator to strip decorative or non-semantic characters, which are common in web-scraped content, from text training data.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/remove_specific_chars_mapper.py
Signature
@OPERATORS.register_module("remove_specific_chars_mapper")
class RemoveSpecificCharsMapper(Mapper):
    def __init__(self, chars_to_remove: Union[str, List[str]] = "◆●■►▼▲▴∆▻▷❖♡□", *args, **kwargs):
Import
from data_juicer.ops.mapper.remove_specific_chars_mapper import RemoveSpecificCharsMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| chars_to_remove | Union[str, List[str]] | No | Characters to remove from text (default: decorative symbols) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with specified characters removed |
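To make the batched contract concrete, the following sketch shows the shape of a samples dictionary before and after processing. It is a stand-in written for illustration, not the operator itself: the function `process_batched_sketch` is hypothetical, and the column name `"text"` is assumed to be the text key.

```python
import re
from typing import Dict, List


def process_batched_sketch(
    samples: Dict[str, List], pattern: str, text_key: str = "text"
) -> Dict[str, List]:
    """Apply the removal to every text in a column-oriented batch.

    `samples` maps column names to equal-length lists; only the text
    column is rewritten, other columns pass through untouched.
    """
    samples[text_key] = [
        re.sub(pattern, "", text, flags=re.DOTALL) for text in samples[text_key]
    ]
    return samples


if __name__ == "__main__":
    batch = {"text": ["● a", "b ►\n■ c"], "meta": [1, 2]}
    print(process_batched_sketch(batch, "[●►■]"))
```

The batch is a dictionary of columns rather than a list of records, which is why the output type in the table above is a single Dict.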
Usage Examples
process:
  - remove_specific_chars_mapper:
      chars_to_remove: '◆●■►▼▲▴∆▻▷❖♡□'