Implementation:Datajuicer Data juicer PunctuationNormalizationMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for normalizing Unicode punctuation to its English equivalents.
Description
PunctuationNormalizationMapper is a mapper operator that normalizes Unicode punctuation characters in text samples to their ASCII/English equivalents, so that punctuation is formatted consistently across multilingual datasets. It maintains a hardcoded dictionary that maps over 30 Unicode punctuation characters (e.g., full-width commas, Chinese quotation marks, em dashes) to their English counterparts. The operator is batched: for each text in a batch, it iterates over the characters and replaces every character found in the dictionary, leaving all others unchanged.
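The character-level replacement described above can be illustrated with a minimal, standalone sketch. It does not import Data-Juicer; the names `PUNCTUATION_MAP` and `normalize_punctuation` are hypothetical, and the table is a small illustrative subset whose entries are assumptions rather than a copy of the operator's actual dictionary.

```python
# Illustrative subset only (assumed entries); the operator's real table
# is larger and its exact contents may differ.
PUNCTUATION_MAP = {
    "，": ",",   # full-width comma
    "。": ".",   # ideographic full stop
    "：": ":",   # full-width colon
    "？": "?",   # full-width question mark
    "！": "!",   # full-width exclamation mark
    "（": "(",   # full-width left parenthesis
    "）": ")",   # full-width right parenthesis
    "“": '"',    # left double quotation mark
    "”": '"',    # right double quotation mark
}


def normalize_punctuation(text: str) -> str:
    """Replace each character found in the table; keep all others as-is."""
    return "".join(PUNCTUATION_MAP.get(ch, ch) for ch in text)


print(normalize_punctuation("你好，世界！"))  # prints: 你好,世界!
```

A per-character dictionary lookup keeps the cost linear in the text length and allows a single character to expand into a multi-character replacement if the table calls for it.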
Usage
Use when processing multilingual training data where inconsistent punctuation encoding could introduce noise into language model training.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/punctuation_normalization_mapper.py
Signature
@OPERATORS.register_module("punctuation_normalization_mapper")
class PunctuationNormalizationMapper(Mapper):
    def __init__(self, *args, **kwargs):
Import
from data_juicer.ops.mapper.punctuation_normalization_mapper import PunctuationNormalizationMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Text samples containing Unicode punctuation to normalize |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with normalized punctuation |
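The contract above might look as follows for one batch. This is a sketch under stated assumptions, not the operator's code: it assumes a column-oriented batch (a dict mapping each field name to a list of values) and a text field named `text`; `normalize_batch` and `_TABLE` are hypothetical stand-ins.

```python
from typing import Dict, List

# Hypothetical two-entry table standing in for the operator's dictionary.
_TABLE = {"，": ",", "。": "."}


def normalize_batch(samples: Dict[str, List[str]],
                    text_key: str = "text") -> Dict[str, List[str]]:
    """Normalize every text in the batch; other fields are left untouched."""
    samples[text_key] = [
        "".join(_TABLE.get(ch, ch) for ch in text)
        for text in samples[text_key]
    ]
    return samples


batch = {"text": ["第一句。", "a，b"], "meta": ["x", "y"]}
result = normalize_batch(batch)
print(result["text"])
```

Only the text column is rewritten; the number of samples and all other fields are preserved.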
Usage Examples
process:
  - punctuation_normalization_mapper: