Implementation:Datajuicer Data juicer PunctuationNormalizationMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for normalizing Unicode punctuation to its English (ASCII) equivalents.

Description

PunctuationNormalizationMapper is a mapper operator that normalizes Unicode punctuation characters in text samples to their ASCII/English equivalents, so that punctuation is formatted consistently across multilingual datasets. It maintains a hardcoded dictionary that maps over 30 Unicode punctuation characters (e.g., full-width commas, Chinese quotation marks, em dashes) to their English counterparts. As a batched operation, it iterates over each character of every text in the batch and replaces the characters found in the dictionary.
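
The character-by-character lookup described above can be sketched in a few lines of plain Python. This is a minimal, standalone illustration and not the operator's actual source: the names PUNCTUATION_MAP and normalize_punctuation are hypothetical, and the table below is a small illustrative subset rather than the operator's full dictionary.

```python
# Illustrative subset only (assumption): the operator's real table is larger
# and its exact entries may differ.
PUNCTUATION_MAP = {
    "，": ",",  # full-width comma
    "。": ".",  # ideographic full stop
    "！": "!",  # full-width exclamation mark
    "？": "?",  # full-width question mark
    "“": '"',  # left double quotation mark
    "”": '"',  # right double quotation mark
    "（": "(",  # full-width left parenthesis
    "）": ")",  # full-width right parenthesis
}


def normalize_punctuation(text: str) -> str:
    """Replace each mapped character; leave every other character unchanged."""
    return "".join(PUNCTUATION_MAP.get(ch, ch) for ch in text)


print(normalize_punctuation("你好，世界！"))  # 你好,世界!
```

Because the lookup is per character, non-punctuation text (including non-Latin scripts) passes through untouched; only characters present as keys in the table are rewritten.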

Usage

Use this operator when processing multilingual training data in which inconsistent punctuation encoding could introduce noise into language model training.

Code Reference

Source Location

Signature

@OPERATORS.register_module("punctuation_normalization_mapper")
class PunctuationNormalizationMapper(Mapper):
    def __init__(self, *args, **kwargs):

Import

from data_juicer.ops.mapper.punctuation_normalization_mapper import PunctuationNormalizationMapper

I/O Contract

Inputs

Name Type Required Description
text str Yes Text samples containing Unicode punctuation to normalize

Outputs

Name Type Description
samples Dict Transformed samples with normalized punctuation
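
To make the contract above concrete, the following sketch shows one plausible shape for a batched call: a dictionary whose "text" entry holds a list of strings goes in, and the same dictionary comes back with that list normalized. It is a standalone stand-in under stated assumptions, not the Data-Juicer implementation: the function name process_batched, the text_key parameter, the column-oriented batch layout, and the three-entry table are all assumed for illustration.

```python
from typing import Dict, List

# Tiny illustrative table (assumption), not the operator's real dictionary.
_MAP = {"，": ",", "！": "!", "？": "?"}


def process_batched(
    samples: Dict[str, List[str]], text_key: str = "text"
) -> Dict[str, List[str]]:
    """Normalize every text in the batch and return the updated samples dict."""
    samples[text_key] = [
        "".join(_MAP.get(ch, ch) for ch in text) for text in samples[text_key]
    ]
    return samples


batch = {"text": ["好，吗？", "plain ascii"]}
out = process_batched(batch)
print(out["text"])
```

Only the text field is rewritten in this sketch; any other fields in the samples dictionary would be returned as they were passed in.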

Usage Examples

process:
  - punctuation_normalization_mapper:
