Implementation:Datajuicer Data juicer ChineseConvertMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for converting Chinese text between Simplified Chinese, Traditional Chinese, and Japanese Kanji.
Description
ChineseConvertMapper is a mapper operator that converts Chinese text between different writing systems using the OpenCC library. It supports 14 conversion modes including Simplified to Traditional (s2t), Traditional to Simplified (t2s), and variants for Taiwan, Hong Kong, and Japanese Kanji. The converter is lazily initialized as a global singleton and reused across calls, with the configuration updated only when the mode changes. It operates in batched mode for efficiency. It extends the Mapper base class.
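The lazy-singleton behavior described above can be sketched as follows. This is a minimal illustration, not the operator's actual source: `ToyConverter`, `get_converter`, `process_batched`, and the two-character lookup tables are hypothetical stand-ins for the OpenCC converter, chosen so the sketch runs without any dependency.

```python
from typing import Dict, List, Optional

# Hypothetical, deliberately tiny character tables standing in for OpenCC data.
_TOY_TABLES = {
    "s2t": {"汉": "漢", "语": "語"},
    "t2s": {"漢": "汉", "語": "语"},
}


class ToyConverter:
    """Hypothetical stand-in for an OpenCC converter object."""

    def __init__(self, mode: str) -> None:
        self.set_mode(mode)

    def set_mode(self, mode: str) -> None:
        # Swap the conversion table in place instead of building a new object.
        if mode not in _TOY_TABLES:
            raise ValueError(f"unsupported mode: {mode}")
        self.mode = mode
        self._table = str.maketrans(_TOY_TABLES[mode])

    def convert(self, text: str) -> str:
        return text.translate(self._table)


# Module-level singleton: stays None until the first conversion is requested.
_CONVERTER: Optional[ToyConverter] = None


def get_converter(mode: str = "s2t") -> ToyConverter:
    """Create the converter on first use; reconfigure it only if the mode changed."""
    global _CONVERTER
    if _CONVERTER is None:
        _CONVERTER = ToyConverter(mode)
    elif _CONVERTER.mode != mode:
        _CONVERTER.set_mode(mode)
    return _CONVERTER


def process_batched(samples: Dict[str, List[str]], mode: str = "s2t",
                    text_key: str = "text") -> Dict[str, List[str]]:
    """Convert every text in a column-oriented batch with the shared converter."""
    converter = get_converter(mode)
    samples[text_key] = [converter.convert(t) for t in samples[text_key]]
    return samples
```

Under these assumptions, repeated calls with the same mode return the same object, and a call with a different mode reconfigures that object rather than allocating a second one.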
Usage
Import this operator when you need to normalize Chinese text across different character variants in multilingual datasets.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/chinese_convert_mapper.py
Signature
@OPERATORS.register_module("chinese_convert_mapper")
class ChineseConvertMapper(Mapper):
    def __init__(self,
                 mode: str = "s2t",
                 *args, **kwargs):
Import
from data_juicer.ops.mapper.chinese_convert_mapper import ChineseConvertMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mode | str | No | Conversion mode. Options: s2t, t2s, s2tw, tw2s, s2hk, hk2s, s2twp, tw2sp, t2tw, tw2t, hk2t, t2hk, t2jp, jp2t. Default: "s2t" |
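One way to guard against a mistyped mode is to check it against the 14 options listed above before any conversion runs. The sketch below is illustrative only: `SUPPORTED_MODES` and `resolve_config` are hypothetical names, and raising `ValueError` is an assumption about how an invalid mode might be rejected, not a description of the operator's own behavior. The `<mode>.json` naming follows the configuration files that OpenCC ships.

```python
# The 14 conversion modes listed in the Inputs table.
SUPPORTED_MODES = (
    "s2t", "t2s", "s2tw", "tw2s", "s2hk", "hk2s", "s2twp",
    "tw2sp", "t2tw", "tw2t", "hk2t", "t2hk", "t2jp", "jp2t",
)


def resolve_config(mode: str = "s2t") -> str:
    """Return the OpenCC-style config file name for a mode (hypothetical helper)."""
    if mode not in SUPPORTED_MODES:
        # Assumption: reject unknown modes early with a descriptive error.
        raise ValueError(
            f"mode must be one of {', '.join(SUPPORTED_MODES)}; got {mode!r}"
        )
    return f"{mode}.json"
```

Validating up front keeps a typo such as `s2c` from surfacing later as a harder-to-read error inside the conversion library.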
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with converted Chinese text |
Usage Examples
YAML Configuration
process:
  - chinese_convert_mapper:
      mode: s2t