Implementation:Datajuicer Data juicer WhitespaceNormalizationMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for normalizing whitespace characters in text samples.
Description
WhitespaceNormalizationMapper normalizes whitespace characters of all kinds (tabs, newlines, non-breaking spaces, and other Unicode whitespace) to the standard ASCII space (0x20) and trims leading and trailing whitespace from each text sample. It iterates over the characters of the text and replaces every character that appears in the VARIOUS_WHITESPACES constant, a comprehensive list of Unicode whitespace characters, with a standard space character. The operator runs in batched mode for efficiency. Its whitespace character list is based on the one from the bigscience data-preparation project.
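The per-character replacement described above can be sketched in plain Python. This is a minimal illustration, not the operator's source: `ILLUSTRATIVE_WHITESPACES` is a hypothetical, deliberately small stand-in for the VARIOUS_WHITESPACES constant, and `normalize_whitespace` is a helper name made up for this example.

```python
# Hypothetical stand-in for VARIOUS_WHITESPACES; the real constant is larger.
ILLUSTRATIVE_WHITESPACES = {
    "\t",      # horizontal tab
    "\n",      # line feed
    "\u00a0",  # no-break space
    "\u2003",  # em space
    "\u3000",  # ideographic space
}


def normalize_whitespace(text: str) -> str:
    """Trim the text, then map every listed whitespace character to ' '."""
    text = text.strip()
    return "".join(
        " " if char in ILLUSTRATIVE_WHITESPACES else char for char in text
    )


print(repr(normalize_whitespace("\u00a0Data\tJuicer\u3000ops\n")))
# 'Data Juicer ops'
```

In this sketch each matching character is replaced one-for-one, so a run of several whitespace characters becomes a run of the same number of spaces rather than a single space.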
Usage
Use this operator when you need consistent whitespace encoding across training data. It prevents tokenization issues caused by non-standard whitespace characters such as tabs, newlines, and Unicode whitespace variants.
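As a small illustration of the tokenization problem, the snippet below runs a naive space-based split on two strings that render almost identically. The example strings and the space-based split are assumptions chosen for clarity. Real tokenizers work differently, but the mismatch has the same cause.

```python
# Two strings that look alike on screen but use different separators.
plain = "machine learning"
nbsp = "machine\u00a0learning"  # U+00A0 NO-BREAK SPACE instead of U+0020

# A naive tokenizer that splits on the ASCII space only.
tokens_plain = plain.split(" ")
tokens_nbsp = nbsp.split(" ")

# After mapping the no-break space to an ASCII space, both strings agree.
tokens_fixed = nbsp.replace("\u00a0", " ").split(" ")

print(tokens_plain)  # ['machine', 'learning']
print(tokens_nbsp)   # ['machine\xa0learning']
print(tokens_fixed)  # ['machine', 'learning']
```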
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/whitespace_normalization_mapper.py
Signature
@OPERATORS.register_module("whitespace_normalization_mapper")
class WhitespaceNormalizationMapper(Mapper):
    def __init__(self, *args, **kwargs):
Import
from data_juicer.ops.mapper.whitespace_normalization_mapper import WhitespaceNormalizationMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| sample[text_key] | str | Yes | The text content to normalize whitespace in |
Outputs
| Name | Type | Description |
|---|---|---|
| sample[text_key] | str | Text with all whitespace characters normalized to standard spaces and leading/trailing whitespace trimmed |
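To make the contract concrete, the sketch below applies the transformation to a batch laid out as a dictionary of columns, which is an assumption about how batched samples are represented. The `"text"` key, the `"meta"` column, the `SOME_WHITESPACES` subset, and the `process_batched` helper are all illustrative and not taken from the operator's source.

```python
from typing import Dict, List

# Illustrative subset only; not the operator's actual character list.
SOME_WHITESPACES = {"\t", "\n", "\u00a0", "\u2009"}


def process_batched(samples: Dict[str, List], text_key: str = "text") -> Dict[str, List]:
    """Normalize the text column in place; other columns are left untouched."""
    for idx, text in enumerate(samples[text_key]):
        text = text.strip()
        samples[text_key][idx] = "".join(
            " " if char in SOME_WHITESPACES else char for char in text
        )
    return samples


batch = {
    "text": ["  first\tsample ", "second\u00a0sample\n"],
    "meta": [{"id": 1}, {"id": 2}],
}
result = process_batched(batch)
print(result["text"])  # ['first sample', 'second sample']
```

Only the column named by `text_key` changes. Every other field in the batch passes through as-is, matching the single input and single output listed in the tables above.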
Usage Examples
process:
  - whitespace_normalization_mapper: