Implementation:Datajuicer Data juicer TextChunkMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for splitting text into chunks of configurable size.

Description

TextChunkMapper is a mapper operator that splits input text into smaller chunks based on configurable maximum length and/or split patterns, supporting overlapping chunks for context preservation. It can measure text length in characters or tokens (using HuggingFace, tiktoken, or dashscope tokenizers), split at occurrences of a configurable pattern (default: double newline), and enforce a maximum chunk length with configurable overlap between consecutive chunks.
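
The length-and-overlap behaviour described above can be pictured as a sliding window. The following is a minimal sketch, not Data-Juicer's actual implementation: the helper name `sliding_chunks` is hypothetical, it assumes each chunk starts `max_len - overlap_len` units after the previous one, and a whitespace split stands in for a real tokenizer.

```python
from typing import List, Sequence


def sliding_chunks(units: Sequence, max_len: int, overlap_len: int = 0) -> List[Sequence]:
    """Cut `units` (a str or a list of tokens) into windows of at most
    `max_len` units, where consecutive windows share `overlap_len` units.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    """
    if overlap_len >= max_len:
        raise ValueError("overlap_len must be smaller than max_len")
    step = max_len - overlap_len  # how far each new window advances
    chunks = []
    start = 0
    while start < len(units):
        chunks.append(units[start:start + max_len])
        if start + max_len >= len(units):
            break  # the current window already reaches the end
        start += step
    return chunks


# Character-based length: the text itself is the sequence of units.
char_chunks = sliding_chunks("abcdefghij", max_len=4, overlap_len=1)

# Token-based length: a whitespace split stands in for a real tokenizer.
tokens = "the quick brown fox jumps".split()
token_chunks = sliding_chunks(tokens, max_len=3, overlap_len=1)
```

Because slicing works the same way on strings and lists, the same window logic covers both character-based and token-based lengths; only the unit of measurement changes.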

Usage

Use when you need to prepare long documents for processing by context-limited models, support RAG pipelines, or create appropriately-sized training samples from long-form text.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/text_chunk_mapper.py

Signature

@OPERATORS.register_module("text_chunk_mapper")
class TextChunkMapper(Mapper):
    def __init__(
        self,
        max_len: Union[PositiveInt, None] = None,
        split_pattern: Union[str, None] = r"\n\n",
        overlap_len: NonNegativeInt = 0,
        tokenizer: Union[str, None] = None,
        trust_remote_code: bool = False,
        *args,
        **kwargs,
    ):

Import

from data_juicer.ops.mapper.text_chunk_mapper import TextChunkMapper

I/O Contract

Inputs

Name	Type	Required	Description
max_len	PositiveInt or None	No	Maximum length of each chunk; None means no length limit (default: None)
split_pattern	str or None	No	Regex pattern to split text at (default: "\n\n")
overlap_len	NonNegativeInt	No	Overlap length between consecutive chunks (default: 0)
tokenizer	str or None	No	Name of a HuggingFace, tiktoken, or dashscope tokenizer used to measure length in tokens; None measures length in characters (default: None)
trust_remote_code	bool	No	Whether to trust remote code of HF models (default: False)
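
One plausible way `split_pattern` and `max_len` interact is sketched below. This is an illustration under stated assumptions, not the operator's confirmed logic: it assumes the text is split at the pattern first and any piece that is still too long is then cut by length. The helper name `split_then_cut` is hypothetical, lengths are counted in characters, and overlap is left out to keep the sketch short.

```python
import re
from typing import List, Optional


def split_then_cut(text: str,
                   max_len: Optional[int] = None,
                   split_pattern: Optional[str] = r"\n\n") -> List[str]:
    """Split at `split_pattern` first, then force-cut pieces longer than `max_len`.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    """
    # Step 1: split at the pattern, or keep the text whole if no pattern is set.
    pieces = re.split(split_pattern, text) if split_pattern is not None else [text]
    pieces = [p for p in pieces if p]  # drop empty pieces

    if max_len is None:
        return pieces

    # Step 2: cut any piece that still exceeds the limit.
    chunks = []
    for piece in pieces:
        for start in range(0, len(piece), max_len):
            chunks.append(piece[start:start + max_len])
    return chunks


doc = "short para\n\nthis paragraph is longer"
by_pattern = split_then_cut(doc)                                  # pattern only
by_both = split_then_cut(doc, max_len=10)                         # pattern, then length
by_length = split_then_cut(doc, max_len=10, split_pattern=None)   # length only
```

With the pattern alone the document yields two paragraphs; adding `max_len=10` keeps the paragraph boundary and additionally cuts the longer paragraph into pieces of at most ten characters.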

Outputs

Name	Type	Description
samples	Dict	Transformed samples with text split into chunks
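
How one input sample might turn into several output samples can be sketched as follows. This assumes, without confirmation from this page, that the operator emits one output sample per chunk, copies the remaining fields unchanged, and works on column-oriented batches (a dict of lists). The names `expand_batch` and `chunker` are hypothetical.

```python
from typing import Callable, Dict, List


def expand_batch(batch: Dict[str, list],
                 chunker: Callable[[str], List[str]],
                 text_key: str = "text") -> Dict[str, list]:
    """Replace every row of `batch` with one row per chunk of its text.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    """
    out = {key: [] for key in batch}
    for row in range(len(batch[text_key])):
        for chunk in chunker(batch[text_key][row]):
            for key in batch:
                # The text column gets the chunk; other columns are copied as-is.
                out[key].append(chunk if key == text_key else batch[key][row])
    return out


batch = {"text": ["a\n\nb", "c"], "source": ["doc1", "doc2"]}
expanded = expand_batch(batch, chunker=lambda t: t.split("\n\n"))
```

Under these assumptions the output batch can contain more rows than the input batch, so downstream operators should not rely on a one-to-one correspondence between input and output samples.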

Usage Examples

process:
  - text_chunk_mapper:
      max_len: 512
      split_pattern: "\n\n"
      overlap_len: 50
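
To get a feel for what these settings mean in practice, the sketch below estimates how many chunks a text of a given length would produce. It assumes a plain sliding window in which each chunk advances by `max_len - overlap_len` units and no split pattern intervenes, so the real operator may produce a different count. `estimate_chunk_count` is a hypothetical helper.

```python
import math


def estimate_chunk_count(total_len: int, max_len: int, overlap_len: int = 0) -> int:
    """Estimate the number of sliding-window chunks for a text of `total_len` units.

    Hypothetical helper for illustration only; not part of Data-Juicer.
    """
    if total_len <= 0:
        return 0
    if total_len <= max_len:
        return 1  # the whole text fits in a single chunk
    step = max_len - overlap_len  # units of new text covered by each extra chunk
    return 1 + math.ceil((total_len - max_len) / step)


# Settings from the configuration above, applied to a 2000-unit text.
n_chunks = estimate_chunk_count(2000, max_len=512, overlap_len=50)
```

A larger `overlap_len` shrinks the step between chunks, so the same text yields more chunks and more duplicated content; that is the cost of the extra context each chunk carries.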

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
