Implementation:Datajuicer Data juicer TextChunkMapper

Knowledge Sources
Domains: Data_Processing, Mapping
Last Updated: 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for splitting text into chunks of configurable size.

Description

TextChunkMapper is a mapper operator that splits input text into smaller chunks. It can split at occurrences of a configurable regex pattern (default: double newline), enforce a maximum chunk length, or both. Consecutive chunks can overlap by a configurable amount to preserve context across chunk boundaries. Length is measured in characters, or in tokens when a HuggingFace, tiktoken, or dashscope tokenizer is named.
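The behavior described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual implementation: the function name `chunk_text` is hypothetical, length is counted in characters only, and dropping empty pieces and rejecting an overlap that is not smaller than the maximum length are assumptions made for the sketch.

```python
import re
from typing import List, Optional


def chunk_text(text: str,
               max_len: Optional[int] = None,
               split_pattern: Optional[str] = r"\n\n",
               overlap_len: int = 0) -> List[str]:
    """Split text at split_pattern, then cap each piece at max_len characters."""
    # Assumption: an overlap as large as the window would never advance.
    if max_len is not None and overlap_len >= max_len:
        raise ValueError("overlap_len must be smaller than max_len")

    # Step 1: split at the pattern, or keep the text whole if no pattern is given.
    pieces = re.split(split_pattern, text) if split_pattern else [text]
    pieces = [p for p in pieces if p]  # assumption: empty pieces are dropped

    if max_len is None:
        return pieces

    # Step 2: slide a window of max_len over each piece, stepping by
    # max_len - overlap_len so consecutive chunks share overlap_len characters.
    stride = max_len - overlap_len
    chunks = []
    for piece in pieces:
        start = 0
        while True:
            chunks.append(piece[start:start + max_len])
            if start + max_len >= len(piece):
                break
            start += stride
    return chunks
```

For example, `chunk_text("abcdefghij", max_len=4, split_pattern=None, overlap_len=1)` returns `["abcd", "defg", "ghij"]`: each chunk repeats the last character of the previous one.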

Usage

Use this operator to prepare long documents for context-limited models, to support RAG pipelines, or to create appropriately sized training samples from long-form text.

Code Reference

Source Location

Signature

@OPERATORS.register_module("text_chunk_mapper")
class TextChunkMapper(Mapper):
    def __init__(
        self,
        max_len: Union[PositiveInt, None] = None,
        split_pattern: Union[str, None] = r"\n\n",
        overlap_len: NonNegativeInt = 0,
        tokenizer: Union[str, None] = None,
        trust_remote_code: bool = False,
        *args,
        **kwargs,
    ):

Import

from data_juicer.ops.mapper.text_chunk_mapper import TextChunkMapper

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| max_len | PositiveInt or None | No | Maximum length of each chunk; None means no length limit (default: None) |
| split_pattern | str or None | No | Regex pattern to split text at (default: "\n\n") |
| overlap_len | NonNegativeInt | No | Overlap length between consecutive chunks (default: 0) |
| tokenizer | str or None | No | HuggingFace/tiktoken/dashscope tokenizer name for token-based length (default: None) |
| trust_remote_code | bool | No | Whether to trust remote code of HF models (default: False) |
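
The difference between character-based and token-based length can be illustrated with a small sketch. The names `measure_length` and `whitespace_tokenize` are hypothetical, and the whitespace splitter is only a stand-in for a real tokenizer: HuggingFace, tiktoken, and dashscope tokenizers produce subword tokens and will generally give different counts.

```python
from typing import Callable, List, Optional

# A tokenizer here is any callable that maps text to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for illustration only: splits on whitespace."""
    return text.split()


def measure_length(text: str, tokenize: Optional[Tokenizer] = None) -> int:
    """Return length in characters, or in tokens when a tokenizer is given."""
    if tokenize is None:
        return len(text)
    return len(tokenize(text))
```

With this sketch, `measure_length("hello world")` is 11 characters, while `measure_length("hello world", whitespace_tokenize)` is 2 tokens, so the same `max_len` admits far more text per chunk when a tokenizer is set.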

Outputs

| Name | Type | Description |
|------|------|-------------|
| samples | Dict | Transformed samples with text split into chunks |
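
Because one input text can yield several chunks, the output can hold more samples than the input. The sketch below shows one way this expansion could look. It assumes a columnar batch (a dict mapping each field name to a list of values), a text field named `text`, and that other fields are copied onto every chunk; the function `expand_batch` is hypothetical and is not taken from the Data-Juicer source.

```python
import re
from typing import Dict, List


def expand_batch(samples: Dict[str, List],
                 text_key: str = "text",
                 split_pattern: str = r"\n\n") -> Dict[str, List]:
    """Turn each input sample into one output sample per chunk."""
    out: Dict[str, List] = {key: [] for key in samples}
    for i, text in enumerate(samples[text_key]):
        # Split at the pattern and drop empty pieces (an assumption of this sketch).
        chunks = [c for c in re.split(split_pattern, text) if c]
        for chunk in chunks:
            for key in samples:
                # The text field receives the chunk; other fields are duplicated.
                out[key].append(chunk if key == text_key else samples[key][i])
    return out
```

Calling `expand_batch({"text": ["a\n\nb", "c"], "id": [1, 2]})` returns `{"text": ["a", "b", "c"], "id": [1, 1, 2]}`: the first sample becomes two samples that share the same `id`.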

Usage Examples

process:
  - text_chunk_mapper:
      max_len: 512
      split_pattern: "\n\n"
      overlap_len: 50
