Implementation:Datajuicer Data juicer TextChunkMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for splitting text into configurable chunks.
Description
TextChunkMapper is a mapper operator that splits input text into smaller chunks by a split pattern, a maximum chunk length, or both. Text is split at each occurrence of a configurable pattern (default: double newline), and a maximum chunk length can be enforced on the result. Length is measured in characters or, when a HuggingFace, tiktoken, or dashscope tokenizer is specified, in tokens. Consecutive chunks can overlap by a configurable length to preserve context across chunk boundaries.
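The interplay of `split_pattern`, `max_len`, and `overlap_len` can be illustrated with a minimal character-based sketch. This is not the operator's actual implementation: `chunk_text` is a hypothetical helper, and it assumes a plain sliding window (each piece is cut into windows of `max_len` characters that advance by `max_len - overlap_len`) and that `overlap_len` must be smaller than `max_len`. Token-based length is left out.

```python
import re
from typing import List, Optional


def chunk_text(text: str,
               max_len: Optional[int] = None,
               split_pattern: Optional[str] = r"\n\n",
               overlap_len: int = 0) -> List[str]:
    """Hypothetical helper: split on a regex, then cap each piece at max_len characters."""
    # Assumption: the overlap has to be smaller than the window, otherwise
    # the window would never advance.
    if max_len is not None and overlap_len >= max_len:
        raise ValueError("overlap_len must be smaller than max_len")

    # Step 1: split at the pattern, or keep the text whole if no pattern is given.
    pieces = re.split(split_pattern, text) if split_pattern else [text]
    if max_len is None:
        return pieces

    # Step 2: slide a window of max_len characters over each piece,
    # advancing by (max_len - overlap_len) so neighbours share overlap_len characters.
    step = max_len - overlap_len
    chunks: List[str] = []
    for piece in pieces:
        start = 0
        while True:
            chunks.append(piece[start:start + max_len])
            if start + max_len >= len(piece):
                break
            start += step
    return chunks


print(chunk_text("abcdefghij", max_len=4, split_pattern=None, overlap_len=1))
# ['abcd', 'defg', 'ghij']
```

In the printed result each chunk starts with the last character of the previous one, which is the one-character overlap requested.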
Usage
Use this operator to prepare long documents for context-limited models, to support RAG pipelines, or to create appropriately sized training samples from long-form text.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/text_chunk_mapper.py
Signature
@OPERATORS.register_module("text_chunk_mapper")
class TextChunkMapper(Mapper):
    def __init__(self, max_len: Union[PositiveInt, None] = None, split_pattern: Union[str, None] = r"\n\n", overlap_len: NonNegativeInt = 0, tokenizer: Union[str, None] = None, trust_remote_code: bool = False, *args, **kwargs):
Import
from data_juicer.ops.mapper.text_chunk_mapper import TextChunkMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| max_len | PositiveInt or None | No | Maximum length of each chunk, in characters or, when a tokenizer is set, in tokens; None means no length limit (default: None) |
| split_pattern | str or None | No | Regex pattern to split text at (default: "\n\n") |
| overlap_len | NonNegativeInt | No | Overlap length between consecutive chunks (default: 0) |
| tokenizer | str or None | No | HuggingFace/tiktoken/dashscope tokenizer name for token-based length (default: None) |
| trust_remote_code | bool | No | Whether to trust remote code of HF models (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with text split into chunks |
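Because one input text can yield several chunks, the output may contain more samples than the input. The sketch below shows one plausible shape of that transformation on a column-oriented batch (a dict of lists). It is an assumption for illustration, not the operator's confirmed behavior: `expand_batch` is a hypothetical helper, the `source` field is made up, and copying the other fields onto every chunk is assumed.

```python
import re
from typing import Callable, Dict, List


def expand_batch(samples: Dict[str, List],
                 text_key: str,
                 chunker: Callable[[str], List[str]]) -> Dict[str, List]:
    """Hypothetical helper: emit one output row per chunk of each input row."""
    out: Dict[str, List] = {key: [] for key in samples}
    for i, text in enumerate(samples[text_key]):
        for chunk in chunker(text):
            for key in samples:
                # The text column receives the chunk; every other column is
                # copied from the originating row (an assumption of this sketch).
                out[key].append(chunk if key == text_key else samples[key][i])
    return out


batch = {
    "text": ["first para\n\nsecond para", "single para"],
    "source": ["a.txt", "b.txt"],
}
result = expand_batch(batch, "text", lambda t: re.split(r"\n\n", t))
print(result["text"])    # ['first para', 'second para', 'single para']
print(result["source"])  # ['a.txt', 'a.txt', 'b.txt']
```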
Usage Examples
process:
  - text_chunk_mapper:
      max_len: 512
      split_pattern: "\n\n"
      overlap_len: 50
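When choosing `max_len` and `overlap_len`, it can help to estimate how many chunks a piece of text will produce. The following is a rough sizing sketch, assuming a plain sliding window that advances by `max_len - overlap_len` units and that `overlap_len` is smaller than `max_len`; `estimate_chunk_count` is a hypothetical helper, not part of Data-Juicer. The estimate applies to each piece remaining after the split pattern has been applied.

```python
import math


def estimate_chunk_count(text_len: int, max_len: int, overlap_len: int = 0) -> int:
    """Hypothetical helper: windows needed to cover text_len units of text."""
    if overlap_len >= max_len:
        raise ValueError("overlap_len must be smaller than max_len")
    if text_len <= max_len:
        return 1
    # The first window covers max_len units; each further window adds
    # (max_len - overlap_len) new units.
    step = max_len - overlap_len
    return 1 + math.ceil((text_len - max_len) / step)


# With the settings from the configuration above, a 2000-unit piece needs 5 chunks.
print(estimate_chunk_count(2000, max_len=512, overlap_len=50))  # 5
```

A larger overlap preserves more context across chunk boundaries but increases the chunk count, and with it the amount of duplicated text downstream.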