Implementation:Datajuicer Data juicer SentenceSplitMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for splitting text into individual sentences.

Description

SentenceSplitMapper is a mapper operator that splits text samples into individual sentences using an NLTK-based sentence tokenizer. It first prepares the tokenizer for the specified language (default: English) and applies the NLTK pickle security patch. It then runs as a batched operation, calling the tokenizer's tokenize method on each text sample and replacing the original text with the resulting list of sentences.
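The batched replacement step described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual source. The names split_batch, naive_tokenize, and the text_key parameter are hypothetical, and a simple regex splitter stands in for the NLTK tokenizer so the snippet runs without downloading NLTK data. In practice, the tokenize method of a loaded NLTK sentence tokenizer would presumably be passed in its place.

```python
import re
from typing import Callable, Dict, List

# Stand-in splitter (an assumption for illustration, NOT NLTK's algorithm):
# break after '.', '!' or '?' when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def naive_tokenize(text: str) -> List[str]:
    """Split text into sentences with a simple punctuation heuristic."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_batch(
    samples: Dict[str, list],
    tokenize: Callable[[str], List[str]] = naive_tokenize,
    text_key: str = "text",  # hypothetical name for the text column
) -> Dict[str, list]:
    """Replace each text in a column-oriented batch with its list of sentences."""
    samples[text_key] = [tokenize(text) for text in samples[text_key]]
    return samples


# A batch holds one list per field, with one entry per sample.
batch = {"text": ["Hello world. How are you?", "One sentence only."]}
result = split_batch(batch)
```

Passing the tokenizer as a callable keeps the batching logic independent of the language model, so a tokenizer for another language can be swapped in without touching split_batch.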

Usage

Use when you need to decompose documents or text samples into sentence-level units for downstream sentence-level processing, analysis, or filtering in the data pipeline.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/sentence_split_mapper.py

Signature

@OPERATORS.register_module("sentence_split_mapper")
class SentenceSplitMapper(Mapper):
    def __init__(self, lang: str = "en", *args, **kwargs):

Import

from data_juicer.ops.mapper.sentence_split_mapper import SentenceSplitMapper

I/O Contract

Inputs

Name	Type	Required	Description
lang	str	No	Language code for NLTK sentence tokenizer (default: "en")

Outputs

Name	Type	Description
samples	Dict	Transformed samples with text replaced by list of sentences

Usage Examples

process:
  - sentence_split_mapper:
      lang: "en"

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
