Implementation:Datajuicer Data juicer SentenceSplitMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for splitting text into individual sentences.

Description

SentenceSplitMapper is a mapper operator that splits text samples into individual sentences with an NLTK-based sentence tokenizer. It prepares the tokenizer for the specified language (default: English), applies the NLTK pickle security patch, and then calls the tokenizer's tokenize method on each text sample in a batched operation. The original text in each sample is replaced with the resulting list of sentences.
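The batched behaviour described above can be illustrated with a minimal sketch. It is not the Data-Juicer implementation: a naive regular-expression splitter stands in for the NLTK tokenizer so that the example runs without downloading any model, and the names naive_tokenize and split_batch, as well as the "text" key, are assumptions made for illustration only.

```python
import re
from typing import Dict, List

# Naive stand-in for the NLTK tokenizer (assumption, not the real model):
# split wherever whitespace follows a '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def naive_tokenize(text: str) -> List[str]:
    """Hypothetical helper: return the sentences found in one text."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_batch(samples: Dict[str, List[str]], text_key: str = "text") -> Dict[str, list]:
    """Hypothetical batched mapper step.

    `samples` is column-oriented: each key maps to a list with one entry
    per sample. The text column is replaced by one sentence list per sample.
    """
    out = dict(samples)  # shallow copy so the input batch is left untouched
    out[text_key] = [naive_tokenize(text) for text in samples[text_key]]
    return out


batch = {"text": ["Data is big. Cleaning helps!", "One sentence only"]}
result = split_batch(batch)
print(result["text"])
```

A regex splitter of this kind mishandles abbreviations such as "e.g." or "Dr.", which is the usual reason to prefer a trained tokenizer like the one the operator uses. Whether the real operator stores a Python list or a joined string in the text field should be checked against the Data-Juicer source.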

Usage

Use this operator when documents or text samples must be decomposed into sentence-level units for downstream processing, analysis, or filtering in the data pipeline.

Code Reference

Source Location

Signature

@OPERATORS.register_module("sentence_split_mapper")
class SentenceSplitMapper(Mapper):
    def __init__(self, lang: str = "en", *args, **kwargs):

Import

from data_juicer.ops.mapper.sentence_split_mapper import SentenceSplitMapper

I/O Contract

Inputs

Name | Type | Required | Description
lang | str | No | Language code for the NLTK sentence tokenizer (default: "en")

Outputs

Name | Type | Description
samples | Dict | Transformed samples with text replaced by list of sentences

Usage Examples

process:
  - sentence_split_mapper:
      lang: "en"
