Implementation:Datajuicer Data juicer SentenceSplitMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for splitting text samples into individual sentences.
Description
SentenceSplitMapper is a mapper operator that splits text samples into individual sentences using an NLTK-based sentence tokenizer. It prepares an NLTK sentence tokenizer for the specified language (default: English), applies the NLTK pickle security patch, and uses the tokenizer's tokenize method to split each text sample into sentences in a batched operation. The original text in each sample is replaced with a list of sentences.
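The batched splitting step described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual source: the names `split_batch` and `naive_tokenize` are hypothetical, the `"text"` key is an assumed default, and a simple regex splitter stands in for the NLTK tokenizer so the sketch runs without downloading any model data.

```python
import re
from typing import Callable, Dict, List

# Stand-in splitter (an assumption for this sketch): break after '.', '!' or '?'
# when followed by whitespace. The real operator uses an NLTK tokenizer instead.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def naive_tokenize(text: str) -> List[str]:
    """Split text into sentences with a simple punctuation rule."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_batch(
    samples: Dict[str, list],
    tokenize: Callable[[str], List[str]] = naive_tokenize,
    text_key: str = "text",  # hypothetical default key name
) -> Dict[str, list]:
    """Replace every text in a column-oriented batch with its sentences."""
    out = dict(samples)  # shallow copy so the input batch is left untouched
    out[text_key] = [tokenize(text) for text in samples[text_key]]
    return out


batch = {"text": ["Hello world. How are you?", "One sentence only."]}
result = split_batch(batch)
```

Because the tokenizer is passed in as a plain callable, any function with the same shape can be swapped in for `naive_tokenize`, for example the `tokenize` method of an NLTK Punkt sentence tokenizer.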
Usage
Use when you need to decompose documents or text samples into sentence-level units for downstream sentence-level processing, analysis, or filtering in the data pipeline.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/sentence_split_mapper.py
Signature
@OPERATORS.register_module("sentence_split_mapper")
class SentenceSplitMapper(Mapper):
    def __init__(self, lang: str = "en", *args, **kwargs):
Import
from data_juicer.ops.mapper.sentence_split_mapper import SentenceSplitMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| lang | str | No | Language code for NLTK sentence tokenizer (default: "en") |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with text replaced by list of sentences |
Usage Examples
process:
  - sentence_split_mapper:
      lang: "en"