
Implementation:Run llama Llama index SentenceSplitter Configuration

From Leeroopedia
Knowledge Sources
Domains Data_Preprocessing, RAG, NLP
Last Updated 2026-02-11 00:00 GMT

Overview

The SentenceSplitter is LlamaIndex's default text chunking implementation that splits text at sentence boundaries while respecting configurable chunk size and overlap constraints.

Description

SentenceSplitter extends MetadataAwareTextSplitter and splits text hierarchically. It first splits on paragraph_separator, then splits each paragraph into sentences using chunking_tokenizer_fn (by default, NLTK's PunktSentenceTokenizer from the nltk package). Consecutive sentences are then merged into chunks that fit within chunk_size tokens, with up to chunk_overlap tokens carried over from the end of one chunk into the next. Token counts come from the separate tokenizer argument. When a single sentence still exceeds the limit, it is split further: first by secondary_chunking_regex, then by separator, and finally character by character.
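The merging step can be sketched in plain Python without LlamaIndex. This is a simplified illustration, not the library's code: naive_sentences, count_tokens, and merge_sentences are hypothetical helpers, sentences are detected with a naive regex instead of NLTK, and a token is assumed to be a whitespace-separated word.

```python
import re
from typing import List


def naive_sentences(text: str) -> List[str]:
    """Hypothetical stand-in for a sentence tokenizer: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_tokens(text: str) -> int:
    """Hypothetical token counter: one token per whitespace-separated word."""
    return len(text.split())


def merge_sentences(sentences: List[str], chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily pack consecutive sentences into chunks of at most chunk_size tokens.

    Each new chunk is seeded with trailing sentences of the previous chunk,
    as long as they fit within chunk_overlap tokens. A sentence that is longer
    than chunk_size on its own is NOT subdivided in this sketch.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_len + n > chunk_size:
            # Close the current chunk.
            chunks.append(" ".join(current))
            # Collect trailing sentences that fit in the overlap budget.
            overlap: List[str] = []
            overlap_len = 0
            for prev in reversed(current):
                m = count_tokens(prev)
                if overlap_len + m > chunk_overlap:
                    break
                overlap.insert(0, prev)
                overlap_len += m
            current, current_len = overlap, overlap_len
            # Drop the overlap if it leaves no room for the next sentence.
            if current_len + n > chunk_size:
                current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = merge_sentences(naive_sentences(text), chunk_size=6, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With these settings each chunk holds two sentences and repeats the last sentence of the previous chunk, which is the effect chunk_overlap is meant to produce.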

The metadata-aware variant (split_text_metadata_aware) subtracts the token count of metadata_str from chunk_size and splits the text against that reduced budget, so that each chunk's text plus its metadata fits within chunk_size.
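The budget calculation behind the metadata-aware variant can be illustrated as follows. This is a sketch under stated assumptions: effective_chunk_size is a hypothetical helper, and tokens are counted by whitespace splitting rather than with the tokenizer the library would use.

```python
from typing import Callable, List


def effective_chunk_size(
    chunk_size: int,
    metadata_str: str,
    tokenizer: Callable[[str], List[str]] = str.split,
) -> int:
    """Return the token budget left for text once metadata is accounted for.

    Hypothetical helper: the whitespace tokenizer is an assumption made to
    keep the example self-contained.
    """
    metadata_tokens = len(tokenizer(metadata_str))
    remaining = chunk_size - metadata_tokens
    if remaining <= 0:
        # No room is left for the text itself.
        raise ValueError(
            f"Metadata uses {metadata_tokens} tokens, which leaves no room "
            f"within chunk_size={chunk_size}."
        )
    return remaining


budget = effective_chunk_size(512, "title: Report author: Jane")
print(budget)
```

Long metadata therefore shrinks every chunk of the document, which is worth keeping in mind when attaching large metadata fields to documents.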

Usage

Use SentenceSplitter as the default node parser for most RAG pipelines. Configure chunk_size and chunk_overlap based on your embedding model's context window and retrieval granularity requirements.

Code Reference

Source Location

  • Repository: llama_index
  • File: llama-index-core/llama_index/core/node_parser/text/sentence.py
  • Lines: L34-331

Signature

class SentenceSplitter(MetadataAwareTextSplitter):
    def __init__(
        self,
        separator: str = " ",
        chunk_size: int = DEFAULT_CHUNK_SIZE,
        chunk_overlap: int = SENTENCE_CHUNK_OVERLAP,
        tokenizer: Optional[Callable] = None,
        paragraph_separator: str = DEFAULT_PARAGRAPH_SEP,
        chunking_tokenizer_fn: Optional[Callable[[str], List[str]]] = None,
        secondary_chunking_regex: Optional[str] = None,
        include_metadata: bool = True,
        include_prev_next_rel: bool = True,
    ) -> None:

Key Methods

def split_text(self, text: str) -> List[str]:
    """Split text into chunks respecting sentence boundaries."""

def split_text_metadata_aware(
    self, text: str, metadata_str: str
) -> List[str]:
    """Split text accounting for metadata string length."""

Import

from llama_index.core.node_parser import SentenceSplitter

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| separator | str | No (default: " ") | String used to split an over-long sentence into words when it still exceeds chunk_size |
| chunk_size | int | No (default: DEFAULT_CHUNK_SIZE) | Maximum number of tokens per chunk |
| chunk_overlap | int | No (default: SENTENCE_CHUNK_OVERLAP) | Number of overlapping tokens between consecutive chunks |
| tokenizer | Optional[Callable] | No | Custom tokenizer function used to count tokens |
| paragraph_separator | str | No (default: "\n\n\n") | Separator used for paragraph-level splitting |
| chunking_tokenizer_fn | Optional[Callable[[str], List[str]]] | No | Custom function for splitting text into sentences |
| secondary_chunking_regex | Optional[str] | No | Regex pattern for secondary splitting when sentences are too long |
| include_metadata | bool | No (default: True) | Whether to include node metadata in output |
| include_prev_next_rel | bool | No (default: True) | Whether to set prev/next relationships between nodes |
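To show how secondary_chunking_regex and separator come into play for a sentence that is too long, here is a minimal sketch of a fallback cascade. It is illustrative only: split_long_sentence is a hypothetical helper, SECONDARY_REGEX is an example pattern chosen for this sketch rather than a confirmed library default, and tokens are again assumed to be whitespace-separated words.

```python
import re
from typing import List

# Example pattern (an assumption for this sketch): a run of text up to and
# including the next comma, period, or semicolon.
SECONDARY_REGEX = r"[^,.;]+[,.;]?"


def split_long_sentence(sentence: str, chunk_size: int, separator: str = " ") -> List[str]:
    """Break one over-long sentence into pieces of at most chunk_size tokens."""

    def n_tokens(text: str) -> int:
        return len(text.split())

    if n_tokens(sentence) <= chunk_size:
        return [sentence]

    # Step 1: split into clause-like pieces with the secondary regex.
    pieces = [p.strip() for p in re.findall(SECONDARY_REGEX, sentence) if p.strip()]

    result: List[str] = []
    for piece in pieces:
        if n_tokens(piece) <= chunk_size:
            result.append(piece)
            continue
        # Step 2: the piece is still too long, so split on the separator
        # and regroup the words into runs of chunk_size.
        words = [w for w in piece.split(separator) if w]
        for i in range(0, len(words), chunk_size):
            result.append(separator.join(words[i : i + chunk_size]))
    return result


pieces = split_long_sentence(
    "alpha beta gamma, delta epsilon zeta eta theta, iota.", chunk_size=3
)
for piece in pieces:
    print(piece)
```

A full splitter would then merge these small pieces back into chunks up to chunk_size; the sketch stops after the splitting stage.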

Outputs

| Name | Type | Description |
| --- | --- | --- |
| return (split_text) | List[str] | List of text chunks split at sentence boundaries |
| return (split_text_metadata_aware) | List[str] | List of text chunks accounting for metadata length |

Usage Examples

Basic Sentence Splitting

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=200,
)

# Split raw text
chunks = splitter.split_text("Long document text with many sentences...")

Using as a Pipeline Transformation

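from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Treat a blank line (two newlines) as the paragraph break
splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
    paragraph_separator="\n\n",
)

pipeline = IngestionPipeline(
    transformations=[splitter],
)

# Run the pipeline on documents to obtain the chunked nodes
documents = [Document(text="Long document text with many sentences...")]
nodes = pipeline.run(documents=documents)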
