Implementation:Run llama Llama index SentenceSplitter Configuration
| Knowledge Sources | |
|---|---|
| Domains | Data_Preprocessing, RAG, NLP |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
The SentenceSplitter is LlamaIndex's default text chunking implementation that splits text at sentence boundaries while respecting configurable chunk size and overlap constraints.
Description
SentenceSplitter extends MetadataAwareTextSplitter and performs sentence-aware splitting. Two separate callables are configurable: tokenizer counts tokens, and chunking_tokenizer_fn finds sentence boundaries (it defaults to NLTK's PunktSentenceTokenizer via the nltk package). The splitter first breaks text on the paragraph separator and then into sentences, and combines consecutive pieces into chunks that fit within the chunk_size limit. When a single sentence exceeds the limit, it falls back to splitting by the secondary regex pattern and then by the separator.
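The packing step described above can be illustrated with a minimal, library-free sketch. It is not the LlamaIndex implementation: count_tokens, split_sentences, and pack_sentences are hypothetical helpers, tokens are approximated by whitespace-separated words, and a simple regex stands in for Punkt. Oversized sentences are kept whole here, whereas the real class splits them further.

```python
import re
from typing import List


def count_tokens(text: str) -> int:
    """Assumed stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_sentences(text: str) -> List[str]:
    """Assumed stand-in for Punkt: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size tokens.

    When a chunk is closed, its trailing sentences are carried into the next
    chunk as long as they fit within chunk_overlap tokens.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    chunks: List[str] = []
    current: List[str] = []  # sentences in the chunk being built
    current_len = 0          # token count of `current`

    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            # Collect trailing sentences of the closed chunk as overlap.
            overlap: List[str] = []
            overlap_len = 0
            for prev in reversed(current):
                m = count_tokens(prev)
                if overlap_len + m > chunk_overlap:
                    break
                overlap.insert(0, prev)
                overlap_len += m
            # Drop overlap from the front if the new sentence would not fit.
            while overlap and overlap_len + n > chunk_size:
                overlap_len -= count_tokens(overlap.pop(0))
            current, current_len = overlap, overlap_len
        # A sentence longer than chunk_size is kept whole in this sketch.
        current.append(sentence)
        current_len += n

    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in pack_sentences(text, chunk_size=6, chunk_overlap=3):
    print(chunk)
```

With chunk_size=6 and chunk_overlap=3, each chunk holds two three-word sentences and shares one sentence with its neighbour.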
The metadata-aware variant (split_text_metadata_aware) accounts for metadata string length when calculating effective chunk size, ensuring the final node (text + metadata) fits within limits.
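The metadata adjustment amounts to subtracting the metadata's token count from chunk_size before splitting. The sketch below shows that arithmetic under stated assumptions: effective_chunk_size and whitespace_tokenizer are hypothetical names, and whitespace splitting stands in for the configured tokenizer.

```python
from typing import Callable, List


def whitespace_tokenizer(text: str) -> List[str]:
    """Assumed stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def effective_chunk_size(
    chunk_size: int,
    metadata_str: str,
    tokenizer: Callable[[str], List[str]] = whitespace_tokenizer,
) -> int:
    """Return the token budget left for text once metadata is accounted for."""
    metadata_len = len(tokenizer(metadata_str))
    remaining = chunk_size - metadata_len
    if remaining <= 0:
        # No room is left for any text, so splitting cannot proceed.
        raise ValueError(
            f"Metadata length ({metadata_len}) is not smaller than "
            f"chunk size ({chunk_size})."
        )
    return remaining


metadata = "title: Annual Report\nauthor: Jane Doe"
print(effective_chunk_size(512, metadata))  # 6 metadata tokens leave 506
```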
Usage
Use SentenceSplitter as the default node parser for most RAG pipelines. Configure chunk_size and chunk_overlap based on your embedding model's context window and retrieval granularity requirements.
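As a rough aid for choosing these values, the sketch below estimates how many chunks a document yields for a given chunk_size and chunk_overlap. It assumes a plain sliding window over tokens, which only approximates SentenceSplitter because real chunks end at sentence boundaries. estimate_chunk_count is a hypothetical helper, not part of LlamaIndex.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate chunk count for a sliding window with the given overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    # Each chunk after the first advances by (chunk_size - chunk_overlap) tokens.
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


# A 10,000-token document under two candidate configurations.
print(estimate_chunk_count(10_000, chunk_size=1024, chunk_overlap=200))  # 12
print(estimate_chunk_count(10_000, chunk_size=512, chunk_overlap=50))    # 22
```

Smaller chunks give finer retrieval granularity at the cost of more nodes to embed and store.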
Code Reference
Source Location
- Repository: llama_index
- File: llama-index-core/llama_index/core/node_parser/text/sentence.py
- Lines: L34-331
Signature
class SentenceSplitter(MetadataAwareTextSplitter):
    def __init__(
        self,
        separator: str = " ",
        chunk_size: int = DEFAULT_CHUNK_SIZE,
        chunk_overlap: int = SENTENCE_CHUNK_OVERLAP,
        tokenizer: Optional[Callable] = None,
        paragraph_separator: str = DEFAULT_PARAGRAPH_SEP,
        chunking_tokenizer_fn: Optional[Callable[[str], List[str]]] = None,
        secondary_chunking_regex: Optional[str] = None,
        include_metadata: bool = True,
        include_prev_next_rel: bool = True,
    ) -> None:
Key Methods
def split_text(self, text: str) -> List[str]:
    """Split text into chunks respecting sentence boundaries."""

def split_text_metadata_aware(
    self, text: str, metadata_str: str
) -> List[str]:
    """Split text accounting for metadata string length."""
Import
from llama_index.core.node_parser import SentenceSplitter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| separator | str | No (default: " ") | Character used for splitting within sentences when they exceed chunk_size |
| chunk_size | int | No (default: DEFAULT_CHUNK_SIZE) | Maximum number of tokens per chunk |
| chunk_overlap | int | No (default: SENTENCE_CHUNK_OVERLAP) | Number of overlapping tokens between consecutive chunks |
| tokenizer | Optional[Callable] | No | Custom tokenizer function for counting tokens |
| paragraph_separator | str | No (default: "\n\n\n") | Separator used for paragraph-level splitting |
| chunking_tokenizer_fn | Optional[Callable] | No | Custom function for splitting text into sentences |
| secondary_chunking_regex | Optional[str] | No | Regex pattern for secondary splitting when sentences are too long |
| include_metadata | bool | No (default: True) | Whether to take node metadata into account when splitting |
| include_prev_next_rel | bool | No (default: True) | Whether to set prev/next relationships between nodes |
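A custom chunking_tokenizer_fn only has to match the Callable[[str], List[str]] shape from the signature. The sketch below treats each non-empty line as one "sentence", which may suit logs or bullet-style text. line_tokenizer is a hypothetical function, and keeping the line endings is a precaution that assumes the splitter joins pieces back together without inserting whitespace.

```python
from typing import List


def line_tokenizer(text: str) -> List[str]:
    """Split text into non-empty lines, keeping each line's trailing newline."""
    return [line for line in text.splitlines(keepends=True) if line.strip()]


pieces = line_tokenizer("first entry\nsecond entry\n\nthird entry")
print(pieces)

# Hypothetical wiring, not executed here because it needs llama_index installed:
# splitter = SentenceSplitter(chunking_tokenizer_fn=line_tokenizer)
```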
Outputs
| Name | Type | Description |
|---|---|---|
| return (split_text) | List[str] | List of text chunks split at sentence boundaries |
| return (split_text_metadata_aware) | List[str] | List of text chunks accounting for metadata length |
Usage Examples
Basic Sentence Splitting
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=200,
)

# Split raw text
chunks = splitter.split_text("Long document text with many sentences...")
Using as a Pipeline Transformation
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline

splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
    paragraph_separator="\n\n",
)

pipeline = IngestionPipeline(
    transformations=[splitter],
)