Principle:Elevenlabs Elevenlabs python Text Chunking
| Knowledge Sources | |
|---|---|
| Domains | NLP, Streaming, Text_Processing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A buffering algorithm that splits a stream of text fragments into boundary-aligned chunks suitable for speech synthesis, so that the generated audio has more natural prosody.
Description
Text Chunking addresses a fundamental challenge in streaming TTS: text from sources like LLMs arrives in arbitrary fragments (words, partial words, tokens) that don't align with natural speech boundaries. Sending these fragments directly to TTS would produce unnatural prosody because the synthesis model needs sentence-level context to generate proper intonation.
The chunking algorithm buffers incoming text fragments and emits a chunk whenever a boundary is detected. Boundaries are identified by a fixed set of splitter characters: period, comma, question mark, exclamation mark, semicolon, colon, em dash, hyphen, parentheses, square brackets, the closing curly brace, and the space character. Each emitted chunk ends with a space to ensure clean concatenation.
This preprocessing step is critical for maintaining audio quality in realtime TTS pipelines.
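The boundary test and the trailing-space rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's code: the names `SPLITTERS`, `at_boundary`, and `with_trailing_space` are hypothetical, and the splitter set is copied from the algorithm listing in this document.

```python
# Splitter set as listed in this document's algorithm sketch.
SPLITTERS = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")


def at_boundary(buffer: str) -> bool:
    """Return True when the buffered text ends with a splitter character."""
    # str.endswith accepts a tuple and matches if any suffix in it matches.
    return buffer.endswith(SPLITTERS)


def with_trailing_space(chunk: str) -> str:
    """Append one space unless the chunk already ends with one (hypothetical helper)."""
    return chunk if chunk.endswith(" ") else chunk + " "


print(at_boundary("Hello,"))               # True: ends with a comma
print(at_boundary("Hello"))                # False: no splitter at the end yet
print(repr(with_trailing_space("Hi.")))    # 'Hi. '
```

Passing the whole tuple to `str.endswith` keeps the check to a single call rather than a loop over the splitter characters.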
Usage
Use this principle whenever streaming text to a TTS system. The text chunker should sit between the text source (LLM, user input, etc.) and the WebSocket TTS endpoint. It is automatically applied inside convert_realtime but can also be used independently for custom streaming pipelines.
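As a rough sketch of that placement in a custom pipeline, the example below puts a chunker between a fragment source and a sender. Everything here is an assumption made for illustration: `chunk_text` is a local re-implementation of the algorithm described on this page, `fake_llm_tokens` stands in for an LLM stream, and `send` is a placeholder for whatever writes to the WebSocket TTS endpoint.

```python
from typing import Callable, Iterable, Iterator

SPLITTERS = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")


def chunk_text(fragments: Iterable[str]) -> Iterator[str]:
    """Local sketch of the chunker; every emitted chunk ends with a space."""
    buffer = ""
    for fragment in fragments:
        if buffer.endswith(SPLITTERS):
            yield buffer if buffer.endswith(" ") else buffer + " "
            buffer = fragment
        elif fragment.startswith(SPLITTERS):
            out = buffer + fragment[0]
            yield out if out.endswith(" ") else out + " "
            buffer = fragment[1:]
        else:
            buffer += fragment
    if buffer:
        yield buffer if buffer.endswith(" ") else buffer + " "


def fake_llm_tokens() -> Iterator[str]:
    """Hypothetical text source that emits arbitrary fragments."""
    yield from ["Hel", "lo, ", "how ", "are ", "you?"]


def stream_to_tts(fragments: Iterable[str], send: Callable[[str], None]) -> None:
    """Run the chunker between the text source and the sender (placeholder sink)."""
    for chunk in chunk_text(fragments):
        send(chunk)


sent: list = []
stream_to_tts(fake_llm_tokens(), sent.append)
print(sent)  # ['Hello, ', 'how ', 'are ', 'you? ']
```

Because both stages are generators, chunks are forwarded as soon as a boundary is seen rather than after the whole text has arrived.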
Theoretical Basis
The algorithm maintains a buffer and applies a greedy split-at-boundary strategy:
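    # Abstract algorithm, written as a runnable Python sketch
    SPLITTERS = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")

    def with_space(chunk):
        # Each emitted chunk ends with a space (see Description)
        return chunk if chunk.endswith(" ") else chunk + " "

    def chunk_text(text_stream):
        buffer = ""
        for fragment in text_stream:
            if buffer.endswith(SPLITTERS):
                yield with_space(buffer)                # Emit at boundary
                buffer = fragment
            elif fragment.startswith(SPLITTERS):
                yield with_space(buffer + fragment[0])  # Include boundary char
                buffer = fragment[1:]
            else:
                buffer += fragment                      # Continue buffering
        if buffer:
            yield with_space(buffer)                    # Flush remaining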
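This ensures each yielded chunk, apart from the final flush, ends at a splitter character, so the TTS model receives whole words, clauses, or sentences on which to apply appropriate prosody. Because the space character is itself a splitter, a chunk can be as short as a single word when fragments arrive token by token.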
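To see which branch fires for each fragment, the following sketch records the state after every step. It is an illustrative variant rather than the library's implementation: `trace_chunker` and `with_trailing_space` are hypothetical names, and the trailing-space rule from the Description is applied to each emitted chunk.

```python
from typing import Iterable, Iterator, List, Tuple

SPLITTERS = (".", ",", "?", "!", ";", ":", "—", "-", "(", ")", "[", "]", "}", " ")


def with_trailing_space(chunk: str) -> str:
    """Append one space unless the chunk already ends with one."""
    return chunk if chunk.endswith(" ") else chunk + " "


def trace_chunker(fragments: Iterable[str]) -> Iterator[Tuple[str, List[str], str]]:
    """Yield (fragment, chunks emitted for it, buffer afterwards) per step."""
    buffer = ""
    for fragment in fragments:
        emitted: List[str] = []
        if buffer.endswith(SPLITTERS):
            # Branch 1: the buffer already ends at a boundary.
            emitted.append(with_trailing_space(buffer))
            buffer = fragment
        elif fragment.startswith(SPLITTERS):
            # Branch 2: the fragment opens with a boundary character.
            emitted.append(with_trailing_space(buffer + fragment[0]))
            buffer = fragment[1:]
        else:
            # Branch 3: no boundary yet, keep buffering.
            buffer += fragment
        yield fragment, emitted, buffer
    if buffer:
        # Final flush of whatever is left once the stream ends.
        yield "<end>", [with_trailing_space(buffer)], ""


for step in trace_chunker(["Hel", "lo", " wor", "ld.", " Bye"]):
    print(step)
```

In this run, `"Hello "` is emitted when `" wor"` arrives (branch 2), `"world. "` is emitted when `" Bye"` arrives (branch 1), and `" Bye "` comes out of the final flush.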