Implementation:Neuml Txtai TextToSpeech
| Knowledge Sources | |
|---|---|
| Domains | Audio, Speech_Synthesis |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
TextToSpeech is an ONNX-based text-to-speech pipeline that generates speech audio from text input using ESPnet, Kokoro, or SpeechT5 model backends.
Description
The TextToSpeech class inherits from Pipeline and provides a unified interface for converting text into speech waveforms. It automatically detects the appropriate backend (ESPnet, Kokoro, or SpeechT5) based on the files present in the model repository. The pipeline supports batching for texts longer than the model's maximum token limit, streaming audio generation for integration with LLM output, optional audio encoding to formats like WAV or FLAC, and sample rate resampling.
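The backend detection described above can be pictured with a minimal sketch. It is not the library's code: `detect_backend` is a hypothetical helper, the file names come from the Inputs table on this page, and the check order (ESPnet first, then Kokoro) is an assumption.

```python
def detect_backend(files):
    """Guess a backend name from the file names found in a model repository.

    Hypothetical helper for illustration only. The precedence used here
    (config.yaml before voices.json) is an assumption.
    """
    names = set(files)

    # ESPnet exports ship a config.yaml next to the ONNX weights
    if "config.yaml" in names:
        return "espnet"

    # Kokoro exports ship a voices.json holding the speaker voices
    if "voices.json" in names:
        return "kokoro"

    # Anything else is treated as SpeechT5
    return "speecht5"


print(detect_backend(["model.onnx", "config.yaml"]))
print(detect_backend(["model.onnx", "voices.json"]))
print(detect_backend(["encoder_model.onnx"]))
```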
Usage
Use TextToSpeech to synthesize spoken audio from text strings. Typical applications include voice assistants, audiobook generation, and accessibility features. Streaming mode is particularly useful for speaking LLM output as it is generated.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/pipeline/audio/texttospeech.py
- Lines: 1-553
Signature
class TextToSpeech(Pipeline):
    def __init__(self, path=None, maxtokens=512, rate=22050):
        """
        Creates a new TextToSpeech pipeline.

        Args:
            path: optional model path
            maxtokens: maximum number of tokens model can process, defaults to 512
            rate: target sample rate, defaults to 22050
        """

    def __call__(self, text, stream=False, speaker=1, encoding=None, **kwargs):
        """
        Generates speech from text.

        Args:
            text: text|list
            stream: stream response if True, defaults to False
            speaker: speaker id, defaults to 1
            encoding: optional audio encoding format
            kwargs: additional keyword args

        Returns:
            list of (audio, sample rate) or list of audio depending on encoding parameter
        """
Import
from txtai.pipeline import TextToSpeech
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | No | Model path or Hugging Face model identifier. Defaults to "neuml/ljspeech-jets-onnx". Backend is auto-detected from model files (config.yaml for ESPnet, voices.json for Kokoro, otherwise SpeechT5). |
| maxtokens | int | No | Maximum number of tokens the model can process in a single batch. Defaults to 512. Texts exceeding this are split into batches. |
| rate | int | No | Target audio sample rate in Hz. Defaults to 22050. Output audio is resampled to this rate if different from the model's native rate. |
| text | str or list | Yes | Input text or list of texts to convert to speech. Strings return a single result; lists return a list of results. |
| stream | bool | No | If True, yields audio snippets incrementally as a generator. Designed for streaming LLM integration. Defaults to False. |
| speaker | int or str | No | Speaker identifier for multi-speaker models. Defaults to 1. |
| encoding | str | No | Audio encoding format (e.g., "wav", "flac"). When set, returns encoded audio bytes instead of raw NumPy arrays. |
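As a rough illustration of the maxtokens limit, the sketch below splits a token sequence into consecutive batches that each fit within the limit. `batch_tokens` is a hypothetical helper, and fixed-size slicing is an assumption; the actual pipeline may choose its split points differently.

```python
def batch_tokens(tokens, maxtokens=512):
    """Split a token sequence into consecutive batches of at most maxtokens items.

    Hypothetical helper for illustration; fixed-size slicing is an assumption.
    """
    if maxtokens < 1:
        raise ValueError("maxtokens must be a positive integer")

    return [tokens[i:i + maxtokens] for i in range(0, len(tokens), maxtokens)]


# 1200 tokens with the default limit give batches of 512, 512 and 176 tokens
batches = batch_tokens(list(range(1200)))
print([len(batch) for batch in batches])
```

Each batch would then be synthesized separately and the resulting waveforms joined in order.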
Outputs
| Name | Type | Description |
|---|---|---|
| result | tuple(numpy.ndarray, int) | A tuple of (audio waveform as NumPy array, sample rate) when encoding is None. |
| result | bytes | Encoded audio bytes when encoding parameter is specified (e.g., "wav"). |
| result | generator | When stream=True, yields audio results incrementally for each sentence segment. |
Usage Examples
Basic Usage
from txtai.pipeline import TextToSpeech
# Create pipeline with default ESPnet model
tts = TextToSpeech()
# Generate speech from a single string
audio, rate = tts("Hello, welcome to txtai!")
print(f"Audio shape: {audio.shape}, Sample rate: {rate}")
# Generate speech as encoded WAV bytes
wav_bytes = tts("This is encoded audio.", encoding="wav")
# Generate speech for multiple texts
results = tts(["First sentence.", "Second sentence."])
for audio, rate in results:
    print(f"Audio shape: {audio.shape}, Sample rate: {rate}")
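When working with the raw (audio, rate) result rather than encoding="wav", the waveform can be written out with the standard library alone. The sketch below assumes a mono waveform of floats in the range [-1.0, 1.0]; `write_wav` is a hypothetical helper and not part of txtai.

```python
import io
import struct
import wave


def write_wav(target, samples, rate):
    """Write mono float samples as 16-bit PCM WAV to a path or file-like object.

    Hypothetical helper; assumes samples are floats in [-1.0, 1.0].
    """
    # Clamp each sample, scale it to the signed 16-bit range, pack little-endian
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, float(x))) * 32767))
        for x in samples
    )

    with wave.open(target, "wb") as output:
        output.setnchannels(1)   # mono
        output.setsampwidth(2)   # 2 bytes per sample = 16-bit
        output.setframerate(rate)
        output.writeframes(frames)


# Placeholder samples stand in for the audio array returned by the pipeline
buffer = io.BytesIO()
write_wav(buffer, [0.0, 0.5, -0.5, 1.0], 22050)
print(len(buffer.getvalue()), "bytes written")
```

Passing a file name instead of the in-memory buffer writes the same data to disk.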
Streaming Usage
from txtai.pipeline import TextToSpeech
tts = TextToSpeech()
# Stream audio from incremental text (e.g., LLM output tokens)
tokens = ["Hello", " world.", "\n", "How ", "are ", "you ", "today."]
for audio_chunk in tts(tokens, stream=True):
    audio, rate = audio_chunk
    # Process or play each audio chunk incrementally
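One way to picture how incremental tokens could be grouped into speakable segments is the generator below. It assumes that a standalone newline token marks a segment boundary and that leftover text is flushed at the end of the stream; `segments` is a hypothetical helper, and the pipeline's actual buffering may differ.

```python
def segments(tokens):
    """Group streamed tokens into text segments, flushing on a newline token.

    Hypothetical helper; the newline boundary rule is an assumption.
    """
    buffer = []
    for token in tokens:
        buffer.append(token)

        # A newline token closes the current segment
        if token == "\n":
            yield "".join(buffer)
            buffer = []

    # Flush any text left over when the stream ends
    if buffer:
        yield "".join(buffer)


tokens = ["Hello", " world.", "\n", "How ", "are ", "you ", "today."]
for segment in segments(tokens):
    print(repr(segment))
```

Under these assumptions, the token list from the example above produces two segments, so audio for the first line can be played while later tokens are still arriving.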
Custom Model and Speaker
from txtai.pipeline import TextToSpeech
# Use a Kokoro model with a specific speaker voice
tts = TextToSpeech(path="hexgrad/Kokoro-82M", maxtokens=510, rate=24000)
audio, rate = tts("Synthesized speech with a custom voice.", speaker="af_heart")