Implementation:Neuml Txtai TextToSpeech

Knowledge Sources	Neuml_Txtai
Domains	Audio, Speech_Synthesis
Last Updated	2026-02-09 17:00 GMT

Overview

TextToSpeech is an ONNX-based text-to-speech pipeline that generates speech audio from text input using ESPnet, Kokoro, or SpeechT5 model backends.

Description

The TextToSpeech class inherits from Pipeline and provides a unified interface for converting text into speech waveforms. It detects the backend (ESPnet, Kokoro, or SpeechT5) automatically from the files present in the model repository. The pipeline splits text that exceeds the model's maximum token limit into batches, can stream audio as text arrives (for example from LLM output), can optionally encode audio to formats such as WAV or FLAC, and resamples output to a target sample rate.
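
The file-based detection rule listed under Inputs (config.yaml for ESPnet, voices.json for Kokoro, otherwise SpeechT5) can be pictured with a small standalone sketch. The function name detect_backend and the order of the checks are assumptions made for illustration; this is not the txtai implementation.

```python
def detect_backend(files):
    """Hypothetical helper: pick a backend name from repository file names."""
    names = set(files)
    if "config.yaml" in names:
        # An ESPnet export is assumed to ship a config.yaml
        return "espnet"
    if "voices.json" in names:
        # A Kokoro export is assumed to ship a voices.json
        return "kokoro"
    # Anything else falls through to SpeechT5
    return "speecht5"


print(detect_backend(["model.onnx", "config.yaml"]))  # espnet
print(detect_backend(["model.onnx", "voices.json"]))  # kokoro
print(detect_backend(["encoder_model.onnx"]))         # speecht5
```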

Usage

Use TextToSpeech to synthesize spoken audio from text strings, for example in voice assistants, audiobook generation, and accessibility features. Streaming mode is particularly useful when speech must be generated while an LLM is still producing text.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/pipeline/audio/texttospeech.py
Lines: 1-553

Signature

class TextToSpeech(Pipeline):
    def __init__(self, path=None, maxtokens=512, rate=22050):
        """
        Creates a new TextToSpeech pipeline.

        Args:
            path: optional model path
            maxtokens: maximum number of tokens model can process, defaults to 512
            rate: target sample rate, defaults to 22050
        """

    def __call__(self, text, stream=False, speaker=1, encoding=None, **kwargs):
        """
        Generates speech from text.

        Args:
            text: text|list
            stream: stream response if True, defaults to False
            speaker: speaker id, defaults to 1
            encoding: optional audio encoding format
            kwargs: additional keyword args

        Returns:
            list of (audio, sample rate) or list of audio depending on encoding parameter
        """

Import

from txtai.pipeline import TextToSpeech

I/O Contract

Inputs

Name	Type	Required	Description
path	str	No	Model path or Hugging Face model identifier. Defaults to "neuml/ljspeech-jets-onnx". Backend is auto-detected from model files (config.yaml for ESPnet, voices.json for Kokoro, otherwise SpeechT5).
maxtokens	int	No	Maximum number of tokens the model can process in a single batch. Defaults to 512. Texts exceeding this are split into batches.
rate	int	No	Target audio sample rate in Hz. Defaults to 22050. Output audio is resampled to this rate if different from the model's native rate.
text	str or list	Yes	Input text or list of texts to convert to speech. Strings return a single result; lists return a list of results.
stream	bool	No	If True, yields audio snippets incrementally as a generator. Designed for streaming LLM integration. Defaults to False.
speaker	int or str	No	Speaker identifier for multi-speaker models. Defaults to 1.
encoding	str	No	Audio encoding format (e.g., "wav", "flac"). When set, returns encoded audio bytes instead of raw NumPy arrays.
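
To illustrate the maxtokens behavior described in the table, the sketch below splits a token sequence into batches of at most maxtokens items. The helper name batch is hypothetical, and fixed-size slicing is an assumption; the actual pipeline may choose its split points differently.

```python
def batch(tokens, maxtokens=512):
    """Hypothetical helper: split tokens into chunks of at most maxtokens."""
    return [tokens[i:i + maxtokens] for i in range(0, len(tokens), maxtokens)]


# 1200 placeholder token ids stand in for a long tokenized text
tokens = list(range(1200))
chunks = batch(tokens, maxtokens=512)
print([len(chunk) for chunk in chunks])  # [512, 512, 176]
```

Each chunk would then be synthesized separately and the resulting audio joined in order.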

Outputs

Name	Type	Description
result	tuple(numpy.ndarray, int)	Tuple of (audio waveform as a NumPy array, sample rate) when encoding is None. A list input returns a list of these tuples.
result	bytes	Encoded audio bytes when the encoding parameter is set (e.g., "wav"). A list input returns a list of encoded results.
result	generator	When stream=True, yields audio results incrementally for each sentence segment.
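
The difference between a raw (waveform, sample rate) result and an encoded result can be shown with the Python standard library alone. The helper to_wav_bytes below is hypothetical and assumes float samples in the range [-1.0, 1.0]; it is a sketch of what WAV encoding involves, not the encoder txtai uses.

```python
import io
import math
import struct
import wave


def to_wav_bytes(samples, rate=22050):
    """Hypothetical helper: pack float samples as 16-bit mono WAV bytes."""
    # Clip to the valid range, then scale to signed 16-bit integers
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    pcm = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample (16-bit)
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# One second of a 440 Hz tone stands in for synthesized speech
rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
data = to_wav_bytes(tone, rate)
print(data[:4], len(data))
```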

Usage Examples

Basic Usage

from txtai.pipeline import TextToSpeech

# Create pipeline with default ESPnet model
tts = TextToSpeech()

# Generate speech from a single string
audio, rate = tts("Hello, welcome to txtai!")
print(f"Audio shape: {audio.shape}, Sample rate: {rate}")

# Generate speech as encoded WAV bytes
wav_bytes = tts("This is encoded audio.", encoding="wav")

# Generate speech for multiple texts
results = tts(["First sentence.", "Second sentence."])
for audio, rate in results:
    print(f"Audio shape: {audio.shape}, Sample rate: {rate}")

Streaming Usage

from txtai.pipeline import TextToSpeech

tts = TextToSpeech()

# Stream audio from incremental text (e.g., LLM output tokens)
tokens = ["Hello", " world.", "\n", "How ", "are ", "you ", "today."]
for audio_chunk in tts(tokens, stream=True):
    audio, rate = audio_chunk
    # Process or play each audio chunk incrementally
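
The buffering idea behind streaming can be sketched without the library: incoming tokens are collected until a segment boundary is seen, then the buffered text is released for synthesis. The generator name segments and the choice of a newline as the boundary are assumptions for illustration; the pipeline's own segmentation rule may differ.

```python
def segments(tokens):
    """Hypothetical helper: group streamed tokens into text segments."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if "\n" in token:
            # Boundary reached: emit the buffered text and start over
            text = "".join(buffer).strip()
            buffer = []
            if text:
                yield text

    # Emit whatever remains once the token stream ends
    text = "".join(buffer).strip()
    if text:
        yield text


tokens = ["Hello", " world.", "\n", "How ", "are ", "you ", "today."]
print(list(segments(tokens)))  # ['Hello world.', 'How are you today.']
```

Each yielded segment corresponds to one unit of text that would be synthesized and played while later tokens are still arriving.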

Custom Model and Speaker

from txtai.pipeline import TextToSpeech

# Use a Kokoro model with a named speaker voice. The model repository must
# contain voices.json for the Kokoro backend to be detected.
tts = TextToSpeech(path="hexgrad/Kokoro-82M", maxtokens=510, rate=24000)
audio, rate = tts("Synthesized speech with a custom voice.", speaker="af_heart")

Related Pages

Principle:Neuml_Txtai_Speech_Synthesis
