Implementation:Neuml Txtai TextToSpeech

From Leeroopedia


Knowledge Sources
Domains: Audio, Speech_Synthesis
Last Updated: 2026-02-09 17:00 GMT

Overview

TextToSpeech is an ONNX-based text-to-speech pipeline that generates speech audio from text input using ESPnet, Kokoro, or SpeechT5 model backends.

Description

The TextToSpeech class inherits from Pipeline and provides a unified interface for converting text into speech waveforms. It detects the backend (ESPnet, Kokoro, or SpeechT5) automatically from the files present in the model repository. The pipeline splits text that exceeds the model's maximum token limit into batches, can stream audio incrementally as LLM output arrives, optionally encodes audio to formats such as WAV or FLAC, and resamples output to a target sample rate.
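The file-based backend detection can be illustrated with a small function. This is a minimal sketch, not txtai's implementation: `detect_backend` is a hypothetical helper, the file names follow the Inputs table (config.yaml for ESPnet, voices.json for Kokoro), and the repository's file names are assumed to be available as a plain list.

```python
def detect_backend(files):
    """
    Hypothetical helper (not part of txtai): picks a backend name from the
    file names found in a model repository.
    """
    names = set(files)

    # config.yaml marks an ESPnet model
    if "config.yaml" in names:
        return "espnet"

    # voices.json marks a Kokoro model
    if "voices.json" in names:
        return "kokoro"

    # Anything else falls through to SpeechT5
    return "speecht5"


print(detect_backend(["model.onnx", "config.yaml"]))
print(detect_backend(["model.onnx", "voices.json"]))
print(detect_backend(["model.onnx"]))
```

The order of the checks matters only if a repository contains both marker files; this sketch assumes ESPnet is checked first, as the Inputs table lists it first.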

Usage

Use TextToSpeech to synthesize spoken audio from text, for example in voice assistants, audiobook generation, and accessibility features. Streaming mode is particularly useful when paired with real-time LLM text generation.

Code Reference

Source Location

Signature

class TextToSpeech(Pipeline):
    def __init__(self, path=None, maxtokens=512, rate=22050):
        """
        Creates a new TextToSpeech pipeline.

        Args:
            path: optional model path
            maxtokens: maximum number of tokens model can process, defaults to 512
            rate: target sample rate, defaults to 22050
        """

    def __call__(self, text, stream=False, speaker=1, encoding=None, **kwargs):
        """
        Generates speech from text.

        Args:
            text: text|list
            stream: stream response if True, defaults to False
            speaker: speaker id, defaults to 1
            encoding: optional audio encoding format
            kwargs: additional keyword args

        Returns:
            list of (audio, sample rate) or list of audio depending on encoding parameter
        """

Import

from txtai.pipeline import TextToSpeech

I/O Contract

Inputs

Name | Type | Required | Description
path | str | No | Model path or Hugging Face model identifier. Defaults to "neuml/ljspeech-jets-onnx". Backend is auto-detected from model files (config.yaml for ESPnet, voices.json for Kokoro, otherwise SpeechT5).
maxtokens | int | No | Maximum number of tokens the model can process in a single batch. Defaults to 512. Texts exceeding this are split into batches.
rate | int | No | Target audio sample rate in Hz. Defaults to 22050. Output audio is resampled to this rate if different from the model's native rate.
text | str or list | Yes | Input text or list of texts to convert to speech. Strings return a single result; lists return a list of results.
stream | bool | No | If True, yields audio snippets incrementally as a generator. Designed for streaming LLM integration. Defaults to False.
speaker | int or str | No | Speaker identifier for multi-speaker models. Defaults to 1.
encoding | str | No | Audio encoding format (e.g., "wav", "flac"). When set, returns encoded audio bytes instead of raw NumPy arrays.
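The maxtokens splitting can be pictured as cutting a token sequence into fixed-size windows. The sketch below is an illustration under assumptions, not the pipeline's own code: `batch` is a hypothetical helper, tokens are stand-in integers, and the real pipeline may reserve space for special tokens or split on different boundaries.

```python
def batch(tokens, maxtokens=512):
    """
    Hypothetical helper: yields consecutive slices of at most maxtokens items.
    """
    if maxtokens < 1:
        raise ValueError("maxtokens must be positive")

    # Step through the sequence one window at a time
    for start in range(0, len(tokens), maxtokens):
        yield tokens[start:start + maxtokens]


# 1200 stand-in token ids split with the default limit of 512
tokens = list(range(1200))
sizes = [len(chunk) for chunk in batch(tokens)]
print(sizes)  # [512, 512, 176]
```

Each window would then be synthesized separately, and the per-window audio presumably joined into one waveform for the input text.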

Outputs

Name | Type | Description
result | tuple(numpy.ndarray, int) | A tuple of (audio waveform as NumPy array, sample rate) when encoding is None.
result | bytes | Encoded audio bytes when encoding parameter is specified (e.g., "wav").
result | generator | When stream=True, yields audio results incrementally for each sentence segment.
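When the raw (audio, sample rate) tuple is kept instead of using the encoding parameter, the waveform can still be written out with the standard library alone. The following is a sketch under stated assumptions: `to_wav_bytes` is a hypothetical helper, and the audio is assumed to be mono with float samples in the range [-1.0, 1.0], which this page does not confirm.

```python
import io
import struct
import wave


def to_wav_bytes(samples, rate):
    """
    Hypothetical helper: packs mono float samples (assumed to lie in
    [-1.0, 1.0]) into 16-bit PCM WAV bytes.
    """
    # Clip to the valid range, then scale to signed 16-bit integers
    pcm = [int(max(-1.0, min(1.0, float(x))) * 32767) for x in samples]

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))

    return buffer.getvalue()


# Works with any 1-D sequence of floats, including a NumPy array
data = to_wav_bytes([0.0, 0.5, -0.5, 0.0], 22050)
print(len(data))
```

Passing encoding="wav" to the pipeline avoids this step entirely; the helper is only relevant when the raw array is needed first, for example for further processing.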

Usage Examples

Basic Usage

from txtai.pipeline import TextToSpeech

# Create pipeline with default ESPnet model
tts = TextToSpeech()

# Generate speech from a single string
audio, rate = tts("Hello, welcome to txtai!")
print(f"Audio shape: {audio.shape}, Sample rate: {rate}")

# Generate speech as encoded WAV bytes
wav_bytes = tts("This is encoded audio.", encoding="wav")

# Generate speech for multiple texts
results = tts(["First sentence.", "Second sentence."])
for audio, rate in results:
    print(f"Audio shape: {audio.shape}, Sample rate: {rate}")

Streaming Usage

from txtai.pipeline import TextToSpeech

tts = TextToSpeech()

# Stream audio from incremental text (e.g., LLM output tokens)
tokens = ["Hello", " world.", "\n", "How ", "are ", "you ", "today."]
for audio_chunk in tts(tokens, stream=True):
    audio, rate = audio_chunk
    # Process or play each audio chunk incrementally
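
To make the streaming behavior more concrete, the sketch below groups incremental tokens into speakable segments before synthesis. It is an illustration only: `segments` is a hypothetical helper, and the flush rule (a newline token or a token ending in a period) is an assumption that may differ from the rule the pipeline actually applies.

```python
def segments(tokens):
    """
    Hypothetical helper: groups incremental text tokens into speakable
    segments, flushing on a newline or a sentence-ending period.
    """
    buffer = []
    for token in tokens:
        buffer.append(token)

        # Flush when a segment boundary is reached
        if token == "\n" or token.strip().endswith("."):
            text = "".join(buffer).strip()
            buffer = []
            if text:
                yield text

    # Flush whatever is left when the token stream ends
    text = "".join(buffer).strip()
    if text:
        yield text


tokens = ["Hello", " world.", "\n", "How ", "are ", "you ", "today."]
print(list(segments(tokens)))  # ['Hello world.', 'How are you today.']
```

Buffering up to a sentence boundary lets audio start playing before the full LLM response is available, while still giving the model complete phrases to synthesize.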

Custom Model and Speaker

from txtai.pipeline import TextToSpeech

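# Use a Kokoro model with a specific speaker voice.
# The path must point to a repository with ONNX model files and a voices.json,
# since that file is what triggers Kokoro backend detection.
tts = TextToSpeech(path="hexgrad/Kokoro-82M", maxtokens=510, rate=24000)
audio, rate = tts("Synthesized speech with a custom voice.", speaker="af_heart")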
