Principle:Neuml Txtai Speech Synthesis
| Knowledge Sources | |
|---|---|
| Domains | Text_To_Speech, Audio_Processing |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Speech Synthesis is txtai's text-to-speech pipeline that converts text into audio waveforms using ONNX-optimized models, supporting multiple voices, languages, and streaming output for real-time audio generation.
Description
txtai's TextToSpeech pipeline provides a unified interface for converting text into spoken audio. The pipeline wraps multiple TTS model architectures behind a common callable interface:
- ESPnet -- Open-source end-to-end speech processing toolkit with multilingual support
- Kokoro -- Lightweight high-quality TTS optimized for fast inference
- SpeechT5 -- Microsoft's unified speech-text model with speaker adaptation capabilities
Each model is loaded in ONNX format for optimized inference, so serving does not require the full PyTorch runtime and inference remains practical on CPU-only machines.
The TTS pipeline follows a multi-stage architecture. First, the input text undergoes text normalization: numbers are expanded to words, abbreviations are resolved, and punctuation is converted to pause markers. Next, the normalized text is fed to an acoustic model that generates a mel spectrogram -- a time-frequency representation of the speech signal. The acoustic model conditions on a speaker embedding vector that encodes voice characteristics (pitch, timbre, speaking rate), enabling multi-voice synthesis from a single model. Finally, a vocoder (typically HiFi-GAN or a similar neural waveform generator) converts the mel spectrogram into a raw audio waveform at the target sample rate (typically 22050 Hz).
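The text normalization stage described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abbreviation table, the pause tokens `<short_pause>` and `<long_pause>`, and the function names are hypothetical and are not taken from txtai. Only integers up to three digits are expanded.

```python
import re

# Hypothetical lookup tables; a production normalizer covers far more cases.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand abbreviations and small integers, then mark pauses."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Expand standalone integers of one to three digits.
    text = re.sub(r"\b\d{1,3}\b",
                  lambda match: number_to_words(int(match.group())), text)
    # Map punctuation to hypothetical pause tokens.
    text = re.sub(r"\s*,\s*", " <short_pause> ", text)
    text = re.sub(r"\s*[.!?]+\s*", " <long_pause> ", text)
    return text.strip()


print(normalize("Dr. Smith has 42 cats, really."))
```

Abbreviations are expanded before punctuation is mapped so that the period in "Dr." is not mistaken for a sentence end.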
The pipeline supports both batch and streaming modes. In batch mode, the entire input text is synthesized at once and returned as a NumPy array or written to a WAV file. In streaming mode, the text is split into sentence-level chunks, each chunk is synthesized independently, and the resulting audio segments are yielded as they become available, so playback of a long text can begin before synthesis of the whole text has finished. Speaker voice is selected by passing a speaker embedding or speaker ID, which lets an application offer multiple distinct voices from a single model deployment.
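The difference between the two modes can be sketched with a Python generator. This is not txtai's implementation: `fake_synthesize` is a hypothetical stand-in that returns one silent sample per character, and the sentence splitter is a simple rule-based assumption.

```python
import re
from typing import Callable, Iterator, List

Audio = List[float]


def split_sentences(text: str) -> List[str]:
    """Rule-based split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def synthesize_batch(text: str, synthesize: Callable[[str], Audio]) -> Audio:
    """Batch mode: one call, the caller waits for the full waveform."""
    return synthesize(text)


def synthesize_stream(text: str,
                      synthesize: Callable[[str], Audio]) -> Iterator[Audio]:
    """Streaming mode: yield one audio chunk per sentence as it is ready."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)


def fake_synthesize(text: str) -> Audio:
    """Hypothetical stand-in model: one silent sample per character."""
    return [0.0] * len(text)


for chunk in synthesize_stream("Hello there. How are you? Fine!", fake_synthesize):
    print(len(chunk))
```

Because `synthesize_stream` is a generator, the consumer can start playing the first chunk while later sentences are still being synthesized.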
Model selection is handled through configuration parameters at pipeline construction time. The pipeline auto-detects the model architecture from the model path or repository name and loads the appropriate ONNX session with the matching tokenizer and vocoder. Users can also specify explicit model components (acoustic model path, vocoder path, speaker embeddings file) for custom configurations that mix components from different sources.
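One way such auto-detection could work is a keyword match on the model path, with explicit settings taking precedence. The sketch below is purely illustrative: `TTSConfig`, `detect_architecture`, the keyword table, the `espnet` fallback, and the `example/...` repository names are all hypothetical and do not describe txtai's actual detection logic.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from a substring of the model path to an architecture.
ARCHITECTURE_KEYWORDS = {
    "speecht5": "speecht5",
    "kokoro": "kokoro",
    "espnet": "espnet",
}


@dataclass
class TTSConfig:
    """Hypothetical construction-time settings."""
    path: str
    architecture: Optional[str] = None            # explicit override
    vocoder_path: Optional[str] = None            # custom vocoder component
    speaker_embeddings_path: Optional[str] = None  # custom speaker file


def detect_architecture(config: TTSConfig, default: str = "espnet") -> str:
    """Prefer an explicit setting, else infer from the path, else fall back."""
    if config.architecture:
        return config.architecture
    name = config.path.lower()
    for keyword, architecture in ARCHITECTURE_KEYWORDS.items():
        if keyword in name:
            return architecture
    return default


print(detect_architecture(TTSConfig(path="example/kokoro-onnx")))
```

Checking the explicit field first keeps custom configurations predictable even when the path name happens to contain a misleading keyword.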
Usage
Use the Speech Synthesis pipeline when you need to generate audio from text for accessibility features, voice assistants, audiobook generation, or any application requiring spoken output. Choose ESPnet models for multilingual support and research-grade quality, Kokoro for lightweight fast synthesis, or SpeechT5 for English-focused tasks with speaker adaptation. Enable streaming mode for interactive applications where latency matters more than batch throughput. The pipeline integrates with txtai's workflow system, enabling chains like retrieve documents -> summarize -> synthesize speech.
Theoretical Basis
1. TTS Pipeline Stages: Modern neural TTS systems decompose speech synthesis into three stages:
- Text analysis -- Text normalization, grapheme-to-phoneme conversion, and prosody prediction, which together transform raw text into a phoneme sequence with duration and pitch annotations
- Acoustic modeling -- Generating a mel spectrogram from the phoneme sequence, typically using an autoregressive (Tacotron) or non-autoregressive (FastSpeech) architecture
- Vocoding -- Converting the mel spectrogram to a time-domain waveform using a neural vocoder (WaveNet, WaveGlow, or HiFi-GAN)
2. ONNX Runtime Optimization: By exporting TTS models to ONNX (Open Neural Network Exchange) format, txtai leverages ONNX Runtime's graph-level optimizations: operator fusion (merging adjacent operations such as MatMul+Add into a single kernel), constant folding (pre-computing static subgraphs at load time), and hardware-specific execution (for example AVX-512 kernels on CPU, CUDA on NVIDIA GPUs, CoreML on Apple Silicon). The speedup over PyTorch eager mode depends on the model and hardware; a 2-4x gain is typical, which makes real-time synthesis feasible on commodity hardware.
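Constant folding can be illustrated on a toy expression graph. This sketch is not ONNX Runtime code: the tuple-based graph encoding and the `fold_constants` and `evaluate` helpers are assumptions made only to show the idea that subgraphs with no runtime inputs can be computed once, ahead of inference.

```python
import operator

# Supported binary operators in this toy graph.
OPS = {"add": operator.add, "mul": operator.mul}


def fold_constants(node):
    """Replace every subgraph whose operands are all constants by one constant.

    A node is ("const", value), ("input", name) or (op, left, right).
    """
    kind = node[0]
    if kind in ("const", "input"):
        return node
    left = fold_constants(node[1])
    right = fold_constants(node[2])
    if left[0] == "const" and right[0] == "const":
        return ("const", OPS[kind](left[1], right[1]))
    return (kind, left, right)


def evaluate(node, inputs):
    """Evaluate a graph given a dict of runtime input values."""
    kind = node[0]
    if kind == "const":
        return node[1]
    if kind == "input":
        return inputs[node[1]]
    return OPS[kind](evaluate(node[1], inputs), evaluate(node[2], inputs))


# x * 2 + (3 * 4): the right branch does not depend on any input.
graph = ("add",
         ("mul", ("input", "x"), ("const", 2.0)),
         ("mul", ("const", 3.0), ("const", 4.0)))
folded = fold_constants(graph)
print(folded)
```

The folded graph gives the same result as the original for any input but performs one multiplication fewer on every call.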
3. Speaker Embeddings for Voice Cloning: Multi-speaker TTS models condition the acoustic model on a fixed-dimensional speaker embedding vector e_s (typically 256 or 512 dimensions) that captures voice identity. During synthesis, e_s is concatenated with or added to the hidden states of the acoustic model at each time step:
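h_t = f(h_{t-1}, x_t, e_s)
where h_t is the hidden state at time step t, x_t is the input (for example a phoneme encoding) at that step, and f is the acoustic model's update function.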
By extracting speaker embeddings from reference audio using a speaker verification model, the system can clone new voices without retraining the TTS model. The quality of voice cloning depends on the diversity of speakers in the training data and the expressiveness of the embedding space.
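The two conditioning options mentioned above, concatenation and addition, can be sketched with plain Python lists. The function names and the tiny two-dimensional vectors are illustrative assumptions; real models use embeddings with hundreds of dimensions and learned projections.

```python
from typing import List

Vector = List[float]


def condition_by_concat(hidden_states: List[Vector],
                        speaker: Vector) -> List[Vector]:
    """Append the same speaker embedding to the hidden state of every step."""
    return [state + speaker for state in hidden_states]


def condition_by_add(hidden_states: List[Vector],
                     speaker: Vector) -> List[Vector]:
    """Add the speaker embedding element-wise; dimensions must match."""
    for state in hidden_states:
        if len(state) != len(speaker):
            raise ValueError("hidden state and speaker embedding differ in size")
    return [[h + e for h, e in zip(state, speaker)] for state in hidden_states]


hidden = [[1.0, 2.0], [3.0, 4.0]]   # two time steps, hidden size 2
speaker = [0.5, -0.5]               # toy speaker embedding e_s

print(condition_by_concat(hidden, speaker))
print(condition_by_add(hidden, speaker))
```

Concatenation widens each hidden state and leaves the model to learn how to use the extra dimensions, while addition keeps the width fixed but requires the embedding to match the hidden size.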
4. Mel Spectrogram Representation: The mel spectrogram is a 2D matrix of shape (T, n_mels) where T is the number of time frames and n_mels (typically 80) is the number of mel-frequency bins. The mel scale is a perceptual frequency scale that approximates the human auditory system's non-linear frequency response, spacing lower frequencies more densely than higher frequencies. Log-scaled mel spectrograms serve as the standard intermediate representation between acoustic models and vocoders.
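The denser spacing at low frequencies can be checked numerically. The sketch below assumes the commonly used HTK-style formula m = 2595 * log10(1 + f / 700); the source does not say which mel variant the models use, and the upper limit of 11025 Hz is simply the Nyquist frequency for a 22050 Hz sample rate.

```python
import math
from typing import List


def hz_to_mel(frequency_hz: float) -> float:
    """HTK-style mel scale (an assumed variant)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels: int = 80,
                           f_min: float = 0.0,
                           f_max: float = 11025.0) -> List[float]:
    """Center frequencies in Hz of n_mels bins equally spaced on the mel scale."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_mels + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_mels)]


centers = mel_center_frequencies()
gaps = [b - a for a, b in zip(centers, centers[1:])]
print(f"first gap: {gaps[0]:.1f} Hz, last gap: {gaps[-1]:.1f} Hz")
```

The gap between neighboring bin centers grows steadily with frequency, which is the non-linear spacing the paragraph describes.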
5. Streaming Chunked Synthesis: For streaming output, the input text is segmented at sentence boundaries using rule-based or model-based sentence detection. Each segment is synthesized independently, producing an audio chunk of variable duration. Chunks are joined with short crossfade windows (typically 10-50 ms) to avoid audible clicks at boundaries. The first chunk is ready for playback after the time needed to synthesize a single sentence, typically 100-500 ms depending on model size and hardware.
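A linear crossfade between two consecutive chunks might look like the sketch below. It is an illustration under stated assumptions, not txtai's implementation: the function names are hypothetical, audio is represented as a plain list of floats, and the overlapping samples are blended, so the joined signal is shorter than the two chunks combined by the overlap length.

```python
from typing import List

Audio = List[float]


def overlap_samples(milliseconds: float, sample_rate: int = 22050) -> int:
    """Convert a crossfade duration to a whole number of samples."""
    return int(sample_rate * milliseconds / 1000)


def crossfade(previous: Audio, following: Audio, overlap: int) -> Audio:
    """Join two chunks, blending the end of one into the start of the next."""
    overlap = min(overlap, len(previous), len(following))
    if overlap == 0:
        return previous + following
    head = previous[:-overlap]
    tail = following[overlap:]
    start = len(previous) - overlap
    blended = []
    for i in range(overlap):
        # Fade-in weight rises linearly and stays strictly between 0 and 1.
        weight = (i + 1) / (overlap + 1)
        blended.append((1.0 - weight) * previous[start + i]
                       + weight * following[i])
    return head + blended + tail


joined = crossfade([1.0] * 4, [0.0] * 4, overlap=2)
print(joined)
print(overlap_samples(20))
```

At 22050 Hz, a 20 ms window corresponds to 441 samples, which is short enough to be inaudible as a fade but long enough to smooth a discontinuity at the boundary.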