Implementation:Ggml org Llama cpp TTS Outetts
| Knowledge Sources | |
|---|---|
| Domains | Text_To_Speech, Client |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Python-based TTS client that communicates with two llama.cpp server instances (an LLM server and an audio decoder server) to generate speech audio from text using OuteTTS models.
Description
This script sends HTTP requests to the llama.cpp server API to generate audio code tokens from a text prompt using the OuteTTS model, then requests decoder embeddings for those codes. The `embd_to_audio` function converts the embeddings to audio via an inverse STFT: the frequency-domain data (magnitude and phase) is combined into complex STFT frames, each frame is transformed to the time domain with `irfft` and multiplied by a Hann window, and overlap-add (`fold`) reconstructs the time-domain audio signal. The `save_wav` function writes PCM WAV files, and `process_text` normalizes input text for the model. Frames are processed in parallel with `ThreadPoolExecutor` for performance.
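The inverse-STFT steps described above can be sketched with NumPy as follows. This is a minimal illustration, not the script's code: the names `hann_periodic`, `overlap_add`, and `istft_sketch` are hypothetical, and dividing by the summed squared window is a common normalization step that is assumed here rather than taken from the source.

```python
import numpy as np

def hann_periodic(size):
    # Periodic Hann window: take size + 1 symmetric points and drop the last.
    return np.hanning(size + 1)[:-1]

def overlap_add(frames, n_hop):
    # frames has shape (n_frames, n_win); frame i starts at sample i * n_hop.
    n_frames, n_win = frames.shape
    out = np.zeros((n_frames - 1) * n_hop + n_win)
    for i, frame in enumerate(frames):
        out[i * n_hop : i * n_hop + n_win] += frame
    return out

def istft_sketch(spec, n_fft, n_hop):
    # spec has shape (n_frames, n_fft // 2 + 1) and holds one-sided complex spectra.
    window = hann_periodic(n_fft)
    # Back to the time domain, one frame per row, then apply the window.
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
    audio = overlap_add(frames, n_hop)
    # Summed squared windows (assumed normalization) undo the window weighting.
    envelope = overlap_add(np.tile(window ** 2, (len(spec), 1)), n_hop)
    # Trim the padded edges, where the envelope approaches zero.
    n_pad = (n_fft - n_hop) // 2
    end = len(audio) - n_pad
    return audio[n_pad:end] / envelope[n_pad:end]
```

With a frame size of 1280 and a hop of 320 samples, for example, each output sample in the trimmed region is covered by four overlapping frames.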
Usage
Use this script as a Python-based TTS client that communicates with two running llama.cpp server instances: one for LLM token generation and one for audio decoder embedding. It demonstrates server-mode usage of TTS models and is useful for integration with Python applications.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/tts/tts-outetts.py
- Lines: 1-299
Signature
def fill_hann_window(size, periodic=True):
    """Generate a Hann window of the given size."""
def irfft(n_fft, complex_input):
    """Compute inverse real FFT."""
def fold(buffer, n_out, n_win, n_hop, n_pad):
    """Overlap-add reconstruction from windowed frames."""
def process_frame(args):
    """Process a single STFT frame (for multithreaded execution)."""
def embd_to_audio(embd, n_codes, n_embd, n_thread=4):
    """Convert model embeddings to audio via inverse STFT."""
def save_wav(filename, audio_data, sample_rate):
    """Write audio data to a WAV file."""
def process_text(text: str):
    """Normalize and tokenize input text for the OuteTTS model."""
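A per-frame worker that takes a single packed argument, like `process_frame(args)` above, fits naturally with `ThreadPoolExecutor.map`. The sketch below shows one way such a worker could be dispatched. The names `process_one_frame` and `frames_in_parallel` and the layout of the argument tuple are hypothetical, chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def process_one_frame(args):
    # All inputs travel in one tuple so the function works with Executor.map.
    index, n_fft, spectra, window = args
    frame = np.fft.irfft(spectra[index], n=n_fft) * window
    # Also return the squared window, which a later normalization step may need.
    return frame, window * window

def frames_in_parallel(spectra, n_fft, window, n_thread=4):
    tasks = [(i, n_fft, spectra, window) for i in range(len(spectra))]
    with ThreadPoolExecutor(max_workers=n_thread) as pool:
        # map() yields results in task order, so frames stay in sequence.
        results = list(pool.map(process_one_frame, tasks))
    frames = np.stack([r[0] for r in results])
    squared_windows = np.stack([r[1] for r in results])
    return frames, squared_windows
```

Because `map` preserves input order, the frames can be passed straight to an overlap-add step without re-sorting.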
Import
import sys
import requests
import re
import struct
import numpy as np
from concurrent.futures import ThreadPoolExecutor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| host_llm | string (CLI arg) | Yes | URL of the llama.cpp LLM server (e.g., http://localhost:8080) |
| host_dec | string (CLI arg) | Yes | URL of the llama.cpp decoder server for embeddings |
| text | string (CLI arg) | Yes | Input text to convert to speech |
| embd | array of float | Yes | Model output embeddings (n_codes x n_embd) for audio conversion |
| n_codes | int | Yes | Number of spectrogram frames |
| n_embd | int | Yes | Embedding dimension per frame |
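To make the shape of `embd` concrete, the sketch below turns a flat `n_codes x n_embd` array into complex spectra. It rests on assumptions not stated in the table: that the first half of each row holds log-magnitudes and the second half holds phases, and that magnitudes are clipped to an upper bound. The function name `embd_to_spectrum` and the `max_mag` parameter are hypothetical.

```python
import numpy as np

def embd_to_spectrum(embd, n_codes, n_embd, max_mag=1e2):
    # Assumption: each row is [log-magnitudes | phases], two halves of equal length.
    embd = np.asarray(embd, dtype=np.float32).reshape(n_codes, n_embd)
    half = n_embd // 2
    log_mag = embd[:, :half]
    phase = embd[:, half : 2 * half]
    # Exponentiate to get magnitudes and clip to avoid extreme values (assumed bound).
    mag = np.clip(np.exp(log_mag), 0.0, max_mag)
    # Polar form to complex: mag * (cos(phase) + i * sin(phase)).
    return mag * np.exp(1j * phase)
```

Each row of the result is one complex STFT frame with `n_embd // 2` frequency bins, ready for an inverse real FFT.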
Outputs
| Name | Type | Description |
|---|---|---|
| output.wav | file | Generated speech audio as a 16-bit PCM WAV file at a 24 kHz sample rate |
| embd_to_audio (return) | numpy array | Reconstructed time-domain audio samples |
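For reference, a mono 16-bit PCM WAV file like the one described above can be written with only `struct` and NumPy. The following is a minimal sketch under stated assumptions: the function name `write_pcm16_wav` is hypothetical, and the input is assumed to be floating-point samples nominally in the range [-1, 1].

```python
import struct

import numpy as np

def write_pcm16_wav(filename, audio, sample_rate):
    # Scale floats to 16-bit integers, clipping anything out of range.
    pcm = np.clip(np.asarray(audio) * 32767, -32768, 32767).astype("<i2")
    data = pcm.tobytes()
    n_channels, bits = 1, 16
    block_align = n_channels * bits // 8
    byte_rate = sample_rate * block_align
    # Canonical 44-byte header: RIFF chunk, "fmt " subchunk (PCM = 1), "data" subchunk.
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(data), b"WAVE",
        b"fmt ", 16, 1, n_channels, sample_rate, byte_rate, block_align, bits,
        b"data", len(data),
    )
    with open(filename, "wb") as f:
        f.write(header + data)
```

A file written this way can be read back with Python's standard `wave` module, which is a convenient check on the header fields.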
Usage Examples
# Run TTS with two llama.cpp servers (LLM and decoder)
python tools/tts/tts-outetts.py \
http://localhost:8080 \
http://localhost:8081 \
"Hello, this is a text to speech test."
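# Convert embeddings to audio programmatically
# (assumes `embeddings` is the n_codes x n_embd list returned by the decoder server,
# and that embd_to_audio and save_wav from the script are in scope)
n_codes = len(embeddings)
n_embd = len(embeddings[0])
audio = embd_to_audio(embeddings, n_codes=n_codes, n_embd=n_embd, n_thread=4)
save_wav("output.wav", audio, sample_rate=24000)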