Implementation:Ggml org Llama cpp TTS Outetts

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Text_To_Speech, Client
Last Updated	2026-02-15 00:00 GMT

Overview

Python-based TTS client that communicates with two llama.cpp server instances (an LLM server and an audio decoder server) to generate speech audio from text using OuteTTS models.

Description

This script sends HTTP requests to the llama.cpp server API to generate audio code tokens from a text prompt using the OuteTTS model, then requests embeddings for those codes from the decoder server. `embd_to_audio` converts the decoder's output embeddings to audio via an inverse STFT: it builds complex STFT frames from the frequency-domain data (magnitude and phase), transforms each frame to the time domain with `irfft`, applies a Hann window, and reconstructs the signal by overlap-add (`fold`). `save_wav` writes PCM WAV files, and `process_text` normalizes the input text for the model. Frames are processed in parallel with `ThreadPoolExecutor`.
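
The inverse STFT described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in `tts-outetts.py`: the names `hann_periodic`, `overlap_add`, and `istft_sketch` are hypothetical, the FFT and hop sizes are toy values, and the padding trim implied by `fold`'s `n_pad` argument is left out. Dividing by the summed squared window is one common way to undo the windowing gain and is assumed here.

```python
import numpy as np


def hann_periodic(size):
    # Periodic Hann window: a symmetric window of size + 1 with the last sample dropped.
    return np.hanning(size + 1)[:-1]


def overlap_add(frames, n_hop):
    # Sum frames of shape (n_frames, n_win) into one signal, each shifted by n_hop samples.
    n_frames, n_win = frames.shape
    out = np.zeros((n_frames - 1) * n_hop + n_win)
    for i, frame in enumerate(frames):
        out[i * n_hop:i * n_hop + n_win] += frame
    return out


def istft_sketch(mag, phase, n_fft, n_hop):
    # mag and phase have shape (n_frames, n_fft // 2 + 1).
    window = hann_periodic(n_fft)
    spectrum = mag * np.exp(1j * phase)                # complex STFT frames
    frames = np.fft.irfft(spectrum, n=n_fft, axis=-1)  # back to the time domain
    frames = frames * window                           # synthesis window
    audio = overlap_add(frames, n_hop)
    # Normalize by the overlap-added squared window (assumption, see lead-in).
    envelope = overlap_add(np.tile(window ** 2, (len(frames), 1)), n_hop)
    mask = envelope > 1e-10
    audio[mask] /= envelope[mask]
    return audio


# Toy usage: 8 frames with flat magnitude and zero phase.
n_fft, n_hop, n_frames = 16, 4, 8
mag = np.ones((n_frames, n_fft // 2 + 1))
phase = np.zeros_like(mag)
audio = istft_sketch(mag, phase, n_fft, n_hop)
```

With a matching analysis window, this reconstruction recovers the original samples wherever the window envelope is non-zero.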

Usage

Use this script as a Python-based TTS client that communicates with two running llama.cpp server instances: one for LLM token generation and one for the audio decoder, which returns embeddings. It demonstrates server-mode usage of TTS models and can serve as a starting point for integrating TTS into Python applications.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: tools/tts/tts-outetts.py
Lines: 1-299

Signature

def fill_hann_window(size, periodic=True):
    """Generate a Hann window of the given size."""

def irfft(n_fft, complex_input):
    """Compute inverse real FFT."""

def fold(buffer, n_out, n_win, n_hop, n_pad):
    """Overlap-add reconstruction from windowed frames."""

def process_frame(args):
    """Process a single STFT frame (for multithreaded execution)."""

def embd_to_audio(embd, n_codes, n_embd, n_thread=4):
    """Convert model embeddings to audio via inverse STFT."""

def save_wav(filename, audio_data, sample_rate):
    """Write audio data to a WAV file."""

def process_text(text: str):
    """Normalize and tokenize input text for the OuteTTS model."""
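
As an illustration of the kind of normalization `process_text` is described as doing, the sketch below lowercases the input, turns common separators into spaces, drops non-letter characters, and splits on whitespace. The function name `normalize_text` and the exact rules are assumptions made for this example; the rules in `tts-outetts.py` may differ.

```python
import re


def normalize_text(text):
    # Hypothetical normalization rules, for illustration only.
    text = text.lower()
    text = re.sub(r"[-_/,.\\]", " ", text)    # common separators become spaces
    text = re.sub(r"[^a-z\s]", "", text)      # keep only letters and whitespace
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text.split()


words = normalize_text("Hello, this is a text-to-speech test.")
```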

Import

import sys
import requests
import re
import struct
import numpy as np
from concurrent.futures import ThreadPoolExecutor
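
The `ThreadPoolExecutor` import supports the per-frame parallelism mentioned in the description. The sketch below shows the general pattern with a deliberately simple task (windowing each frame); `hann`, `window_frame`, and `window_frames_parallel` are hypothetical names, and the real `process_frame` does different work.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def hann(size):
    # Periodic Hann window computed directly from its definition.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / size) for n in range(size)]


def window_frame(args):
    # Hypothetical per-frame task: multiply one frame by the window.
    frame, window = args
    return [s * w for s, w in zip(frame, window)]


def window_frames_parallel(frames, n_thread=4):
    window = hann(len(frames[0]))
    with ThreadPoolExecutor(max_workers=n_thread) as executor:
        # executor.map yields results in input order, regardless of completion order.
        return list(executor.map(window_frame, [(f, window) for f in frames]))


frames = [[float(i)] * 8 for i in range(1, 6)]
windowed = window_frames_parallel(frames)
```

Packing the arguments into one tuple per frame keeps the task compatible with `executor.map`, which matches the single-argument shape of `process_frame(args)`. In CPython, threads mainly help when the per-frame work releases the global interpreter lock.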

I/O Contract

Inputs

Name	Type	Required	Description
host_llm	string (CLI arg)	Yes	URL of the llama.cpp LLM server (e.g., http://localhost:8080)
host_dec	string (CLI arg)	Yes	URL of the llama.cpp decoder server for embeddings
text	string (CLI arg)	Yes	Input text to convert to speech
embd	array of float	Yes	Model output embeddings (n_codes x n_embd) for audio conversion
n_codes	int	Yes	Number of spectrogram frames
n_embd	int	Yes	Embedding dimension per frame
n_thread	int	No	Number of worker threads for frame processing (default 4)

Outputs

Name	Type	Description
output.wav	file	Generated speech audio as a 16-bit PCM WAV file at a 24 kHz sample rate
embd_to_audio (return)	numpy array	Reconstructed time-domain audio samples
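
For reference, a mono 16-bit PCM WAV file at 24 kHz can be produced with Python's standard `wave` module, as sketched below. This is a swapped-in technique for illustration: `write_wav_16bit` is a hypothetical helper, not the script's `save_wav`, and the clamping and scaling of float samples to 16-bit integers are assumptions.

```python
import io
import struct
import wave


def write_wav_16bit(target, samples, sample_rate=24000):
    # target may be a file path or a writable binary file object.
    # Clamp each sample to [-1, 1] and scale to a signed 16-bit integer.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(ints), *ints))


# Write four samples to an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
write_wav_16bit(buf, [0.0, 0.5, -0.5, 1.0])
```

Passing a path such as "output.wav" instead of the buffer writes the same data to disk.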

Usage Examples

# Run TTS with two llama.cpp servers (LLM and decoder)
python tools/tts/tts-outetts.py \
    http://localhost:8080 \
    http://localhost:8081 \
    "Hello, this is a text to speech test."

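# Convert embeddings to audio programmatically.
# Assumes embd_to_audio and save_wav from tts-outetts.py are in scope and that
# `embeddings` holds the n_codes x n_embd floats returned by the decoder server.
audio = embd_to_audio(embeddings, n_codes=100, n_embd=1280, n_thread=4)
save_wav("output.wav", audio, sample_rate=24000)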

Related Pages

Principle:Ggml_org_Llama_cpp_Text_To_Speech
