Implementation:Ggml org Llama cpp TTS Outetts

From Leeroopedia
Knowledge Sources
Domains Text_To_Speech, Client
Last Updated 2026-02-15 00:00 GMT

Overview

Python-based TTS client that communicates with llama.cpp server instances to generate speech audio from text using OuteTTS models.

Description

This script sends HTTP requests to the llama.cpp server API to generate audio code tokens from a text prompt using an OuteTTS model. `process_text` normalizes the input text for the model. `embd_to_audio` converts the model's output embeddings to audio via an inverse STFT: the frequency-domain data (magnitude and phase) is combined into complex STFT frames, each frame is transformed to the time domain with `irfft` and multiplied by a Hann window, and the windowed frames are summed by overlap-add (`fold`) to reconstruct the time-domain signal. Frames are processed in parallel with `ThreadPoolExecutor` for performance, and `save_wav` writes the result as a PCM WAV file.
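The inverse STFT described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the script's own code: the names `hann_window` and `istft_sketch` are hypothetical, the input is assumed to be already split into magnitude and phase arrays of shape `(n_frames, n_fft // 2 + 1)`, and normalizing by the summed squared window is one common choice rather than a confirmed detail of `embd_to_audio`.

```python
import numpy as np

def hann_window(size, periodic=True):
    # Periodic Hann windows divide by `size`; symmetric ones by `size - 1`.
    denom = size if periodic else size - 1
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(size) / denom))

def istft_sketch(mag, phase, n_fft, n_hop):
    """Rebuild a signal from per-frame magnitude and phase (hypothetical helper)."""
    window = hann_window(n_fft)
    # Combine magnitude and phase into complex STFT frames.
    spec = mag * np.exp(1j * phase)
    # Invert each frame to the time domain, then apply the synthesis window.
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    n_out = (n_frames - 1) * n_hop + n_fft
    audio = np.zeros(n_out)
    envelope = np.zeros(n_out)
    # Overlap-add: sum each frame into the output at its hop offset.
    for i in range(n_frames):
        start = i * n_hop
        audio[start:start + n_fft] += frames[i]
        envelope[start:start + n_fft] += window ** 2
    # Normalize by the summed squared window wherever it is non-negligible.
    nonzero = envelope > 1e-8
    audio[nonzero] /= envelope[nonzero]
    return audio
```

The loop here is sequential for readability; the per-frame inverse transform is independent across frames, which is what makes it a natural candidate for a thread pool.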

Usage

Use this script as a Python-based TTS client that communicates with two running llama.cpp server instances: one for LLM token generation and one for audio decoder embedding. It demonstrates server-mode usage of TTS models and is useful for integration with Python applications.

Code Reference

Source Location

Signature

def fill_hann_window(size, periodic=True):
    """Generate a Hann window of the given size."""

def irfft(n_fft, complex_input):
    """Compute inverse real FFT."""

def fold(buffer, n_out, n_win, n_hop, n_pad):
    """Overlap-add reconstruction from windowed frames."""

def process_frame(args):
    """Process a single STFT frame (for multithreaded execution)."""

def embd_to_audio(embd, n_codes, n_embd, n_thread=4):
    """Convert model embeddings to audio via inverse STFT."""

def save_wav(filename, audio_data, sample_rate):
    """Write audio data to a WAV file."""

def process_text(text: str):
    """Normalize and tokenize input text for the OuteTTS model."""

Import

import sys
import requests
import re
import struct
import numpy as np
from concurrent.futures import ThreadPoolExecutor

I/O Contract

Inputs

Name | Type | Required | Description
--- | --- | --- | ---
host_llm | string (CLI arg) | Yes | URL of the llama.cpp LLM server (e.g., http://localhost:8080)
host_dec | string (CLI arg) | Yes | URL of the llama.cpp decoder server for embeddings
text | string (CLI arg) | Yes | Input text to convert to speech
embd | array of float | Yes | Model output embeddings (n_codes x n_embd) for audio conversion
n_codes | int | Yes | Number of spectrogram frames
n_embd | int | Yes | Embedding dimension per frame

Outputs

Name | Type | Description
--- | --- | ---
output.wav | file | Generated speech audio as a 16-bit PCM WAV file at a 24 kHz sample rate
embd_to_audio (return) | numpy array | Reconstructed time-domain audio samples
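
For reference, a 16-bit mono PCM WAV file like the one listed above can be written with only `struct` and NumPy. The sketch below is a generic illustration of the 44-byte RIFF/WAVE header layout; `write_wav_sketch` is a hypothetical name, and the script's own `save_wav` may differ in details such as clipping or scaling.

```python
import struct
import numpy as np

def write_wav_sketch(filename, audio, sample_rate):
    # Clip to [-1, 1], scale to signed 16-bit range, store little-endian.
    samples = np.clip(np.asarray(audio, dtype=np.float64), -1.0, 1.0)
    data = (samples * 32767).astype("<i2").tobytes()
    n_channels, bits = 1, 16
    block_align = n_channels * bits // 8
    byte_rate = sample_rate * block_align
    # Standard 44-byte header: RIFF chunk, "fmt " subchunk (PCM = 1), data subchunk.
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(data), b"WAVE",
        b"fmt ", 16, 1, n_channels, sample_rate, byte_rate, block_align, bits,
        b"data", len(data),
    )
    with open(filename, "wb") as f:
        f.write(header)
        f.write(data)
```

Calling `write_wav_sketch("output.wav", audio, 24000)` with float samples in the range [-1, 1] produces a file that standard players and Python's `wave` module can read.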

Usage Examples

# Run TTS with two llama.cpp servers (LLM and decoder)
python tools/tts/tts-outetts.py \
    http://localhost:8080 \
    http://localhost:8081 \
    "Hello, this is a text to speech test."
# Convert embeddings to audio programmatically
import numpy as np
audio = embd_to_audio(embeddings, n_codes=100, n_embd=1280, n_thread=4)
save_wav("output.wav", audio, sample_rate=24000)
