Implementation:Ggml org Llama cpp TTS
| Knowledge Sources | |
|---|---|
| Domains | Text_To_Speech, Audio |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Native C++ text-to-speech tool that generates audio WAV files from text using OuteTTS models with WavTokenizer audio decoding.
Description
Loads an OuteTTS language model (v0.2 or v0.3) and a WavTokenizer decoder model. The tool builds an OuteTTS-format prompt from the input text and speaker voice data, then runs the language model to generate audio code tokens. The WavTokenizer decoder turns these tokens into audio embeddings, which are converted to a waveform by an inverse STFT: magnitude/phase reconstruction, Hann windowing, and overlap-add synthesis. The tool also renders the audio spectrogram in the terminal using xterm256 colors and writes the result as a standard WAV file.
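The Hann windowing and overlap-add steps of the inverse STFT can be illustrated with a minimal sketch. This is not the code from tts.cpp: the function names `hann_window` and `overlap_add`, the periodic window definition, and the normalization by the accumulated squared window are assumptions chosen for illustration. The per-frame time-domain samples are assumed to come from an inverse FFT that is not shown here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Periodic Hann window of length n: w[i] = 0.5 * (1 - cos(2*pi*i/n)).
std::vector<float> hann_window(std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<float> w(n);
    for (std::size_t i = 0; i < n; ++i) {
        w[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / n)));
    }
    return w;
}

// Overlap-add synthesis. frames[k] holds the time-domain samples of frame k
// (each of length window.size()). Every frame is multiplied by the window,
// added at offset k * hop, and the sum is divided by the accumulated squared
// window so that overlapping regions keep unit gain.
std::vector<float> overlap_add(const std::vector<std::vector<float>> & frames,
                               const std::vector<float> & window,
                               std::size_t hop) {
    assert(hop > 0 && !window.empty());
    if (frames.empty()) {
        return {};
    }
    const std::size_t n_win = window.size();
    const std::size_t n_out = (frames.size() - 1) * hop + n_win;
    std::vector<float> out(n_out, 0.0f);
    std::vector<float> env(n_out, 0.0f);
    for (std::size_t k = 0; k < frames.size(); ++k) {
        assert(frames[k].size() == n_win);
        for (std::size_t i = 0; i < n_win; ++i) {
            out[k * hop + i] += frames[k][i] * window[i];
            env[k * hop + i] += window[i] * window[i];
        }
    }
    // Skip positions where the window envelope is (numerically) zero.
    for (std::size_t i = 0; i < n_out; ++i) {
        if (env[i] > 1e-8f) {
            out[i] /= env[i];
        }
    }
    return out;
}
```

With a hop smaller than the window length, neighbouring frames overlap and the envelope division cancels the amplitude modulation that windowing twice (analysis and synthesis) would otherwise introduce.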
Usage
Use this tool to generate speech audio from text input entirely within llama.cpp, without requiring Python dependencies. Supports automatic model downloading with `--tts-oute-default` for quick setup.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/tts/tts.cpp
- Lines: 1-1093
Signature
// Main entry point
int main(int argc, char ** argv);
// WAV file header structure
struct wav_header {
    char riff[4] = {'R', 'I', 'F', 'F'};
    uint32_t chunk_size;
    char wave[4] = {'W', 'A', 'V', 'E'};
    char fmt[4] = {'f', 'm', 't', ' '};
    // ... PCM format fields
};
// OuteTTS version enum
enum outetts_version { OUTETTS_V0_2, OUTETTS_V0_3 };
// Terminal color utilities
static int rgb2xterm256(int r, int g, int b);
static std::string set_xterm256_foreground(int r, int g, int b);
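To show what the terminal color utilities above are for, here is a simplified sketch of mapping an RGB value to an xterm 256-color index and building the matching foreground escape sequence. The helper names (`nearest_cube_level`, `rgb_to_xterm256_cube`, `xterm256_foreground`) are hypothetical and the logic is not taken from tts.cpp. It only uses the 6x6x6 color cube (indices 16 to 231) and ignores the grayscale ramp, so the real `rgb2xterm256` may return different indices for some inputs.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Index (0..5) of the xterm color-cube level closest to an 8-bit channel value.
int nearest_cube_level(int v) {
    assert(v >= 0 && v <= 255);
    static const int levels[6] = {0, 95, 135, 175, 215, 255};
    int best = 0;
    for (int i = 1; i < 6; ++i) {
        if (std::abs(levels[i] - v) < std::abs(levels[best] - v)) {
            best = i;
        }
    }
    return best;
}

// Hypothetical helper: maps 8-bit RGB to a color-cube index in 16..231.
int rgb_to_xterm256_cube(int r, int g, int b) {
    return 16 + 36 * nearest_cube_level(r)
              +  6 * nearest_cube_level(g)
              +      nearest_cube_level(b);
}

// Builds the SGR escape sequence that selects the 256-color foreground.
std::string xterm256_foreground(int r, int g, int b) {
    return "\033[38;5;" + std::to_string(rgb_to_xterm256_cube(r, g, b)) + "m";
}
```

A caller would print the returned sequence before a glyph and then emit `\033[0m` to reset the terminal attributes.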
Import
#include "arg.h"
#include "common.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"
#include <nlohmann/json.hpp>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m, --model | string | Yes, unless --tts-oute-default is given | Path to the OuteTTS language model (LLM) file |
| --tts-oute-default | flag | No | Automatically download and use default OuteTTS models |
| -p, --prompt | string | Yes | Text to convert to speech |
| --tts-voice | string | No | Path to speaker voice JSON data for voice cloning |
| -o, --output | string | No | Output WAV file path (default: output.wav) |
| --tts-wavtokenizer | string | No | Path to WavTokenizer decoder model |
Outputs
| Name | Type | Description |
|---|---|---|
| WAV file | file | Standard PCM WAV audio file containing the generated speech |
| spectrogram | terminal | Visual spectrogram display using xterm256 colors (to stderr) |
| return code | int | 0 on success, non-zero on failure |
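As an illustration of the WAV output, the sketch below serializes float samples into a 16-bit mono PCM WAV byte stream with the canonical 44-byte header. It is written under assumptions and is not the tool's own writer: the function names `build_wav_pcm16_mono` and `write_wav_file` are hypothetical, the sample rate is a caller-supplied parameter, and the header is emitted field by field in little-endian order rather than through the `wav_header` struct.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Append little-endian integers and 4-character chunk tags to a byte buffer.
void put_u16(std::vector<uint8_t> & out, uint16_t v) {
    out.push_back(static_cast<uint8_t>(v & 0xFF));
    out.push_back(static_cast<uint8_t>((v >> 8) & 0xFF));
}
void put_u32(std::vector<uint8_t> & out, uint32_t v) {
    put_u16(out, static_cast<uint16_t>(v & 0xFFFF));
    put_u16(out, static_cast<uint16_t>((v >> 16) & 0xFFFF));
}
void put_tag(std::vector<uint8_t> & out, const char * tag) {
    out.insert(out.end(), tag, tag + 4);
}

// Hypothetical helper: builds a complete mono 16-bit PCM WAV image in memory.
std::vector<uint8_t> build_wav_pcm16_mono(const std::vector<float> & samples,
                                          uint32_t sample_rate) {
    assert(samples.size() <= (0xFFFFFFFFu - 36u) / 2u);  // sizes must fit in 32 bits
    const uint32_t data_size = static_cast<uint32_t>(samples.size() * 2);
    std::vector<uint8_t> out;
    out.reserve(44 + data_size);
    put_tag(out, "RIFF"); put_u32(out, 36 + data_size);
    put_tag(out, "WAVE");
    put_tag(out, "fmt "); put_u32(out, 16);   // size of the fmt chunk body
    put_u16(out, 1);                          // audio format: PCM
    put_u16(out, 1);                          // channels: mono
    put_u32(out, sample_rate);
    put_u32(out, sample_rate * 2);            // bytes per second
    put_u16(out, 2);                          // block align
    put_u16(out, 16);                         // bits per sample
    put_tag(out, "data"); put_u32(out, data_size);
    for (float s : samples) {
        // Clamp to [-1, 1] before scaling to the int16 range.
        const float c = std::max(-1.0f, std::min(1.0f, s));
        put_u16(out, static_cast<uint16_t>(static_cast<int16_t>(c * 32767.0f)));
    }
    return out;
}

// Writes the byte image to disk; returns false on any I/O failure.
bool write_wav_file(const std::string & path, const std::vector<uint8_t> & bytes) {
    std::ofstream f(path, std::ios::binary);
    if (!f) {
        return false;
    }
    f.write(reinterpret_cast<const char *>(bytes.data()),
            static_cast<std::streamsize>(bytes.size()));
    return f.good();
}
```

Building the bytes explicitly avoids any dependence on struct padding or host endianness, at the cost of a few more lines than writing a header struct directly.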
Usage Examples
# Generate speech with default models (auto-download)
./tts --tts-oute-default -p "Hello, this is a test of text to speech."
# Generate speech with specific models
./tts -m outetts-v0.3.gguf --tts-wavtokenizer wavtokenizer.gguf \
    -p "The quick brown fox jumps over the lazy dog." -o speech.wav
# Use a custom speaker voice
./tts -m outetts-v0.3.gguf --tts-wavtokenizer wavtokenizer.gguf \
    --tts-voice speaker.json -p "Custom voice synthesis."