Implementation:Ggml org Llama cpp Mtmd Audio

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Multimodal, Audio
Last Updated	2026-02-15 00:00 GMT

Overview

Implements audio preprocessing for multimodal models, converting raw audio samples into mel spectrograms suitable for audio encoder input.

Description

Provides `mtmd_audio_cache` methods for building sin/cos lookup tables, Hann windows, and mel filterbank matrices on the Slaney mel scale (matching librosa defaults). Implements `mtmd_audio_preprocessor_whisper` for Whisper-style log-mel spectrogram computation via a short-time Fourier transform (STFT) with configurable FFT size, hop length, and number of mel bins. Also implements `mtmd_audio_preprocessor_conformer` for Conformer-style preprocessing and `mtmd_audio_streaming_istft` for streaming inverse STFT (spectrogram-to-audio conversion). Parts of the code are adapted from whisper.cpp.
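To make the window and mel-scale terms concrete, here is a minimal, self-contained sketch of a Hann window generator and the Slaney-style Hz-to-mel mapping that librosa uses by default (`htk=False`). The names `make_hann_window` and `hz_to_mel_slaney` are illustrative and are not taken from `mtmd-audio.cpp`; the actual implementation may differ in detail.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not from mtmd-audio.cpp): Hann window of `length` samples.
// periodic = true  divides the angle by length     (usual choice for STFT analysis)
// periodic = false divides the angle by length - 1 (symmetric window)
std::vector<float> make_hann_window(int length, bool periodic) {
    const double pi = 3.14159265358979323846;
    std::vector<float> w(length > 0 ? length : 0);
    const int denom = periodic ? length : length - 1;
    for (int i = 0; i < length; ++i) {
        // A one-sample symmetric window has denom == 0; return 1.0 in that case.
        w[i] = denom > 0 ? (float) (0.5 * (1.0 - std::cos(2.0 * pi * i / denom))) : 1.0f;
    }
    return w;
}

// Hypothetical helper: Slaney mel scale, linear below 1 kHz and logarithmic above.
double hz_to_mel_slaney(double hz) {
    const double f_sp        = 200.0 / 3.0;          // Hz per mel in the linear region
    const double min_log_hz  = 1000.0;               // start of the logarithmic region
    const double min_log_mel = min_log_hz / f_sp;    // 15 mel at 1 kHz
    const double logstep     = std::log(6.4) / 27.0; // log-region step size
    if (hz >= min_log_hz) {
        return min_log_mel + std::log(hz / min_log_hz) / logstep;
    }
    return hz / f_sp;
}
```

A mel filterbank is then typically built by spacing triangular filters evenly on this mel axis between `fmin` and `fmax` and mapping their edges back to FFT bins.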

Usage

Use this module when working with multimodal models that accept audio input (such as Ultravox). It converts raw audio waveforms into the mel spectrogram format expected by audio encoders in the CLIP-based multimodal pipeline.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: tools/mtmd/mtmd-audio.cpp
Lines: 1-730

Signature

// Audio cache and preprocessing functions
void mtmd_audio_cache::fill_sin_cos_table(int n);
void mtmd_audio_cache::fill_hann_window(int length, bool periodic);
void mtmd_audio_cache::fill_mel_filterbank_matrix(int n_mel, int n_fft,
    int sample_rate, float fmin, float fmax, bool slaney_area_norm, float scale);

// Preprocessor implementations
bool mtmd_audio_preprocessor_whisper(/* params */);
bool mtmd_audio_preprocessor_conformer(/* params */);

// Streaming inverse STFT
void mtmd_audio_streaming_istft(/* params */);

Import

#include "mtmd-audio.h"
#include <cmath>
#include <cstdint>
#include <vector>
#include <thread>

I/O Contract

Inputs

Name	Type	Required	Description
audio_samples	float vector	Yes	Raw audio waveform samples (typically 16 kHz mono)
n_mel	int	Yes	Number of mel frequency bins (e.g., 80 or 128)
n_fft	int	Yes	FFT window size
sample_rate	int	Yes	Audio sample rate in Hz
hop_length	int	Yes	Hop length between STFT frames
fmin	float	No	Minimum frequency for mel filterbank (default: 0)
fmax	float	No	Maximum frequency for mel filterbank (default: sample_rate/2)
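The inputs `n_fft` and `hop_length` together determine how many STFT frames a given number of samples yields. The sketch below assumes no padding and counts only full windows; `stft_frame_count` is a hypothetical helper, and the real preprocessors may pad the signal and therefore produce a different count.

```cpp
#include <cstddef>

// Hypothetical helper: number of full STFT frames when the signal is not padded.
// Each frame covers n_fft samples and successive frames start hop_length apart.
std::size_t stft_frame_count(std::size_t n_samples, std::size_t n_fft, std::size_t hop_length) {
    if (n_fft == 0 || hop_length == 0 || n_samples < n_fft) {
        return 0; // not enough samples for a single window
    }
    return 1 + (n_samples - n_fft) / hop_length;
}
```

For example, under these assumptions one second of 16 kHz audio with `n_fft = 400` and `hop_length = 160` gives 98 full frames; implementations that pad the signal produce more.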

Outputs

Name	Type	Description
mel_spectrogram	float vector	Log-mel spectrogram matrix (n_mel x n_frames), ready for audio encoder input
audio_waveform	float vector	Reconstructed audio waveform (from inverse STFT, for TTS use cases)
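The "log" in the log-mel output refers to a logarithmic compression of the mel energies. The following sketch shows the normalization used by the reference Whisper preprocessing (log10, an 8-decade dynamic-range floor, then an affine rescale). Whether `mtmd-audio.cpp` applies exactly these constants is an assumption, and `log_mel_normalize` is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: convert mel energies to normalized log-mel values in place.
void log_mel_normalize(std::vector<float> & mel) {
    if (mel.empty()) {
        return;
    }
    float max_val = -1e20f;
    for (float & v : mel) {
        v = std::log10(std::max(v, 1e-10f)); // clamp to avoid log10(0)
        max_val = std::max(max_val, v);
    }
    const float floor_val = max_val - 8.0f;   // keep at most 8 decades below the peak
    for (float & v : mel) {
        v = (std::max(v, floor_val) + 4.0f) / 4.0f; // rescale to a range near [-1, 1.5]
    }
}
```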

Usage Examples

#include "mtmd-audio.h"

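// Typical Whisper-style parameters (assumed here for illustration)
const int n_fft       = 400;    // FFT window size in samples
const int sample_rate = 16000;  // sample rate in Hz
const int n_mel       = 80;     // number of mel bins

// Initialize audio cache with lookup tables
mtmd_audio_cache cache;
cache.fill_sin_cos_table(n_fft);
cache.fill_hann_window(n_fft, true);  // periodic window
cache.fill_mel_filterbank_matrix(n_mel, n_fft, sample_rate, 0.0f, sample_rate / 2.0f, true, 1.0f);

// Load raw mono samples. load_audio() is a placeholder, not part of mtmd-audio.h.
std::vector<float> audio_samples = load_audio("input.wav");
// Pass audio_samples to mtmd_audio_preprocessor_whisper to obtain the mel spectrogram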
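
One common use of a precomputed sin/cos table is to avoid calling trigonometric functions inside the inner loop of the Fourier transform. The sketch below illustrates the idea with a naive O(n^2) DFT; the `sin_cos_table` type and both functions are hypothetical and do not reflect the actual layout of `mtmd_audio_cache` or the transform it uses.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical table type; the real mtmd_audio_cache layout may differ.
struct sin_cos_table {
    std::vector<float> sin_vals;
    std::vector<float> cos_vals;
};

// Precompute sin and cos of 2*pi*i/n for i in [0, n).
sin_cos_table make_sin_cos_table(std::size_t n) {
    const double pi = 3.14159265358979323846;
    sin_cos_table t;
    t.sin_vals.resize(n);
    t.cos_vals.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double theta = 2.0 * pi * (double) i / (double) n;
        t.sin_vals[i] = (float) std::sin(theta);
        t.cos_vals[i] = (float) std::cos(theta);
    }
    return t;
}

// Naive DFT of a real frame using the table instead of sin/cos calls.
// Output is interleaved: out[2*k] is the real part, out[2*k + 1] the imaginary part.
// Assumes the table was built with n == in.size().
std::vector<float> dft_with_table(const std::vector<float> & in, const sin_cos_table & t) {
    const std::size_t n = in.size();
    std::vector<float> out(2 * n, 0.0f);
    for (std::size_t k = 0; k < n; ++k) {
        float re = 0.0f;
        float im = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            const std::size_t idx = (k * j) % n; // the angle index wraps modulo n
            re += in[j] * t.cos_vals[idx];
            im -= in[j] * t.sin_vals[idx];
        }
        out[2 * k]     = re;
        out[2 * k + 1] = im;
    }
    return out;
}
```

In a real STFT each frame would first be multiplied by the Hann window, and a fast transform would normally replace the quadratic loop.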

Related Pages

Principle:Ggml_org_Llama_cpp_Multimodal
