Implementation:Ggml org Llama cpp Mtmd Audio
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Audio |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements audio preprocessing for multimodal models, converting raw audio samples into mel spectrograms suitable for audio encoder input.
Description
Provides `mtmd_audio_cache` methods for building sin/cos lookup tables, Hann windows, and mel filterbank matrices; the filterbank uses the Slaney mel scale, matching librosa's defaults. `mtmd_audio_preprocessor_whisper` computes Whisper-style log-mel spectrograms via a short-time Fourier transform (STFT) with configurable FFT size, hop length, and number of mel bins. `mtmd_audio_preprocessor_conformer` performs Conformer-style preprocessing, and `mtmd_audio_streaming_istft` implements a streaming inverse STFT for spectrogram-to-audio conversion. The code is partially adapted from whisper.cpp.
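For reference, the Slaney mel scale is linear below 1 kHz and logarithmic above it. The following is a minimal sketch of the conversion, using the constants from librosa's default (non-HTK) definition. The function names `hz_to_mel_slaney` and `mel_to_hz_slaney` are hypothetical and are not taken from mtmd-audio.cpp.

```cpp
#include <cassert>
#include <cmath>

// Slaney-style mel scale constants (librosa default, htk=False).
static const float F_SP        = 200.0f / 3.0f;            // Hz per mel in the linear region
static const float MIN_LOG_HZ  = 1000.0f;                  // start of the logarithmic region
static const float MIN_LOG_MEL = MIN_LOG_HZ / F_SP;        // mel value at 1 kHz (about 15)
static const float LOG_STEP    = std::log(6.4f) / 27.0f;   // log-region step size

// Convert a frequency in Hz to Slaney mels.
float hz_to_mel_slaney(float hz) {
    assert(hz >= 0.0f);
    if (hz < MIN_LOG_HZ) {
        return hz / F_SP;
    }
    return MIN_LOG_MEL + std::log(hz / MIN_LOG_HZ) / LOG_STEP;
}

// Convert Slaney mels back to a frequency in Hz.
float mel_to_hz_slaney(float mel) {
    assert(mel >= 0.0f);
    if (mel < MIN_LOG_MEL) {
        return mel * F_SP;
    }
    return MIN_LOG_HZ * std::exp((mel - MIN_LOG_MEL) * LOG_STEP);
}
```

A filterbank builder would typically place `n_mel + 2` points evenly in mel space between `fmin` and `fmax`, map them back to Hz with `mel_to_hz_slaney`, and use them as the edges of triangular filters.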
Usage
Use this module when working with multimodal models that accept audio input (such as Ultravox). It converts raw audio waveforms into the mel spectrogram format expected by audio encoders in the CLIP-based multimodal pipeline.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/mtmd/mtmd-audio.cpp
- Lines: 1-730
Signature
// Audio cache and preprocessing functions
void mtmd_audio_cache::fill_sin_cos_table(int n);
void mtmd_audio_cache::fill_hann_window(int length, bool periodic);
void mtmd_audio_cache::fill_mel_filterbank_matrix(int n_mel, int n_fft,
    int sample_rate, float fmin, float fmax, bool slaney_area_norm, float scale);
// Preprocessor implementations
bool mtmd_audio_preprocessor_whisper(/* params */);
bool mtmd_audio_preprocessor_conformer(/* params */);
// Streaming inverse STFT
void mtmd_audio_streaming_istft(/* params */);
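The `periodic` flag of `fill_hann_window` presumably selects between the periodic (DFT-even) and symmetric forms of the Hann window. The sketch below illustrates that distinction under this assumption. `make_hann_window` is a hypothetical free function, not the `mtmd_audio_cache` method itself.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Build a Hann window of the given length.
// periodic = true  divides by length     (suited to STFT analysis frames)
// periodic = false divides by length - 1 (symmetric, both endpoints are zero)
std::vector<float> make_hann_window(int length, bool periodic) {
    assert(length > 1);
    const double pi    = 3.14159265358979323846;
    const int    denom = periodic ? length : length - 1;
    std::vector<float> window(length);
    for (int i = 0; i < length; ++i) {
        window[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / denom)));
    }
    return window;
}
```

The periodic form is the usual choice for STFT because successive overlapping frames then sum smoothly; the symmetric form is more common in filter design.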
Import
#include "mtmd-audio.h"
#include <cmath>
#include <cstdint>
#include <vector>
#include <thread>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| audio_samples | float vector | Yes | Raw audio waveform samples (typically 16 kHz mono) |
| n_mel | int | Yes | Number of mel frequency bins (e.g., 80 or 128) |
| n_fft | int | Yes | FFT window size |
| sample_rate | int | Yes | Audio sample rate in Hz |
| hop_length | int | Yes | Hop length between STFT frames |
| fmin | float | No | Minimum frequency for mel filterbank (default: 0) |
| fmax | float | No | Maximum frequency for mel filterbank (default: sample_rate/2) |
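These inputs determine the time dimension of the spectrogram. As a rough sketch, assuming no padding is applied to the waveform (the actual preprocessors may pad or reflect the signal, which changes the count), the number of complete STFT frames and the default `fmax` can be computed as follows. Both helper names are hypothetical.

```cpp
#include <cassert>

// Number of complete STFT frames that fit in n_samples, assuming no padding.
int count_stft_frames(int n_samples, int n_fft, int hop_length) {
    assert(n_fft > 0 && hop_length > 0);
    if (n_samples < n_fft) {
        return 0;  // not enough samples for a single frame
    }
    return 1 + (n_samples - n_fft) / hop_length;
}

// Default upper edge of the mel filterbank: the Nyquist frequency.
float default_fmax(int sample_rate) {
    assert(sample_rate > 0);
    return 0.5f * static_cast<float>(sample_rate);
}
```

For example, one second of 16 kHz audio with `n_fft = 400` and `hop_length = 160` gives `1 + (16000 - 400) / 160` frames under this no-padding assumption.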
Outputs
| Name | Type | Description |
|---|---|---|
| mel_spectrogram | float vector | Log-mel spectrogram matrix (n_mel x n_frames), ready for audio encoder input |
| audio_waveform | float vector | Reconstructed audio waveform (from inverse STFT, for TTS use cases) |
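To illustrate what "log-mel" means for the first output, here is a minimal sketch of the log compression and normalization used by the reference Whisper pipeline: log10 with a floor, clamping to 8 below the maximum, then an affine rescale. Whether mtmd-audio.cpp applies exactly these constants is an assumption, and `log_mel_normalize` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// In-place Whisper-style log-mel normalization of mel power values.
void log_mel_normalize(std::vector<float> & mel) {
    assert(!mel.empty());
    float max_val = -1e20f;
    // Step 1: log10 with a small floor to avoid log(0).
    for (float & v : mel) {
        v = std::log10(std::max(v, 1e-10f));
        max_val = std::max(max_val, v);
    }
    // Step 2: limit dynamic range to 8 log10 units below the peak,
    // then shift and scale by (x + 4) / 4.
    const float floor_val = max_val - 8.0f;
    for (float & v : mel) {
        v = (std::max(v, floor_val) + 4.0f) / 4.0f;
    }
}
```

The clamp keeps near-silent bins from dominating the encoder input with very large negative values.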
Usage Examples
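#include "mtmd-audio.h"
#include <vector>
// FFT window size; 400 is the value Whisper uses and is assumed here for illustration
const int n_fft = 400;
// Initialize the audio cache with its lookup tables
mtmd_audio_cache cache;
cache.fill_sin_cos_table(n_fft);
cache.fill_hann_window(n_fft, true);  // periodic window
// 80 mel bins, 16 kHz sample rate, 0-8000 Hz band, Slaney area normalization, scale 1.0
cache.fill_mel_filterbank_matrix(80, n_fft, 16000, 0.0f, 8000.0f, true, 1.0f);
// Load raw audio; load_audio is a hypothetical helper, not part of mtmd-audio.h
std::vector<float> audio_samples = load_audio("input.wav");
// Pass audio_samples to mtmd_audio_preprocessor_whisper to obtain the mel spectrogram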
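To show why a sin/cos lookup table is worth precomputing, the sketch below builds such a table and uses it in a naive O(n²) DFT that returns the power spectrum of one frame. It is an illustration only: the struct and function names are hypothetical, and the real implementation may use a faster FFT and a different table layout.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Precomputed sin/cos of 2*pi*i/n for i in [0, n).
struct sin_cos_table {
    std::vector<float> sin_vals;
    std::vector<float> cos_vals;
};

sin_cos_table make_sin_cos_table(int n) {
    assert(n > 0);
    const double pi = 3.14159265358979323846;
    sin_cos_table t;
    t.sin_vals.resize(n);
    t.cos_vals.resize(n);
    for (int i = 0; i < n; ++i) {
        const double theta = 2.0 * pi * i / n;
        t.sin_vals[i] = static_cast<float>(std::sin(theta));
        t.cos_vals[i] = static_cast<float>(std::cos(theta));
    }
    return t;
}

// Naive DFT power spectrum of one (already windowed) frame, bins 0..n/2.
std::vector<float> power_spectrum(const std::vector<float> & frame,
                                  const sin_cos_table & t) {
    const int n = static_cast<int>(frame.size());
    assert(n > 0 && static_cast<int>(t.sin_vals.size()) == n);
    std::vector<float> out(n / 2 + 1);
    for (int k = 0; k <= n / 2; ++k) {
        float re = 0.0f;
        float im = 0.0f;
        for (int j = 0; j < n; ++j) {
            const int idx = (k * j) % n;  // table lookup replaces sin/cos calls
            re += frame[j] * t.cos_vals[idx];
            im -= frame[j] * t.sin_vals[idx];
        }
        out[k] = re * re + im * im;
    }
    return out;
}
```

In a full pipeline, each frame would be multiplied by the Hann window first, and the resulting power spectrum would be projected through the mel filterbank matrix.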