Implementation:Ggml org Llama cpp Mtmd Audio

From Leeroopedia
Revision as of 12:41, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ggml_org_Llama_cpp_Mtmd_Audio.md)
Knowledge Sources
Domains Multimodal, Audio
Last Updated 2026-02-15 00:00 GMT

Overview

Implements audio preprocessing for multimodal models, converting raw audio samples into mel spectrograms suitable for audio encoder input.

Description

Provides `mtmd_audio_cache` methods that build sin/cos lookup tables, Hann windows, and mel filterbank matrices on the Slaney mel scale (matching librosa defaults). Implements `mtmd_audio_preprocessor_whisper`, which computes a Whisper-style log-mel spectrogram via STFT with configurable FFT size, hop length, and number of mel bins. Also implements `mtmd_audio_preprocessor_conformer` for Conformer-style preprocessing and `mtmd_audio_streaming_istft` for streaming inverse STFT (spectrogram-to-audio conversion). The code is partially adapted from whisper.cpp.
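To make the window and mel-scale terms above concrete, here is a minimal, self-contained sketch of a Hann window (periodic or symmetric) and of the Slaney Hz/mel conversion as used by librosa. The helper names `make_hann_window`, `hz_to_mel_slaney`, and `mel_to_hz_slaney` are hypothetical and are not part of `mtmd-audio.h`; the actual `mtmd_audio_cache` implementation may differ in detail.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not from mtmd-audio.h): Hann window of `length` samples.
// periodic = true  -> denominator is length     (usual choice for STFT framing)
// periodic = false -> denominator is length - 1 (symmetric window)
std::vector<float> make_hann_window(int length, bool periodic) {
    const double pi = std::acos(-1.0);
    std::vector<float> w(length > 0 ? length : 0);
    if (length == 1) {
        w[0] = 1.0f;  // a single-sample window is conventionally 1
        return w;
    }
    const int denom = periodic ? length : length - 1;
    for (int i = 0; i < length; ++i) {
        w[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / denom)));
    }
    return w;
}

// Slaney mel scale: linear below 1000 Hz (200/3 Hz per mel),
// logarithmic above, with 27 mels per factor of 6.4 in frequency.
double hz_to_mel_slaney(double hz) {
    const double f_sp = 200.0 / 3.0;
    const double min_log_hz = 1000.0;
    const double min_log_mel = 15.0;  // 1000 Hz / (200/3 Hz per mel)
    const double logstep = std::log(6.4) / 27.0;
    if (hz >= min_log_hz) {
        return min_log_mel + std::log(hz / min_log_hz) / logstep;
    }
    return hz / f_sp;
}

// Inverse of hz_to_mel_slaney.
double mel_to_hz_slaney(double mel) {
    const double f_sp = 200.0 / 3.0;
    const double min_log_hz = 1000.0;
    const double min_log_mel = 15.0;
    const double logstep = std::log(6.4) / 27.0;
    if (mel >= min_log_mel) {
        return min_log_hz * std::exp(logstep * (mel - min_log_mel));
    }
    return f_sp * mel;
}
```

A filterbank builder would typically space `n_mel + 2` points evenly between `hz_to_mel_slaney(fmin)` and `hz_to_mel_slaney(fmax)`, map them back with `mel_to_hz_slaney`, and use them as the edges of triangular filters.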

Usage

Use this module when working with multimodal models that accept audio input (such as Ultravox). It converts raw audio waveforms into the mel spectrogram format expected by audio encoders in the CLIP-based multimodal pipeline.
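As an illustration of the waveform-to-spectrum step, the sketch below computes the power spectrum of one windowed frame with a naive O(n^2) DFT driven by a precomputed sin/cos table. It only shows the general technique; `SinCosTable`, `make_sin_cos_table`, and `frame_power_spectrum` are hypothetical names, and the module's own STFT code may be organized differently (for example, using an FFT).

```cpp
#include <cmath>
#include <vector>

// Hypothetical lookup table: sin and cos of 2*pi*i/n for i in [0, n).
struct SinCosTable {
    std::vector<float> sin_vals;
    std::vector<float> cos_vals;
};

SinCosTable make_sin_cos_table(int n) {
    const double pi = std::acos(-1.0);
    SinCosTable t;
    t.sin_vals.resize(n);
    t.cos_vals.resize(n);
    for (int i = 0; i < n; ++i) {
        const double theta = 2.0 * pi * i / n;
        t.sin_vals[i] = static_cast<float>(std::sin(theta));
        t.cos_vals[i] = static_cast<float>(std::cos(theta));
    }
    return t;
}

// Power spectrum |X[k]|^2 for k in [0, n/2] of a single frame.
// `frame`, `window`, and the table must all have the same length n.
std::vector<float> frame_power_spectrum(const std::vector<float>& frame,
                                        const std::vector<float>& window,
                                        const SinCosTable& table) {
    const int n = static_cast<int>(frame.size());
    std::vector<float> power(n / 2 + 1, 0.0f);
    for (int k = 0; k <= n / 2; ++k) {
        float re = 0.0f;
        float im = 0.0f;
        for (int j = 0; j < n; ++j) {
            // The angle 2*pi*k*j/n is periodic in n, so index the table modulo n.
            const int idx = static_cast<int>((static_cast<long long>(k) * j) % n);
            const float x = frame[j] * window[j];
            re += x * table.cos_vals[idx];
            im -= x * table.sin_vals[idx];
        }
        power[k] = re * re + im * im;
    }
    return power;
}
```

Multiplying each frame's power spectrum by the mel filterbank matrix then yields one column of the mel spectrogram.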

Code Reference

Source Location

Signature

// Audio cache and preprocessing functions
void mtmd_audio_cache::fill_sin_cos_table(int n);
void mtmd_audio_cache::fill_hann_window(int length, bool periodic);
void mtmd_audio_cache::fill_mel_filterbank_matrix(int n_mel, int n_fft,
    int sample_rate, float fmin, float fmax, bool slaney_area_norm, float scale);

// Preprocessor implementations
bool mtmd_audio_preprocessor_whisper(/* params */);
bool mtmd_audio_preprocessor_conformer(/* params */);

// Streaming inverse STFT
void mtmd_audio_streaming_istft(/* params */);

Import

#include "mtmd-audio.h"
#include <cmath>
#include <cstdint>
#include <vector>
#include <thread>

I/O Contract

Inputs

Name | Type | Required | Description
audio_samples | float vector | Yes | Raw audio waveform samples (typically 16 kHz mono)
n_mel | int | Yes | Number of mel frequency bins (e.g., 80 or 128)
n_fft | int | Yes | FFT window size
sample_rate | int | Yes | Audio sample rate in Hz
hop_length | int | Yes | Hop length between STFT frames
fmin | float | No | Minimum frequency for mel filterbank (default: 0)
fmax | float | No | Maximum frequency for mel filterbank (default: sample_rate/2)

Outputs

Name | Type | Description
mel_spectrogram | float vector | Log-mel spectrogram matrix (n_mel x n_frames), ready for audio encoder input
audio_waveform | float vector | Reconstructed audio waveform (from inverse STFT, for TTS use cases)
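For orientation on the log-mel output, the sketch below shows the log scaling used in the publicly documented Whisper recipe (log10 with a floor of 1e-10, clamping to within 8 of the maximum, then `(x + 4) / 4`), together with a frame-count helper. Both function names are hypothetical, the frame count assumes no padding, and whether this module uses exactly these constants or padding rules is not stated here.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper: in-place Whisper-style log scaling of mel energies.
// Steps: log10 with a 1e-10 floor, clamp to (max - 8), then (x + 4) / 4.
void whisper_log_mel_inplace(std::vector<float>& mel) {
    if (mel.empty()) {
        return;
    }
    float max_val = std::numeric_limits<float>::lowest();
    for (float& v : mel) {
        v = std::log10(std::max(v, 1e-10f));
        max_val = std::max(max_val, v);
    }
    const float floor_val = max_val - 8.0f;
    for (float& v : mel) {
        v = (std::max(v, floor_val) + 4.0f) / 4.0f;
    }
}

// Hypothetical helper: number of STFT frames when no padding is applied.
// Assumption: a frame is emitted only when a full n_fft window fits.
int count_frames(int n_samples, int n_fft, int hop_length) {
    if (hop_length <= 0 || n_fft <= 0 || n_samples < n_fft) {
        return 0;
    }
    return 1 + (n_samples - n_fft) / hop_length;
}
```

Under these assumptions, one second of 16 kHz audio with `n_fft = 400` and `hop_length = 160` gives `count_frames(16000, 400, 160) == 98`, so an 80-bin spectrogram would hold 80 x 98 values.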

Usage Examples

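#include "mtmd-audio.h"

#include <vector>

// Assumed Whisper-style FFT size; use the value your model expects
const int n_fft = 400;

// Initialize audio cache with lookup tables
mtmd_audio_cache cache;
cache.fill_sin_cos_table(n_fft);
cache.fill_hann_window(n_fft, true);  // periodic window
// 80 mel bins, 16 kHz audio, 0 to 8000 Hz, Slaney area normalization, scale 1.0
cache.fill_mel_filterbank_matrix(80, n_fft, 16000, 0.0f, 8000.0f, true, 1.0f);

// Compute Whisper-style mel spectrogram from raw audio
// load_audio is a placeholder for your own loader; it is not part of mtmd-audio.h
std::vector<float> audio_samples = load_audio("input.wav");
// Use mtmd_audio_preprocessor_whisper to convert to mel spectrogram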
