Implementation:Ggml org Llama cpp Mtmd Audio Header
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Audio |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Header declaring audio preprocessing types and interfaces for the multimodal module, supporting Whisper and Conformer audio encoder architectures.
Description
This header defines three data structures. `mtmd_audio_mel` holds mel spectrogram data: the length `n_len`, the original length `n_len_org`, the mel bin count `n_mel`, and the `data` vector. `mtmd_audio_mel_filters` holds a filterbank matrix described by `n_mel` and `n_fft`. `mtmd_audio_cache` holds reusable precomputed values: sin/cos lookup tables, Hann window coefficients, and mel filter banks. The header also declares the abstract `mtmd_audio_preprocessor` base class, with pure virtual `initialize()` and `preprocess()` methods, and two concrete implementations, `mtmd_audio_preprocessor_whisper` and `mtmd_audio_preprocessor_conformer`. Finally, it declares `mtmd_audio_streaming_istft`, a streaming inverse STFT that processes one frame at a time and can flush the remaining samples at end of stream.
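To make the cached Hann window coefficients concrete, the following is a minimal sketch of the conventional Hann definition with a `periodic` switch, mirroring the parameters of `fill_hann_window(int length, bool periodic)`. The helper name `make_hann_window` is hypothetical, and the actual implementation behind the header may differ in detail.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper, not part of mtmd-audio.h.
// periodic = true  divides by `length` (the usual choice for STFT framing);
// periodic = false divides by `length - 1` (symmetric window).
std::vector<float> make_hann_window(int length, bool periodic) {
    assert(length > 0);
    std::vector<float> window(length);
    const int denom = periodic ? length : length - 1;
    if (denom == 0) { // symmetric window of length 1
        window[0] = 1.0f;
        return window;
    }
    const double two_pi = 2.0 * std::acos(-1.0);
    for (int i = 0; i < length; ++i) {
        window[i] = static_cast<float>(0.5 * (1.0 - std::cos(two_pi * i / denom)));
    }
    return window;
}
```

Computing the window once and reusing it for every frame is the kind of work a cache structure such as `mtmd_audio_cache` is meant to avoid repeating.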
Usage
Use this header when implementing audio preprocessing for multimodal models that accept audio input. Instantiate the appropriate preprocessor subclass (Whisper or Conformer) based on the model architecture, or use `mtmd_audio_streaming_istft` for streaming audio reconstruction.
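The selection step can be sketched as a small factory that returns the base-class interface. Everything in this sketch is a stand-in: `preprocessor_stub`, `whisper_stub`, `conformer_stub`, `mel_stub`, the `audio_arch` enum, and `make_preprocessor` are hypothetical names used for illustration. The real classes take a `const clip_ctx *`, and how the architecture is detected is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Stand-in types for illustration only; the real ones live in mtmd-audio.h.
struct mel_stub {
    int n_len = 0;
    int n_mel = 0;
    std::vector<float> data;
};

struct preprocessor_stub {
    virtual ~preprocessor_stub() = default;
    virtual void initialize() = 0;
    virtual bool preprocess(const float * samples, std::size_t n_samples,
                            std::vector<mel_stub> & output) = 0;
    virtual std::string name() const = 0; // hypothetical, for the demo only
};

// Shared placeholder body: appends one empty segment, performs no real DSP.
static bool emit_placeholder(const float * samples, std::size_t n_samples,
                             std::vector<mel_stub> & output) {
    if (samples == nullptr || n_samples == 0) {
        return false; // nothing to process
    }
    output.push_back(mel_stub{});
    return true;
}

struct whisper_stub : preprocessor_stub {
    void initialize() override {}
    bool preprocess(const float * samples, std::size_t n_samples,
                    std::vector<mel_stub> & output) override {
        return emit_placeholder(samples, n_samples, output);
    }
    std::string name() const override { return "whisper"; }
};

struct conformer_stub : preprocessor_stub {
    void initialize() override {}
    bool preprocess(const float * samples, std::size_t n_samples,
                    std::vector<mel_stub> & output) override {
        return emit_placeholder(samples, n_samples, output);
    }
    std::string name() const override { return "conformer"; }
};

enum class audio_arch { whisper, conformer }; // hypothetical selector

// Pick the subclass once; callers then use only the base interface.
std::unique_ptr<preprocessor_stub> make_preprocessor(audio_arch arch) {
    switch (arch) {
        case audio_arch::whisper:
            return std::unique_ptr<preprocessor_stub>(new whisper_stub());
        case audio_arch::conformer:
            return std::unique_ptr<preprocessor_stub>(new conformer_stub());
    }
    assert(false && "unhandled audio_arch");
    return nullptr;
}
```

Holding the result through the base class keeps the calling code identical for both architectures: call `initialize()` once, then `preprocess()` per audio buffer.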
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/mtmd/mtmd-audio.h
- Lines: 1-113
Signature
struct mtmd_audio_mel {
    int n_len;
    int n_len_org;
    int n_mel;
    std::vector<float> data;
};
struct mtmd_audio_mel_filters {
    int32_t n_mel;
    int32_t n_fft;
    std::vector<float> data;
};
struct mtmd_audio_cache {
    void fill_sin_cos_table(int n);
    void fill_hann_window(int length, bool periodic);
    void fill_mel_filterbank_matrix(int n_mel, int n_fft, int sample_rate,
                                    float fmin = 0.0f, float fmax = -1.0f,
                                    bool slaney_area_norm = true, float scale = 1.0f);
};
struct mtmd_audio_preprocessor {
    mtmd_audio_preprocessor(const clip_ctx * ctx);
    virtual ~mtmd_audio_preprocessor() = default;
    virtual void initialize() = 0;
    virtual bool preprocess(const float * samples, size_t n_samples,
                            std::vector<mtmd_audio_mel> & output) = 0;
};
struct mtmd_audio_preprocessor_whisper : mtmd_audio_preprocessor { ... };
struct mtmd_audio_preprocessor_conformer : mtmd_audio_preprocessor { ... };
struct mtmd_audio_streaming_istft {
    mtmd_audio_streaming_istft(int n_fft, int hop_length);
    void reset();
    std::vector<float> process_frame(const float * frame_spectrum);
    std::vector<float> flush();
};
Import
#include "mtmd-audio.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ctx | const clip_ctx * | Yes | CLIP context providing model hyperparameters for preprocessor configuration |
| samples | const float * | Yes | Raw audio samples (PCM float32) |
| n_samples | size_t | Yes | Number of audio samples |
| n_fft | int | Yes | FFT size for ISTFT streaming |
| hop_length | int | Yes | Hop length for ISTFT overlap-add reconstruction |
| frame_spectrum | const float * | Yes | Single STFT frame [n_fft_bins x 2] interleaved real/imag |
Outputs
| Name | Type | Description |
|---|---|---|
| preprocess (output param) | std::vector<mtmd_audio_mel> | Mel spectrogram segments ready for encoding |
| preprocess (return) | bool | True on successful preprocessing |
| process_frame | std::vector<float> | Up to hop_length reconstructed audio samples per frame |
| flush | std::vector<float> | Remaining audio samples at end of stream |
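The interleaved real/imag layout expected for `frame_spectrum` can be illustrated with a small packing helper. This is a sketch under assumptions: `pack_interleaved` and `assumed_n_fft_bins` are hypothetical names, and the bin count of `n_fft / 2 + 1` is the usual count for a real-input FFT rather than something this page confirms.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of mtmd-audio.h.
// Packs complex STFT bins into [re0, im0, re1, im1, ...].
std::vector<float> pack_interleaved(const std::vector<std::complex<float>> & bins) {
    std::vector<float> out;
    out.reserve(bins.size() * 2);
    for (const std::complex<float> & bin : bins) {
        out.push_back(bin.real());
        out.push_back(bin.imag());
    }
    return out;
}

// Assumption: a real-input FFT of size n_fft yields n_fft / 2 + 1 unique bins.
// How n_fft_bins is actually derived is not stated in this reference.
std::size_t assumed_n_fft_bins(int n_fft) {
    assert(n_fft > 0);
    return static_cast<std::size_t>(n_fft) / 2 + 1;
}
```

Under that assumption, a frame for `n_fft = 1280` would hold 641 bins, so the buffer passed as `frame_spectrum` would contain 1282 floats.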
Usage Examples
// Create and initialize a Whisper-style audio preprocessor.
// `ctx` is a const clip_ctx * obtained from an already loaded model.
mtmd_audio_preprocessor_whisper preprocessor(ctx);
preprocessor.initialize();
// Preprocess raw audio into mel spectrograms
std::vector<mtmd_audio_mel> mel_output;
if (!preprocessor.preprocess(audio_samples, n_samples, mel_output)) {
    // handle preprocessing failure
}
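// Streaming ISTFT for audio reconstruction (n_fft = 1280, hop_length = 320)
mtmd_audio_streaming_istft istft(1280, 320);
// Call once per incoming STFT frame
auto samples = istft.process_frame(stft_frame);
// Call once at end of stream to drain the remaining samples
auto remaining = istft.flush();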
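Why each frame yields at most `hop_length` samples, with a remainder left for `flush()`, follows from overlap-add reconstruction. The sketch below shows only that bookkeeping. `overlap_add_stream` is a hypothetical stand-in that accepts frames already converted to the time domain; it omits the inverse FFT, windowing, and normalization, and it is not the `mtmd_audio_streaming_istft` implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for illustration; NOT the mtmd_audio_streaming_istft implementation.
struct overlap_add_stream {
    int n_fft;
    int hop_length;
    std::vector<float> overlap; // running sum of pending samples, n_fft long

    overlap_add_stream(int n_fft_, int hop_length_)
        : n_fft(n_fft_), hop_length(hop_length_), overlap(n_fft_, 0.0f) {
        assert(hop_length > 0 && hop_length <= n_fft);
    }

    // Add one time-domain frame of n_fft samples, then emit the first
    // hop_length samples, which no later frame can still modify.
    std::vector<float> process_frame(const float * frame) {
        for (int i = 0; i < n_fft; ++i) {
            overlap[i] += frame[i];
        }
        std::vector<float> out(overlap.begin(), overlap.begin() + hop_length);
        // Shift the buffer left by hop_length and zero the freed tail.
        std::copy(overlap.begin() + hop_length, overlap.end(), overlap.begin());
        std::fill(overlap.end() - hop_length, overlap.end(), 0.0f);
        return out;
    }

    // Emit the n_fft - hop_length samples still pending, then clear state.
    std::vector<float> flush() {
        std::vector<float> out(overlap.begin(), overlap.end() - hop_length);
        std::fill(overlap.begin(), overlap.end(), 0.0f);
        return out;
    }
};
```

In this sketch, N frames produce N * hop_length samples from `process_frame` plus n_fft - hop_length samples from `flush`, which matches the total length of N overlapping frames.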