Implementation:Ggml org Llama cpp Mtmd Audio Header

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Multimodal, Audio
Last Updated	2026-02-15 00:00 GMT

Overview

Header declaring audio preprocessing types and interfaces for the multimodal module, supporting Whisper and Conformer audio encoder architectures.

Description

This header defines `mtmd_audio_mel` for mel spectrogram data (with length, original length, mel bin count, and data vector), `mtmd_audio_mel_filters` for filterbank matrices, and `mtmd_audio_cache` for reusable computation caches including sin/cos lookup tables, Hann window coefficients, and mel filter banks. It declares the abstract `mtmd_audio_preprocessor` base class with virtual `initialize()` and `preprocess()` methods, plus concrete implementations `mtmd_audio_preprocessor_whisper` and `mtmd_audio_preprocessor_conformer`. The header also declares `mtmd_audio_streaming_istft` for streaming inverse STFT with frame-by-frame processing and flush capabilities.
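The `periodic` flag on `fill_hann_window` suggests the usual distinction between a periodic and a symmetric Hann window. The sketch below illustrates that distinction using the textbook definition; `make_hann_window` is a hypothetical free function written for this page, and it is an assumption that the header's implementation follows the same formula.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper, not part of mtmd-audio.h.
// periodic == true  : denominator is N     (window repeats cleanly, common for STFT)
// periodic == false : denominator is N - 1 (symmetric window, last sample is 0)
std::vector<float> make_hann_window(int length, bool periodic) {
    std::vector<float> window(length > 0 ? length : 0);
    if (length <= 0) {
        return window;
    }
    if (length == 1) {
        window[0] = 1.0f;  // degenerate case, avoids division by zero below
        return window;
    }
    const double pi    = 3.14159265358979323846;
    const double denom = periodic ? length : length - 1;
    for (int i = 0; i < length; ++i) {
        window[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / denom)));
    }
    return window;
}
```

Caching the window once, as `mtmd_audio_cache` does, avoids recomputing the cosine for every frame.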

Usage

Use this header when implementing audio preprocessing for multimodal models that accept audio input. Instantiate the appropriate preprocessor subclass (Whisper or Conformer) based on the model architecture, or use `mtmd_audio_streaming_istft` for streaming audio reconstruction.
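Choosing a subclass at run time can be sketched as a small factory that returns the abstract base type. Everything below is a stand-in so the example compiles without llama.cpp: `preprocessor_stub`, `whisper_stub`, `conformer_stub`, `audio_arch`, and `make_preprocessor` are hypothetical names, and the real classes are constructed from a `const clip_ctx *` rather than with no arguments.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for mtmd_audio_mel (fields reduced for brevity).
struct mel_stub { int n_len = 0; int n_mel = 0; std::vector<float> data; };

// Stand-in mirroring the documented virtual interface.
struct preprocessor_stub {
    virtual ~preprocessor_stub() = default;
    virtual void initialize() = 0;
    virtual bool preprocess(const float * samples, std::size_t n_samples,
                            std::vector<mel_stub> & output) = 0;
};

// Shared stub behaviour: refuses to run before initialize() or on empty
// input, otherwise emits one empty segment. No real DSP happens here.
struct stub_impl : preprocessor_stub {
    bool ready = false;
    void initialize() override { ready = true; }
    bool preprocess(const float * samples, std::size_t n_samples,
                    std::vector<mel_stub> & output) override {
        if (!ready || samples == nullptr || n_samples == 0) {
            return false;
        }
        output.emplace_back();
        return true;
    }
};
struct whisper_stub   : stub_impl {};
struct conformer_stub : stub_impl {};

enum class audio_arch { whisper, conformer };  // hypothetical selector

// Callers only see the base interface, so the rest of the pipeline does
// not need to know which encoder architecture is in use.
std::unique_ptr<preprocessor_stub> make_preprocessor(audio_arch arch) {
    switch (arch) {
        case audio_arch::whisper:
            return std::unique_ptr<preprocessor_stub>(new whisper_stub());
        case audio_arch::conformer:
            return std::unique_ptr<preprocessor_stub>(new conformer_stub());
    }
    return nullptr;
}
```

Holding the result as a pointer to the base class keeps the call sites for `initialize()` and `preprocess()` identical for both architectures.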

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: tools/mtmd/mtmd-audio.h
Lines: 1-113

Signature

struct mtmd_audio_mel {
    int n_len;
    int n_len_org;
    int n_mel;
    std::vector<float> data;
};

struct mtmd_audio_mel_filters {
    int32_t n_mel;
    int32_t n_fft;
    std::vector<float> data;
};

struct mtmd_audio_cache {
    void fill_sin_cos_table(int n);
    void fill_hann_window(int length, bool periodic);
    void fill_mel_filterbank_matrix(int n_mel, int n_fft, int sample_rate, float fmin = 0.0f, float fmax = -1.0f, bool slaney_area_norm = true, float scale = 1.0f);
};

struct mtmd_audio_preprocessor {
    mtmd_audio_preprocessor(const clip_ctx * ctx);
    virtual ~mtmd_audio_preprocessor() = default;
    virtual void initialize() = 0;
    virtual bool preprocess(const float * samples, size_t n_samples, std::vector<mtmd_audio_mel> & output) = 0;
};

struct mtmd_audio_preprocessor_whisper : mtmd_audio_preprocessor { ... };
struct mtmd_audio_preprocessor_conformer : mtmd_audio_preprocessor { ... };

struct mtmd_audio_streaming_istft {
    mtmd_audio_streaming_istft(int n_fft, int hop_length);
    void reset();
    std::vector<float> process_frame(const float * frame_spectrum);
    std::vector<float> flush();
};
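
Because `mtmd_audio_mel::data` is a flat vector, reading one value requires an index computation. The sketch below assumes a bin-major layout (all frames of mel bin 0, then all frames of bin 1, and so on); the signature above does not state the layout, so verify it against the implementation before relying on it. `mel_view` and `mel_at` are hypothetical names used only here.

```cpp
#include <cstddef>
#include <vector>

// Stand-in with the same fields as the documented mtmd_audio_mel.
struct mel_view {
    int n_len;      // number of frames stored
    int n_len_org;  // number of frames before any padding
    int n_mel;      // number of mel bins
    std::vector<float> data;
};

// Assumption: data holds n_mel rows of n_len frames each (bin-major).
float mel_at(const mel_view & mel, int bin, int frame) {
    return mel.data[static_cast<std::size_t>(bin) * mel.n_len + frame];
}
```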

Import

#include "mtmd-audio.h"

I/O Contract

Inputs

Name	Type	Required	Description
ctx	const clip_ctx *	Yes	CLIP context providing model hyperparameters for preprocessor configuration
samples	const float *	Yes	Raw audio samples (PCM float32)
n_samples	size_t	Yes	Number of audio samples
n_fft	int	Yes	FFT size for ISTFT streaming
hop_length	int	Yes	Hop length for ISTFT overlap-add reconstruction
frame_spectrum	const float *	Yes	Single STFT frame of n_fft_bins complex values, stored as interleaved real/imaginary floats (n_fft_bins x 2)
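
As an illustration of the interleaved layout expected for `frame_spectrum`, the following sketch flattens complex bins into a float buffer. `pack_interleaved` and `one_sided_bins` are hypothetical helpers; in particular, the relation `n_fft / 2 + 1` is the usual one-sided bin count for a real signal and is an assumption here, since this page does not say how `n_fft_bins` is derived.

```cpp
#include <complex>
#include <vector>

// Hypothetical helper: turns complex bins into [re0, im0, re1, im1, ...].
std::vector<float> pack_interleaved(const std::vector<std::complex<float>> & bins) {
    std::vector<float> out;
    out.reserve(bins.size() * 2);
    for (const auto & b : bins) {
        out.push_back(b.real());
        out.push_back(b.imag());
    }
    return out;
}

// Assumption: one-sided spectrum of a real-valued signal.
int one_sided_bins(int n_fft) {
    return n_fft / 2 + 1;
}
```

The resulting vector's `data()` pointer has the `const float *` shape that `process_frame` takes.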

Outputs

Name	Type	Description
preprocess (output param)	std::vector<mtmd_audio_mel>	Mel spectrogram segments ready for encoding
preprocess (return)	bool	True on successful preprocessing
process_frame	std::vector<float>	Up to hop_length reconstructed audio samples per frame
flush	std::vector<float>	Remaining audio samples at end of stream
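
The split between `process_frame` (up to `hop_length` samples) and `flush` (the remainder) matches how streaming overlap-add usually works: each new frame finalizes only the oldest `hop_length` samples, and the overlapping tail is held back until more frames arrive or the stream ends. The sketch below shows that bookkeeping only. `streaming_ola` is a hypothetical type, it takes frames that are already in the time domain, and it omits the inverse FFT and window normalization a real ISTFT needs; it assumes `0 < hop <= n_fft`.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical overlap-add accumulator, not the mtmd implementation.
struct streaming_ola {
    int n_fft;
    int hop;
    std::vector<float> acc;  // running sum of the frames that still overlap

    streaming_ola(int n_fft_, int hop_) : n_fft(n_fft_), hop(hop_), acc(n_fft_, 0.0f) {}

    // frame: n_fft time-domain samples. Returns the hop samples that no
    // later frame can touch any more.
    std::vector<float> push(const std::vector<float> & frame) {
        for (int i = 0; i < n_fft; ++i) {
            acc[i] += frame[i];
        }
        std::vector<float> out(acc.begin(), acc.begin() + hop);
        // Slide the accumulator left by hop and clear the freed tail.
        std::copy(acc.begin() + hop, acc.end(), acc.begin());
        std::fill(acc.end() - hop, acc.end(), 0.0f);
        return out;
    }

    // End of stream: emit the n_fft - hop samples still held back.
    std::vector<float> flush() {
        std::vector<float> out(acc.begin(), acc.begin() + (n_fft - hop));
        std::fill(acc.begin(), acc.end(), 0.0f);
        return out;
    }
};
```

With `n_fft = 4`, `hop = 2`, and two frames pushed, the sketch yields 2 + 2 samples from `push` and 2 more from `flush`, for a total of (frames - 1) * hop + n_fft samples.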

Usage Examples

// ctx is a const clip_ctx * obtained from an already loaded model
// Create and initialize a Whisper-style audio preprocessor
mtmd_audio_preprocessor_whisper preprocessor(ctx);
preprocessor.initialize();

// Preprocess raw audio into mel spectrograms; preprocess() returns false on failure
std::vector<mtmd_audio_mel> mel_output;
if (!preprocessor.preprocess(audio_samples, n_samples, mel_output)) {
    // handle the error
}

// Streaming ISTFT for audio reconstruction (n_fft = 1280, hop_length = 320)
mtmd_audio_streaming_istft istft(1280, 320);
auto samples = istft.process_frame(stft_frame);  // call once per STFT frame
auto remaining = istft.flush();                  // call once at end of stream

Related Pages

Principle:Ggml_org_Llama_cpp_Multimodal
