Implementation:Ggml org Llama cpp Mtmd Audio Header

From Leeroopedia
Knowledge Sources
Domains Multimodal, Audio
Last Updated 2026-02-15 00:00 GMT

Overview

Header declaring audio preprocessing types and interfaces for the multimodal module, supporting Whisper and Conformer audio encoder architectures.

Description

This header defines `mtmd_audio_mel` for mel spectrogram data (with length, original length, mel bin count, and data vector), `mtmd_audio_mel_filters` for filterbank matrices, and `mtmd_audio_cache` for reusable computation caches including sin/cos lookup tables, Hann window coefficients, and mel filter banks. It declares the abstract `mtmd_audio_preprocessor` base class with virtual `initialize()` and `preprocess()` methods, plus concrete implementations `mtmd_audio_preprocessor_whisper` and `mtmd_audio_preprocessor_conformer`. The header also declares `mtmd_audio_streaming_istft` for streaming inverse STFT with frame-by-frame processing and flush capabilities.
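The `periodic` flag on `fill_hann_window` suggests the cache can hold either of the two common Hann window variants. The page does not show the implementation, so the following is only a minimal sketch of the textbook formula under that assumption. `hann_window` is a hypothetical stand-alone helper, not part of the header.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper for illustration; not the library's fill_hann_window.
// periodic == true  divides by length      (usual choice for STFT analysis)
// periodic == false divides by length - 1  (symmetric window, both ends are zero)
std::vector<float> hann_window(int length, bool periodic) {
    std::vector<float> w(length > 0 ? length : 0);
    if (length <= 0) {
        return w;
    }
    if (length == 1) {
        w[0] = 1.0f; // degenerate case: a single-sample window
        return w;
    }
    const double pi = 3.14159265358979323846;
    const int denom = periodic ? length : length - 1;
    for (int i = 0; i < length; ++i) {
        w[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / denom)));
    }
    return w;
}
```

For a length of 4, the periodic variant evaluates to 0, 0.5, 1, 0.5 and the symmetric variant to 0, 0.75, 0.75, 0.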

Usage

Use this header when implementing audio preprocessing for multimodal models that accept audio input. Instantiate the appropriate preprocessor subclass (Whisper or Conformer) based on the model architecture, or use `mtmd_audio_streaming_istft` for streaming audio reconstruction.
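Choosing a subclass by model architecture can be sketched with a small factory. The code below does not use the real header: every type in it (`audio_arch`, `mel_sketch`, `preprocessor_sketch` and its two subclasses) is a hypothetical stand-in that only mirrors the shape of the declared interface, and the method bodies are placeholders rather than actual mel computation.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// All names below are hypothetical stand-ins, not declarations from mtmd-audio.h.
enum class audio_arch { whisper, conformer };

struct mel_sketch {
    int n_len = 0;
    int n_mel = 0;
    std::vector<float> data;
};

// Mirrors the shape of the abstract base: initialize() once, then preprocess().
struct preprocessor_sketch {
    virtual ~preprocessor_sketch() = default;
    virtual void initialize() = 0;
    virtual bool preprocess(const float * samples, size_t n_samples,
                            std::vector<mel_sketch> & output) = 0;
};

struct whisper_sketch : preprocessor_sketch {
    bool ready = false;
    void initialize() override { ready = true; }
    bool preprocess(const float * samples, size_t n_samples,
                    std::vector<mel_sketch> & output) override {
        if (!ready || samples == nullptr || n_samples == 0) return false;
        output.emplace_back(); // placeholder: real code would compute a mel spectrogram
        return true;
    }
};

struct conformer_sketch : preprocessor_sketch {
    bool ready = false;
    void initialize() override { ready = true; }
    bool preprocess(const float * samples, size_t n_samples,
                    std::vector<mel_sketch> & output) override {
        if (!ready || samples == nullptr || n_samples == 0) return false;
        output.emplace_back(); // placeholder, as above
        return true;
    }
};

// Callers hold the base pointer and never need to know the concrete type.
std::unique_ptr<preprocessor_sketch> make_preprocessor(audio_arch arch) {
    switch (arch) {
        case audio_arch::whisper:
            return std::unique_ptr<preprocessor_sketch>(new whisper_sketch());
        case audio_arch::conformer:
            return std::unique_ptr<preprocessor_sketch>(new conformer_sketch());
    }
    return nullptr;
}
```

The benefit of this arrangement is that the calling code depends only on the two virtual methods, so supporting another encoder architecture means adding one subclass and one `case`.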

Code Reference

Source Location

Signature

struct mtmd_audio_mel {
    int n_len;
    int n_len_org;
    int n_mel;
    std::vector<float> data;
};

struct mtmd_audio_mel_filters {
    int32_t n_mel;
    int32_t n_fft;
    std::vector<float> data;
};

struct mtmd_audio_cache {
    void fill_sin_cos_table(int n);
    void fill_hann_window(int length, bool periodic);
    void fill_mel_filterbank_matrix(int n_mel, int n_fft, int sample_rate, float fmin = 0.0f, float fmax = -1.0f, bool slaney_area_norm = true, float scale = 1.0f);
};

struct mtmd_audio_preprocessor {
    mtmd_audio_preprocessor(const clip_ctx * ctx);
    virtual ~mtmd_audio_preprocessor() = default;
    virtual void initialize() = 0;
    virtual bool preprocess(const float * samples, size_t n_samples, std::vector<mtmd_audio_mel> & output) = 0;
};

struct mtmd_audio_preprocessor_whisper : mtmd_audio_preprocessor { ... };
struct mtmd_audio_preprocessor_conformer : mtmd_audio_preprocessor { ... };

struct mtmd_audio_streaming_istft {
    mtmd_audio_streaming_istft(int n_fft, int hop_length);
    void reset();
    std::vector<float> process_frame(const float * frame_spectrum);
    std::vector<float> flush();
};

Import

#include "mtmd-audio.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| ctx | const clip_ctx * | Yes | CLIP context providing model hyperparameters for preprocessor configuration |
| samples | const float * | Yes | Raw audio samples (PCM float32) |
| n_samples | size_t | Yes | Number of audio samples |
| n_fft | int | Yes | FFT size for ISTFT streaming |
| hop_length | int | Yes | Hop length for ISTFT overlap-add reconstruction |
| frame_spectrum | const float * | Yes | Single STFT frame [n_fft_bins x 2] interleaved real/imag |

Outputs

| Name | Type | Description |
|---|---|---|
| preprocess (output param) | std::vector<mtmd_audio_mel> | Mel spectrogram segments ready for encoding |
| preprocess (return) | bool | True on successful preprocessing |
| process_frame | std::vector<float> | Up to hop_length reconstructed audio samples per frame |
| flush | std::vector<float> | Remaining audio samples at end of stream |
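The split between `process_frame` (roughly one hop of samples per call) and `flush` (the leftover tail) follows from overlap-add reconstruction. The sketch below illustrates only the buffering stage. It assumes each frame has already been inverse-transformed and windowed into `n_fft` time-domain samples, and that `0 < hop <= n_fft`. `overlap_add_sketch` is a hypothetical type, not the `mtmd_audio_streaming_istft` implementation, which may differ (for example in window normalization or in how many samples the first frames yield).

```cpp
#include <algorithm>
#include <vector>

// Hypothetical illustration of overlap-add buffering; not the library's code.
struct overlap_add_sketch {
    int n_fft;
    int hop;
    std::vector<float> buf; // samples that later frames may still add to

    overlap_add_sketch(int n_fft_, int hop_)
        : n_fft(n_fft_), hop(hop_), buf(n_fft_, 0.0f) {}

    // frame: n_fft time-domain samples (inverse FFT and windowing assumed done)
    std::vector<float> push(const float * frame) {
        for (int i = 0; i < n_fft; ++i) {
            buf[i] += frame[i];
        }
        // The first `hop` samples receive no further contributions: emit them.
        std::vector<float> out(buf.begin(), buf.begin() + hop);
        // Shift the remainder to the front and clear the vacated tail.
        std::copy(buf.begin() + hop, buf.end(), buf.begin());
        std::fill(buf.end() - hop, buf.end(), 0.0f);
        return out;
    }

    // End of stream: everything still pending is final, so return it.
    std::vector<float> flush() {
        std::vector<float> out(buf.begin(), buf.end() - hop);
        std::fill(buf.begin(), buf.end(), 0.0f);
        return out;
    }
};
```

With `n_fft = 4` and `hop = 2`, pushing two frames of all ones emits `{1, 1}` and then `{2, 2}`, and `flush()` returns the remaining `{1, 1}`, for a total of `(frames - 1) * hop + n_fft = 6` samples.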

Usage Examples

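// Create and initialize a Whisper-style audio preprocessor.
// `ctx` is a `const clip_ctx *` taken from an already loaded model
// (`clip_ctx` is the type name and cannot itself be passed as an argument).
mtmd_audio_preprocessor_whisper preprocessor(ctx);
preprocessor.initialize();

// Preprocess raw audio into mel spectrograms and check the result
std::vector<mtmd_audio_mel> mel_output;
if (!preprocessor.preprocess(audio_samples, n_samples, mel_output)) {
    // Preprocessing failed: do not use mel_output
}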

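// Streaming ISTFT for audio reconstruction (n_fft = 1280, hop_length = 320)
mtmd_audio_streaming_istft istft(1280, 320);
// Call once per incoming STFT frame (interleaved real/imag floats)
auto samples = istft.process_frame(stft_frame);
// Call once at end of stream to collect the remaining samples
auto remaining = istft.flush();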
