Implementation:Ollama Ollama Mtmd Audio

From Leeroopedia
Knowledge Sources
Domains Multimodal, AudioProcessing
Last Updated 2025-02-15 00:00 GMT

Overview

Audio preprocessing implementation, ported from whisper.cpp, that converts raw audio samples into mel spectrograms for the Whisper encoder.

Description

Implements the Whisper-style audio preprocessing pipeline: an FFT-based short-time Fourier transform (STFT), mel filter bank application, chunking of long inputs, padding to the required length, and log-mel normalization. A file-scope global cache (g_cache, of type mtmd_audio_global_cache) holds the precomputed sin/cos tables, the Hann window coefficients, and the mel filter bank matrix. The mtmd_audio_preprocessor_whisper class supports configurable sample rates, FFT sizes, hop lengths, and mel bin counts. The file contains both a naive DFT and a Cooley-Tukey FFT implementation.
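The Hann window step can be illustrated with a minimal sketch. The helper name make_hann_window is hypothetical and is not the signature used in mtmd-audio.cpp; it only mirrors the (length, periodic) parameters of fill_hann_window listed in the signature. The sketch assumes the usual convention that a periodic window divides by N and a symmetric window divides by N - 1.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper for illustration only (not the mtmd-audio.cpp function).
// periodic = true  -> denominator N     (common choice for STFT framing)
// periodic = false -> denominator N - 1 (symmetric window)
std::vector<float> make_hann_window(int length, bool periodic) {
    assert(length > 1);
    const double pi = 3.14159265358979323846;
    const int denom = periodic ? length : length - 1;
    std::vector<float> w(length);
    for (int i = 0; i < length; ++i) {
        w[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / denom)));
    }
    return w;
}
```

With a periodic window the last coefficient is slightly above zero, because the zero that would close the cycle belongs to the start of the next period. The symmetric form ends at zero.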

Usage

Called by the mtmd library when processing audio inputs to convert raw PCM float samples into mel spectrogram tensors before feeding them to the Whisper encoder.

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/tools/mtmd/mtmd-audio.cpp
  • Lines: 1-537

Signature

struct mtmd_audio_mel_filters {
    int32_t n_mel;
    int32_t n_fft;
    std::vector<float> data;
};

static struct mtmd_audio_global_cache {
    std::vector<float> sin_vals;
    std::vector<float> cos_vals;
    std::vector<float> hann_window;
    mtmd_audio_mel_filters filters;

    void fill_sin_cos_table(int n);
    void fill_hann_window(int length, bool periodic);
    void fill_mel_filterbank_matrix(int n_mel, int n_fft, int sample_rate,
        float fmin, float fmax, bool slaney_area_norm, float scale);
} g_cache;

static void dft(const float * in, int N, float * out);
static void fft(float * in, int N, float * out);
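The relationship between the two transforms can be sketched as follows. This is an illustrative version under stated assumptions, not the code from mtmd-audio.cpp: the names dft_naive and fft_radix2 are hypothetical, the functions use std::vector instead of raw pointers, and the interleaved output layout (real part at index 2k, imaginary part at index 2k + 1) is assumed. The FFT here falls back to the naive DFT for odd lengths, which is one common way to handle sizes that are not powers of two.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

static const double kPi = 3.14159265358979323846;

// Naive O(N^2) DFT of a real signal.
// Output is interleaved: out[2k] = Re(X[k]), out[2k + 1] = Im(X[k]).
std::vector<float> dft_naive(const std::vector<float> & in) {
    const int N = static_cast<int>(in.size());
    std::vector<float> out(2 * N);
    for (int k = 0; k < N; ++k) {
        double re = 0.0, im = 0.0;
        for (int n = 0; n < N; ++n) {
            const double angle = 2.0 * kPi * k * n / N;
            re += in[n] * std::cos(angle);
            im -= in[n] * std::sin(angle);
        }
        out[2 * k]     = static_cast<float>(re);
        out[2 * k + 1] = static_cast<float>(im);
    }
    return out;
}

// Recursive radix-2 Cooley-Tukey FFT of a real signal, same output layout.
// Odd lengths are delegated to dft_naive (an assumption of this sketch).
std::vector<float> fft_radix2(const std::vector<float> & in) {
    const int N = static_cast<int>(in.size());
    if (N == 0) return {};
    if (N == 1) return { in[0], 0.0f };
    if (N % 2 != 0) return dft_naive(in);

    const int half = N / 2;
    std::vector<float> even(half), odd(half);
    for (int i = 0; i < half; ++i) {
        even[i] = in[2 * i];
        odd[i]  = in[2 * i + 1];
    }
    const std::vector<float> E = fft_radix2(even);
    const std::vector<float> O = fft_radix2(odd);

    std::vector<float> out(2 * N);
    for (int k = 0; k < half; ++k) {
        // Twiddle factor exp(-2*pi*i*k/N) applied to the odd half.
        const double angle = 2.0 * kPi * k / N;
        const double wr = std::cos(angle);
        const double wi = -std::sin(angle);
        const double tr = wr * O[2 * k]     - wi * O[2 * k + 1];
        const double ti = wr * O[2 * k + 1] + wi * O[2 * k];
        out[2 * k]              = static_cast<float>(E[2 * k] + tr);
        out[2 * k + 1]          = static_cast<float>(E[2 * k + 1] + ti);
        out[2 * (k + half)]     = static_cast<float>(E[2 * k] - tr);
        out[2 * (k + half) + 1] = static_cast<float>(E[2 * k + 1] - ti);
    }
    return out;
}
```

Both functions should agree up to floating-point error on any input, which makes the naive DFT a convenient reference when testing the faster path.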

Import

#include "mtmd-audio.h"

I/O Contract

Inputs

Name            Type      Required   Description
audio_samples   float *   Yes        Raw PCM audio samples (mono, float32)
n_samples       size_t    Yes        Number of audio samples
sample_rate     int       Yes        Audio sample rate in Hz (e.g., 16000)
n_mel_bins      int       Yes        Number of mel frequency bins
n_fft           int       Yes        FFT window size

Outputs

Name              Type             Description
mel_spectrogram   clip_image_f32   Mel spectrogram tensor [n_mel x n_frames]
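The [n_mel x n_frames] layout and the log-mel normalization step can be sketched as below. The struct mel_sketch and the function log_mel_normalize are hypothetical names introduced for illustration. The row-major indexing and the constants (a 1e-10 floor, an 8.0 dynamic-range clamp, and the (x + 4) / 4 mapping) follow the scheme used by the original Whisper preprocessing. That mtmd-audio.cpp uses exactly these values is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical container for illustration; row-major [n_mel x n_frames],
// so the value for mel bin m at frame t is data[m * n_frames + t].
struct mel_sketch {
    int n_mel = 0;
    int n_frames = 0;
    std::vector<float> data;
};

// In-place Whisper-style normalization of mel power values (assumed constants):
//   1. take log10 with a floor of 1e-10,
//   2. clamp every value to at least (global max - 8),
//   3. map with (x + 4) / 4.
void log_mel_normalize(mel_sketch & mel) {
    assert(mel.data.size() == static_cast<size_t>(mel.n_mel) * mel.n_frames);
    if (mel.data.empty()) return;

    float max_val = -1e20f;
    for (float & v : mel.data) {
        v = std::log10(std::max(v, 1e-10f));
        max_val = std::max(max_val, v);
    }
    const float floor_val = max_val - 8.0f;
    for (float & v : mel.data) {
        v = (std::max(v, floor_val) + 4.0f) / 4.0f;
    }
}
```

The clamp limits the dynamic range to 8 orders of magnitude below the loudest bin, so silent regions do not produce arbitrarily large negative values.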

Usage Examples

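// Audio preprocessing is triggered internally by mtmd_tokenize
// when an audio bitmap is encountered.
// Assumes ctx (mtmd_context *) and text (const mtmd_input_text *) are already set up.
mtmd_bitmap * audio = mtmd_bitmap_init_from_audio(n_samples, pcm_data);
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
// mtmd_tokenize takes an array of const bitmap pointers
const mtmd_bitmap * bitmaps[] = { audio };
int32_t res = mtmd_tokenize(ctx, chunks, text, bitmaps, 1);
// On success (res == 0) the audio has been converted to mel spectrograms internally
mtmd_input_chunks_free(chunks);
mtmd_bitmap_free(audio);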
