Implementation:Ollama Ollama Mtmd Audio

Knowledge Sources	Ollama
Domains	Multimodal, AudioProcessing
Last Updated	2025-02-15 00:00 GMT

Overview

Audio preprocessing implementation that converts raw audio samples into mel spectrograms for the Whisper encoder, ported from whisper.cpp.

Description

Implements the Whisper-style audio preprocessing pipeline: an FFT-based short-time Fourier transform (STFT), mel filter bank application, chunking of long inputs, padding to the required lengths, and log-mel normalization. A global cache (mtmd_audio_global_cache) holds the precomputed sin/cos tables, Hann window coefficients, and mel filter bank matrix. The mtmd_audio_preprocessor_whisper class supports configurable sample rates, FFT sizes, hop lengths, and mel bin counts. The file includes both a naive DFT and a Cooley-Tukey FFT implementation.
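
As an illustration of the window coefficients the cache stores, the following is a minimal sketch of a Hann window generator with a periodic flag, mirroring the parameters of fill_hann_window(int length, bool periodic). The helper name make_hann_window and its handling of edge cases are assumptions made for this example, not the code in mtmd-audio.cpp.

```cpp
#include <cmath>
#include <vector>

// Illustrative helper (hypothetical, not the Ollama implementation).
// periodic = true  divides by `length`     (the usual choice for STFT framing),
// periodic = false divides by `length - 1` (symmetric window).
std::vector<float> make_hann_window(int length, bool periodic) {
    std::vector<float> w(length > 0 ? length : 0);
    if (length <= 0) {
        return w;
    }
    if (length == 1) {
        w[0] = 1.0f;  // degenerate case: a single-sample window
        return w;
    }
    const double pi    = 3.14159265358979323846;
    const double denom = periodic ? length : length - 1;
    for (int i = 0; i < length; ++i) {
        w[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / denom)));
    }
    return w;
}
```

Each STFT frame would be multiplied element-wise by such a window before the transform. Computing the window once and reusing it for every frame is the motivation for caching it.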

Usage

Called by the mtmd library when processing audio inputs to convert raw PCM float samples into mel spectrogram tensors before feeding them to the Whisper encoder.
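
The chunking and padding step can be pictured with the sketch below, which splits a mono PCM buffer into fixed-size chunks and zero-pads the last one. The function name split_into_padded_chunks is hypothetical, and chunking at the sample level is an assumption made for illustration; the actual code may chunk at a different stage of the pipeline. For reference, Whisper models operate on 30-second windows, which is 480000 samples at 16 kHz.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: splits mono PCM into chunks of exactly chunk_size
// samples, zero-padding the final chunk when the input does not divide evenly.
std::vector<std::vector<float>> split_into_padded_chunks(
        const float * samples, std::size_t n_samples, std::size_t chunk_size) {
    std::vector<std::vector<float>> chunks;
    if (chunk_size == 0) {
        return chunks;
    }
    for (std::size_t off = 0; off < n_samples; off += chunk_size) {
        const std::size_t n = std::min(chunk_size, n_samples - off);
        std::vector<float> chunk(chunk_size, 0.0f);  // zero-filled, so the tail is padding
        std::copy(samples + off, samples + off + n, chunk.begin());
        chunks.push_back(std::move(chunk));
    }
    return chunks;
}
```

Keeping every chunk the same length means each one yields the same number of spectrogram frames, which is what a fixed-context encoder expects.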

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/mtmd-audio.cpp
Lines: 1-537

Signature

struct mtmd_audio_mel_filters {
    int32_t n_mel;
    int32_t n_fft;
    std::vector<float> data;
};

static struct mtmd_audio_global_cache {
    std::vector<float> sin_vals;
    std::vector<float> cos_vals;
    std::vector<float> hann_window;
    mtmd_audio_mel_filters filters;

    void fill_sin_cos_table(int n);
    void fill_hann_window(int length, bool periodic);
    void fill_mel_filterbank_matrix(int n_mel, int n_fft, int sample_rate,
        float fmin, float fmax, bool slaney_area_norm, float scale);
} g_cache;

static void dft(const float * in, int N, float * out);
static void fft(float * in, int N, float * out);
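
To show how a naive DFT and a Cooley-Tukey FFT can coexist, here is a minimal sketch. The names dft_sketch and fft_sketch, the std::vector interface, and the interleaved (re, im) output layout are assumptions chosen for this example; they are not the signatures listed above. A radix-2 recursion halves the input until it reaches an odd length, so a typical Whisper FFT size of 400 (which is 16 x 25) needs a fallback for length 25. This is one plausible reason for keeping both routines.

```cpp
#include <cmath>
#include <vector>

// Naive O(N^2) DFT of a real signal; returns interleaved (re, im) pairs.
std::vector<float> dft_sketch(const std::vector<float> & in) {
    const int N = static_cast<int>(in.size());
    const double pi = 3.14159265358979323846;
    std::vector<float> out(2 * N);
    for (int k = 0; k < N; ++k) {
        double re = 0.0, im = 0.0;
        for (int n = 0; n < N; ++n) {
            const double angle = 2.0 * pi * k * n / N;
            re += in[n] * std::cos(angle);
            im -= in[n] * std::sin(angle);
        }
        out[2 * k]     = static_cast<float>(re);
        out[2 * k + 1] = static_cast<float>(im);
    }
    return out;
}

// Recursive radix-2 Cooley-Tukey FFT; falls back to the DFT for odd lengths.
std::vector<float> fft_sketch(const std::vector<float> & in) {
    const int N = static_cast<int>(in.size());
    if (N == 0) return {};
    if (N == 1) return { in[0], 0.0f };
    if (N % 2 == 1) return dft_sketch(in);

    const int half = N / 2;
    std::vector<float> even(half), odd(half);
    for (int i = 0; i < half; ++i) {
        even[i] = in[2 * i];
        odd[i]  = in[2 * i + 1];
    }
    const std::vector<float> E = fft_sketch(even);
    const std::vector<float> O = fft_sketch(odd);

    const double pi = 3.14159265358979323846;
    std::vector<float> out(2 * N);
    for (int k = 0; k < half; ++k) {
        // Twiddle factor W = exp(-2*pi*i*k/N)
        const double angle = 2.0 * pi * k / N;
        const float wr = static_cast<float>(std::cos(angle));
        const float wi = static_cast<float>(-std::sin(angle));
        const float tr = wr * O[2 * k]     - wi * O[2 * k + 1];
        const float ti = wr * O[2 * k + 1] + wi * O[2 * k];
        out[2 * k]              = E[2 * k]     + tr;
        out[2 * k + 1]          = E[2 * k + 1] + ti;
        out[2 * (k + half)]     = E[2 * k]     - tr;
        out[2 * (k + half) + 1] = E[2 * k + 1] - ti;
    }
    return out;
}
```

A production version would typically read sin/cos values from a precomputed table rather than calling std::cos and std::sin in the inner loops, which is the role the cached sin_vals and cos_vals appear to play.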

Import

#include "mtmd-audio.h"

I/O Contract

Inputs

Name	Type	Required	Description
audio_samples	float *	Yes	Raw PCM audio samples (mono, float32)
n_samples	size_t	Yes	Number of audio samples
sample_rate	int	Yes	Audio sample rate in Hz (e.g., 16000)
n_mel_bins	int	Yes	Number of mel frequency bins
n_fft	int	Yes	FFT window size

Outputs

Name	Type	Description
mel_spectrogram	clip_image_f32	Mel spectrogram tensor [n_mel x n_frames]
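
The values in the output are log-scaled and normalized. The sketch below follows the normalization used by the reference Whisper preprocessing (log10 with a floor of 1e-10, clamping to within 8 of the maximum, then (x + 4) / 4). Whether mtmd-audio.cpp uses exactly these constants is an assumption, and log_mel_normalize is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Whisper-style log-mel normalization, applied in place to a flat buffer of
// non-negative mel energies. The operation is element-wise apart from the
// global maximum, so it does not depend on the buffer's memory layout.
void log_mel_normalize(std::vector<float> & mel) {
    if (mel.empty()) {
        return;
    }
    float max_val = -1e20f;
    for (float & v : mel) {
        v = std::log10(std::max(v, 1e-10f));  // floor avoids log10(0)
        max_val = std::max(max_val, v);
    }
    const float floor_val = max_val - 8.0f;   // limit dynamic range to 8 decades
    for (float & v : mel) {
        v = (std::max(v, floor_val) + 4.0f) / 4.0f;
    }
}
```

For example, energies of 1, 100, and 0 have log10 values of 0, 2, and -10. The last is clamped to 2 - 8 = -6, giving normalized values of 1.0, 1.5, and -0.5.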

Usage Examples

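// Audio preprocessing is triggered internally by mtmd_tokenize
// when an audio bitmap is encountered.
// ctx (mtmd_context *) and text (const mtmd_input_text *) are assumed
// to have been set up by the caller.
mtmd_bitmap * audio = mtmd_bitmap_init_from_audio(n_samples, pcm_data);
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
// mtmd_tokenize expects an array of const bitmap pointers
const mtmd_bitmap * bitmaps[] = { audio };
int32_t res = mtmd_tokenize(ctx, chunks, text, bitmaps, 1);
// On success (res == 0) the audio has been converted to mel spectrograms internally.
// Release both objects with mtmd_input_chunks_free / mtmd_bitmap_free when done.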

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
