Implementation:Ollama Ollama Mtmd Audio
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, AudioProcessing |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Audio preprocessing implementation that converts raw audio samples into mel spectrograms for the Whisper encoder, ported from whisper.cpp.
Description
Implements the Whisper-style audio preprocessing pipeline: an FFT-based short-time Fourier transform (STFT), mel filter bank application, chunking of long inputs, padding to the required length, and log-mel normalization. A global cache (mtmd_audio_global_cache) holds the precomputed sin/cos tables, Hann window coefficients, and mel filter bank matrix so they are not recomputed for every input. The mtmd_audio_preprocessor_whisper class supports configurable sample rates, FFT sizes, hop lengths, and mel bin counts. The file includes both a naive DFT and a Cooley-Tukey FFT implementation.
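The last two stages of the pipeline described above (mel filter bank application and log-mel normalization) can be sketched as follows. This is a minimal illustration, not the code in mtmd-audio.cpp: the function names are hypothetical, the filter matrix is assumed to be stored row-major as n_mel rows of n_bins weights, and the constants (the 1e-10 floor, the 8.0 dynamic range, and the (x + 4) / 4 rescaling) are assumed to follow the reference Whisper preprocessing.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: project one power-spectrum frame onto the mel bands.
// filters is assumed row-major: n_mel rows, each with n_bins weights.
std::vector<float> apply_mel_filters(const std::vector<float> & filters,
                                     int n_mel, int n_bins,
                                     const std::vector<float> & power) {
    assert(filters.size() == (std::size_t) n_mel * n_bins);
    assert(power.size() == (std::size_t) n_bins);
    std::vector<float> mel(n_mel, 0.0f);
    for (int j = 0; j < n_mel; j++) {
        float sum = 0.0f;
        for (int k = 0; k < n_bins; k++) {
            sum += filters[(std::size_t) j * n_bins + k] * power[k];
        }
        mel[j] = sum;
    }
    return mel;
}

// Hypothetical helper: Whisper-style log-mel normalization, in place.
void log_mel_normalize(std::vector<float> & mel) {
    assert(!mel.empty());
    // 1. log10 with a floor so that silence does not produce -inf
    float max_val = -1e20f;
    for (float & v : mel) {
        v = std::log10(std::max(v, 1e-10f));
        max_val = std::max(max_val, v);
    }
    // 2. clamp everything to at most 8 log10 units below the peak
    const float floor_val = max_val - 8.0f;
    // 3. shift and scale into a compact range
    for (float & v : mel) {
        v = (std::max(v, floor_val) + 4.0f) / 4.0f;
    }
}
```

In a full pipeline the normalization would typically run once over the whole spectrogram rather than per frame, so that the peak used for clamping is global to the chunk.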
Usage
Called by the mtmd library when processing audio inputs to convert raw PCM float samples into mel spectrogram tensors before feeding them to the Whisper encoder.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/mtmd-audio.cpp
- Lines: 1-537
Signature
struct mtmd_audio_mel_filters {
    int32_t n_mel;
    int32_t n_fft;
    std::vector<float> data;
};

static struct mtmd_audio_global_cache {
    std::vector<float> sin_vals;
    std::vector<float> cos_vals;
    std::vector<float> hann_window;
    mtmd_audio_mel_filters filters;

    void fill_sin_cos_table(int n);
    void fill_hann_window(int length, bool periodic);
    void fill_mel_filterbank_matrix(int n_mel, int n_fft, int sample_rate,
                                    float fmin, float fmax, bool slaney_area_norm, float scale);
} g_cache;

static void dft(const float * in, int N, float * out);
static void fft(float * in, int N, float * out);
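To illustrate how the Hann window, the naive DFT, and the Cooley-Tukey FFT named in the signatures relate, here is a self-contained sketch. It is not the code from mtmd-audio.cpp: the function names and the std::vector interface are hypothetical, the trigonometric values are computed directly instead of being read from a cached table, and the fallback to the naive DFT for odd lengths is an assumption modeled on how whisper.cpp handles sizes that are not powers of two (such as 400).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

static const float PI_F = 3.14159265358979323846f;

// Hann window. periodic = true divides by length (usual for STFT),
// periodic = false divides by length - 1 (symmetric window).
std::vector<float> hann_window(int length, bool periodic) {
    assert(length > 1);
    const int denom = periodic ? length : length - 1;
    std::vector<float> w(length);
    for (int i = 0; i < length; i++) {
        w[i] = 0.5f * (1.0f - std::cos(2.0f * PI_F * i / denom));
    }
    return w;
}

// Naive O(N^2) DFT of a real signal. out holds N interleaved (re, im) pairs.
void dft_naive(const std::vector<float> & in, std::vector<float> & out) {
    const int N = (int) in.size();
    out.assign(2 * N, 0.0f);
    for (int k = 0; k < N; k++) {
        float re = 0.0f, im = 0.0f;
        for (int n = 0; n < N; n++) {
            const float angle = 2.0f * PI_F * k * n / N;
            re += in[n] * std::cos(angle);
            im -= in[n] * std::sin(angle);
        }
        out[2 * k]     = re;
        out[2 * k + 1] = im;
    }
}

// Recursive radix-2 Cooley-Tukey FFT. Odd lengths fall back to dft_naive
// (assumption: mirrors the whisper.cpp approach).
void fft_sketch(const std::vector<float> & in, std::vector<float> & out) {
    const int N = (int) in.size();
    assert(N > 0);
    if (N == 1) { out.assign(2, 0.0f); out[0] = in[0]; return; }
    if (N % 2 == 1) { dft_naive(in, out); return; }

    const int half = N / 2;
    std::vector<float> even(half), odd(half), E, O;
    for (int i = 0; i < half; i++) {
        even[i] = in[2 * i];
        odd[i]  = in[2 * i + 1];
    }
    fft_sketch(even, E);
    fft_sketch(odd, O);

    out.assign(2 * N, 0.0f);
    for (int k = 0; k < half; k++) {
        // twiddle factor e^{-2*pi*i*k/N} = c + i*s
        const float angle = 2.0f * PI_F * k / N;
        const float c = std::cos(angle);
        const float s = -std::sin(angle);
        const float tr = c * O[2 * k] - s * O[2 * k + 1];
        const float ti = c * O[2 * k + 1] + s * O[2 * k];
        out[2 * k]              = E[2 * k] + tr;
        out[2 * k + 1]          = E[2 * k + 1] + ti;
        out[2 * (k + half)]     = E[2 * k] - tr;
        out[2 * (k + half) + 1] = E[2 * k + 1] - ti;
    }
}
```

Both transforms should agree up to floating-point error, which makes the naive DFT a convenient correctness check for the FFT as well as its base case for odd lengths.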
Import
#include "mtmd-audio.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| audio_samples | float * | Yes | Raw PCM audio samples (mono, float32) |
| n_samples | size_t | Yes | Number of audio samples |
| sample_rate | int | Yes | Audio sample rate in Hz (e.g., 16000) |
| n_mel_bins | int | Yes | Number of mel frequency bins |
| n_fft | int | Yes | FFT window size |
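The inputs above determine the shape of the output through simple arithmetic. The sketch below shows that arithmetic under stated assumptions: the helper names are hypothetical, one frame is assumed to be emitted per hop over a padded chunk, and the example values in the usage note (16 kHz, n_fft = 400, hop length 160, 30-second chunks) are the common Whisper configuration rather than values stated on this page.

```cpp
#include <cassert>
#include <cstddef>

// Number of fixed-size chunks needed to cover n_samples (the last one is padded).
std::size_t count_chunks(std::size_t n_samples, std::size_t chunk_samples) {
    assert(chunk_samples > 0);
    return (n_samples + chunk_samples - 1) / chunk_samples;
}

// Zero samples to append so the input fills a whole number of chunks.
std::size_t padding_needed(std::size_t n_samples, std::size_t chunk_samples) {
    return count_chunks(n_samples, chunk_samples) * chunk_samples - n_samples;
}

// Frames produced from one padded chunk when one frame is emitted per hop.
std::size_t frames_per_chunk(std::size_t chunk_samples, std::size_t hop_length) {
    assert(hop_length > 0);
    return chunk_samples / hop_length;
}

// Non-redundant frequency bins of a real-input FFT of size n_fft.
std::size_t freq_bins(std::size_t n_fft) {
    return n_fft / 2 + 1;
}
```

With the assumed Whisper values, a 45-second clip at 16 kHz (720000 samples) needs two 480000-sample chunks and 240000 samples of padding, each chunk yields 3000 frames, and each frame has 201 frequency bins before the mel projection.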
Outputs
| Name | Type | Description |
|---|---|---|
| mel_spectrogram | clip_image_f32 | Mel spectrogram tensor [n_mel x n_frames] |
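For readers consuming an [n_mel x n_frames] buffer, the sketch below shows one plausible way to index it. The mel_buffer type is hypothetical and is not the type used by mtmd, and the row-major layout with the mel bin as the slow axis is an assumption based on the whisper.cpp convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for an [n_mel x n_frames] spectrogram.
struct mel_buffer {
    int n_mel;
    int n_frames;
    std::vector<float> data; // n_mel * n_frames values, mel bin is the slow axis

    mel_buffer(int n_mel_, int n_frames_)
        : n_mel(n_mel_), n_frames(n_frames_),
          data((std::size_t) n_mel_ * n_frames_, 0.0f) {}

    // Access the value for a given mel bin and time frame.
    float & at(int mel, int frame) {
        assert(mel >= 0 && mel < n_mel);
        assert(frame >= 0 && frame < n_frames);
        return data[(std::size_t) mel * n_frames + frame];
    }
};
```

Under this layout, all frames of one mel bin are contiguous in memory, so iterating over time for a fixed bin is the cache-friendly direction.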
Usage Examples
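// Audio preprocessing is triggered internally by mtmd_tokenize
// when an audio bitmap is encountered.
// Assumes ctx is an initialized mtmd_context * and text is a const mtmd_input_text *.
mtmd_bitmap * audio = mtmd_bitmap_init_from_audio(n_samples, pcm_data);
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
// mtmd_tokenize takes an array of const bitmap pointers
const mtmd_bitmap * bitmaps[] = { audio };
int32_t res = mtmd_tokenize(ctx, chunks, text, bitmaps, 1);
// res == 0 on success; the audio has been converted to mel spectrograms internally
mtmd_input_chunks_free(chunks);
mtmd_bitmap_free(audio);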