Implementation:Ollama Ollama Mtmd Whisper Encoder

Knowledge Sources	Ollama
Domains	Multimodal, AudioEncoder
Last Updated	2025-02-15 00:00 GMT

Overview

Graph builder for the Whisper audio encoder in the multimodal (mtmd) code path. It supports the Ultravox, Qwen2-Audio, Voxtral, and GLM-Audio projectors, which feed speech input to multimodal models.

Description

Implements clip_graph_whisper_enc::build(), which constructs a ggml computation graph for the Whisper encoder. The graph passes the audio mel spectrogram through two 1D convolution layers (ggml_conv_1d_ph) with GELU-ERF activation, transposes the result, and adds learned position embeddings, taking a view of the embedding table so that it matches the number of frames. The sequence then runs through the standard ViT transformer layers. For Ultravox-style models, optional frame stacking (build_stack) reduces the sequence length before projection. Four projector types are supported: Ultravox (RMS norm + SwiGLU MLP), Qwen2-Audio (linear FC + bias), Voxtral (two-layer MLP with GELU-ERF), and GLM-Audio (LayerNorm + frame stacking + MLP with BOI/EOI tokens).
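
The frame-stacking step can be illustrated without ggml. The following is a minimal sketch, not the build_stack implementation: stack_frames is a hypothetical helper, the row-major layout is chosen for illustration, and zero padding of the trailing frames is an assumption. It shows how concatenating stack_factor consecutive frames along the feature dimension divides the sequence length by that factor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the Ollama sources.
// `frames` holds n_frames rows of n_embd floats in row-major order.
// Returns ceil(n_frames / stack_factor) rows of n_embd * stack_factor floats.
std::vector<float> stack_frames(const std::vector<float> & frames,
                                std::size_t n_embd,
                                std::size_t stack_factor) {
    assert(n_embd > 0 && stack_factor > 0);
    assert(frames.size() % n_embd == 0);

    // Width of one stacked row, and the buffer size rounded up to a whole
    // number of stacked rows.
    const std::size_t stride = n_embd * stack_factor;
    const std::size_t padded = (frames.size() + stride - 1) / stride * stride;

    // With a row-major layout, concatenating consecutive frames is only a
    // reinterpretation of the same buffer with a wider row. New storage is
    // needed only for the zero padding at the end (an assumption here).
    std::vector<float> out(frames);
    out.resize(padded, 0.0f);
    return out;
}
```

For example, three frames of width 2 stacked with a factor of 2 yield two rows of width 4, the second of which ends in two padding zeros.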

Usage

Automatically selected when the loaded CLIP model's projector type is PROJECTOR_TYPE_ULTRAVOX, PROJECTOR_TYPE_QWEN2A, PROJECTOR_TYPE_VOXTRAL, or PROJECTOR_TYPE_GLMA.

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/models/whisper-enc.cpp
Lines: 1-106

Signature

struct clip_graph_whisper_enc : public clip_graph {
    ggml_cgraph * build() override;
};

Import

#include "models.h"

I/O Contract

Inputs

Name	Type	Required	Description
model	clip_model &	Yes	Loaded Whisper CLIP model with conv1d and transformer weights
img	clip_image_f32 &	Yes	Mel spectrogram tensor (nx = n_frames, ny = n_mel)

Outputs

Name	Type	Description
(return value)	ggml_cgraph *	Computation graph producing audio embeddings for the LLM

Usage Examples

// Instantiated internally by clip.cpp for audio models
clip_graph_whisper_enc graph(ctx, mel_spectrogram);
ggml_cgraph * gf = graph.build();
// Output: [n_mmproj_embd, n_frames / (2 * stack_factor)] embeddings
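
The output length in the comment above can be checked with a small shape calculation. This is a sketch under stated assumptions rather than code from the repository: whisper_enc_n_tokens is a hypothetical helper, and the kernel size of 3, half padding, and strides of 1 and 2 for the two convolutions are assumptions that match the usual Whisper configuration but are not stated on this page. Rounding up to a whole stacked row is also assumed.

```cpp
#include <cassert>
#include <cstdint>

// Output length of a 1D convolution (standard formula).
int64_t conv1d_out_len(int64_t len, int64_t kernel, int64_t stride,
                       int64_t pad, int64_t dilation = 1) {
    assert(len > 0 && kernel > 0 && stride > 0 && dilation > 0 && pad >= 0);
    return (len + 2 * pad - dilation * (kernel - 1) - 1) / stride + 1;
}

// Hypothetical helper: number of embedding vectors produced for n_frames
// mel frames. Assumed (not taken from this page): kernel size 3, half
// padding (kernel / 2), stride 1 for the first convolution, stride 2 for
// the second. A stack_factor of 1 means no frame stacking.
int64_t whisper_enc_n_tokens(int64_t n_frames, int64_t stack_factor) {
    assert(stack_factor > 0);
    const int64_t kernel = 3;
    const int64_t pad    = kernel / 2;
    const int64_t after_conv1 = conv1d_out_len(n_frames,    kernel, 1, pad);
    const int64_t after_conv2 = conv1d_out_len(after_conv1, kernel, 2, pad);
    // Round up, assuming trailing frames are padded to a full stacked row.
    return (after_conv2 + stack_factor - 1) / stack_factor;
}
```

Under these assumptions, 3000 mel frames give 1500 positions after the convolutions, and 188 embeddings with a stack factor of 8.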

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
