Implementation:Ollama Ollama Mtmd Whisper Encoder

From Leeroopedia
Knowledge Sources
Domains: Multimodal, AudioEncoder
Last Updated: 2025-02-15 00:00 GMT

Overview

Graph builder for the Whisper audio encoder used by multimodal models. It supports the Ultravox, Qwen2-Audio, Voxtral, and GLM-Audio projectors, which turn encoded speech into embeddings the language model can consume.

Description

Implements clip_graph_whisper_enc::build(), which constructs a ggml computation graph for the Whisper encoder. The graph processes the audio mel spectrogram in the following order:

  • Two 1D convolution layers (ggml_conv_1d_ph) with GELU-ERF activation
  • A transpose of the convolution output
  • Addition of learned position embeddings, taken as a view sized to match the number of frames
  • A standard ViT transformer stack
  • Optional frame stacking (build_stack) for Ultravox-style models, which reduces the sequence length

The result is passed to one of the supported projector types:

  • Ultravox: RMS norm + SwiGLU MLP
  • Qwen2-Audio: linear FC + bias
  • Voxtral: two-layer MLP with GELU-ERF
  • GLM-Audio: LayerNorm + frame stacking + MLP with BOI/EOI tokens
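The effect of frame stacking on sequence length can be illustrated with a minimal standalone sketch. It assumes that stacking concatenates stack_factor consecutive frame embeddings into one wider row and zero-pads the tail when the frame count is not a multiple of the factor. StackedFrames and stack_frames are hypothetical names used only for illustration; they are not part of ggml or the Ollama source, and the real build_stack operates on ggml tensors rather than std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for illustration only (not a ggml type).
struct StackedFrames {
    std::vector<float> data;  // row-major, n_rows x row_width
    std::size_t n_rows;       // sequence length after stacking
    std::size_t row_width;    // n_embd * stack_factor
};

// Hypothetical helper: groups `stack_factor` consecutive frames into one row.
// `frames` is row-major with n_embd floats per frame.
// Assumption: a partial last group is zero-padded to a full row.
StackedFrames stack_frames(const std::vector<float> & frames,
                           std::size_t n_embd,
                           std::size_t stack_factor) {
    assert(n_embd > 0 && stack_factor > 0);
    assert(frames.size() % n_embd == 0);

    const std::size_t row_width = n_embd * stack_factor;
    // Round up so that a partial last group still yields a row.
    const std::size_t n_rows = (frames.size() + row_width - 1) / row_width;

    StackedFrames out{frames, n_rows, row_width};
    out.data.resize(n_rows * row_width, 0.0f);  // append zero padding if needed
    return out;
}
```

With 3 frames of width 2 and a stack factor of 2, this sketch yields 2 rows of width 4, the second of which ends in two padding zeros. The sequence length seen by the projector shrinks by roughly the stack factor while each row grows by the same factor.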

Usage

Automatically selected when the projector type of the loaded CLIP model is PROJECTOR_TYPE_ULTRAVOX, PROJECTOR_TYPE_QWEN2A, PROJECTOR_TYPE_VOXTRAL, or PROJECTOR_TYPE_GLMA.
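The selection rule can be sketched as a simple predicate over the projector type. The enum below is a local stand-in that reuses the four constant names listed above; the real enum in the mtmd sources has more members, and both PROJECTOR_TYPE_OTHER_EXAMPLE and uses_whisper_encoder are hypothetical names introduced here for illustration.

```cpp
// Stand-in enum for illustration only; the real projector-type enum
// is defined in the mtmd sources and contains additional members.
enum projector_type {
    PROJECTOR_TYPE_ULTRAVOX,
    PROJECTOR_TYPE_QWEN2A,
    PROJECTOR_TYPE_VOXTRAL,
    PROJECTOR_TYPE_GLMA,
    PROJECTOR_TYPE_OTHER_EXAMPLE,  // hypothetical placeholder for any non-Whisper projector
};

// Hypothetical helper: true when the Whisper encoder graph would be chosen.
constexpr bool uses_whisper_encoder(projector_type type) {
    switch (type) {
        case PROJECTOR_TYPE_ULTRAVOX:
        case PROJECTOR_TYPE_QWEN2A:
        case PROJECTOR_TYPE_VOXTRAL:
        case PROJECTOR_TYPE_GLMA:
            return true;
        default:
            return false;
    }
}
```

How the actual dispatch is written in clip.cpp is not shown on this page; the sketch only captures which projector types map to this graph builder.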

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/tools/mtmd/models/whisper-enc.cpp
  • Lines: 1-106

Signature

struct clip_graph_whisper_enc : public clip_graph {
    ggml_cgraph * build() override;
};

Import

#include "models.h"

I/O Contract

Inputs

Name | Type | Required | Description
model | clip_model & | Yes | Loaded Whisper CLIP model with conv1d and transformer weights
img | clip_image_f32 & | Yes | Mel spectrogram tensor (nx = n_frames, ny = n_mel)

Outputs

Name | Type | Description
ggml_cgraph * | pointer | Computation graph producing audio embeddings for the LLM

Usage Examples

// Instantiated internally by clip.cpp for audio models
clip_graph_whisper_enc graph(ctx, mel_spectrogram);
ggml_cgraph * gf = graph.build();
// Output: [n_mmproj_embd, n_frames / (2 * stack_factor)] embeddings
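The token count in the output comment above can be worked through with the standard 1D convolution length formula. This is a sketch under stated assumptions: a kernel width of 3 with half padding for both convolutions, stride 1 for the first and stride 2 for the second (the usual Whisper layout, not confirmed on this page), and rounding up when stacking. conv1d_out_len and whisper_enc_n_tokens are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Standard 1D convolution output length, assuming dilation 1.
std::int64_t conv1d_out_len(std::int64_t n_in, std::int64_t kernel,
                            std::int64_t stride, std::int64_t padding) {
    assert(kernel > 0 && stride > 0 && padding >= 0);
    return (n_in + 2 * padding - kernel) / stride + 1;
}

// Hypothetical helper: number of output embeddings for n_frames mel frames.
std::int64_t whisper_enc_n_tokens(std::int64_t n_frames, std::int64_t stack_factor) {
    assert(n_frames > 0 && stack_factor > 0);
    const std::int64_t kernel  = 3;           // assumption: Whisper-style kernel width
    const std::int64_t padding = kernel / 2;  // "half padding"

    std::int64_t n = conv1d_out_len(n_frames, kernel, 1, padding);  // assumed stride 1
    n = conv1d_out_len(n, kernel, 2, padding);                      // assumed stride 2

    // Assumption: stacking rounds up, padding a partial last group.
    return (n + stack_factor - 1) / stack_factor;
}
```

Under these assumptions, a 3000-frame spectrogram yields 1500 embeddings without stacking (stack_factor = 1) and 188 with an illustrative stack_factor of 8, which matches n_frames / (2 * stack_factor) up to rounding.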
