Implementation:Ollama Ollama Mtmd Whisper Encoder
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, AudioEncoder |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Graph builder for the Whisper audio encoder used by multimodal models. It supports the Ultravox, Qwen2-Audio, Voxtral, and GLM-Audio projectors, which turn encoded speech into embeddings that the language model consumes.
Description
Implements clip_graph_whisper_enc::build(), which constructs a ggml computation graph for the Whisper encoder. The graph passes the audio mel spectrogram through two 1D convolution layers (ggml_conv_1d_ph), each followed by a GELU-ERF activation, and transposes the result. It then adds learned position embeddings, selected through a view so that they match the number of frames, and runs the sequence through a standard ViT transformer stack. For Ultravox-style models, an optional frame-stacking step (build_stack) reduces the sequence length.
The final projection depends on the projector type:
- Ultravox: RMS norm + SwiGLU MLP
- Qwen2-Audio: linear FC + bias
- Voxtral: two-layer MLP with GELU-ERF
- GLM-Audio: LayerNorm + frame stacking + MLP with BOI/EOI tokens
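The frame-stacking step can be pictured without ggml. The following is a minimal sketch, not the build_stack implementation: stack_frames is a hypothetical helper, and it assumes that frames are stored contiguously and that an incomplete final group is padded with zeros.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: stack_frames is a hypothetical helper, not the build_stack API.
// `frames` holds n_frames vectors of n_embd floats, stored one frame after another.
// The result holds ceil(n_frames / stack_factor) vectors of n_embd * stack_factor floats.
std::vector<float> stack_frames(const std::vector<float> & frames,
                                std::size_t n_embd,
                                std::size_t stack_factor) {
    assert(n_embd > 0 && stack_factor > 0);
    assert(frames.size() % n_embd == 0);

    // Width of one stacked frame.
    const std::size_t stride = n_embd * stack_factor;
    // Total length rounded up to a whole number of stacked frames.
    const std::size_t padded = (frames.size() + stride - 1) / stride * stride;

    // Because the frames are contiguous, stacking amounts to a copy that is
    // later read with a wider row length.
    std::vector<float> out(frames);
    out.resize(padded, 0.0f);  // assumption: the tail is padded with zeros
    return out;
}
```

With n_embd = 2 and stack_factor = 2, three input frames become two stacked frames of width 4, the second of which is half padding.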
Usage
Automatically selected when the loaded CLIP model's projector type is PROJECTOR_TYPE_ULTRAVOX, PROJECTOR_TYPE_QWEN2A, PROJECTOR_TYPE_VOXTRAL, or PROJECTOR_TYPE_GLMA.
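As a rough illustration of that selection rule, the check below mirrors the four projector names in a local enum. Everything here is hypothetical: the enum, its SKETCH_ prefix, and uses_whisper_encoder are stand-ins for illustration, and the actual dispatch code in clip.cpp may be organized differently.

```cpp
#include <cassert>

// Hypothetical local mirror of the projector types named above.
// The real enum lives in the clip sources and contains other entries as well.
enum projector_type_sketch {
    SKETCH_PROJECTOR_TYPE_ULTRAVOX,
    SKETCH_PROJECTOR_TYPE_QWEN2A,
    SKETCH_PROJECTOR_TYPE_VOXTRAL,
    SKETCH_PROJECTOR_TYPE_GLMA,
    SKETCH_PROJECTOR_TYPE_OTHER,  // placeholder for every non-audio projector
};

// Returns true when the projector type would route to the Whisper encoder graph.
inline bool uses_whisper_encoder(projector_type_sketch type) {
    switch (type) {
        case SKETCH_PROJECTOR_TYPE_ULTRAVOX:
        case SKETCH_PROJECTOR_TYPE_QWEN2A:
        case SKETCH_PROJECTOR_TYPE_VOXTRAL:
        case SKETCH_PROJECTOR_TYPE_GLMA:
            return true;
        default:
            return false;
    }
}
```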
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/models/whisper-enc.cpp
- Lines: 1-106
Signature
struct clip_graph_whisper_enc : public clip_graph {
    ggml_cgraph * build() override;
};
Import
#include "models.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | clip_model & | Yes | Loaded Whisper CLIP model with conv1d and transformer weights |
| img | clip_image_f32 & | Yes | Mel spectrogram tensor (nx = n_frames, ny = n_mel) |
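To make the nx/ny convention concrete, here is a small sketch of how one spectrogram value could be addressed. The struct is a hypothetical stand-in that carries only the fields named in the table, and the row-major layout (ny rows of nx floats, one row per mel bin) is an assumption rather than something this page confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the input image struct, for illustration only.
struct mel_input_sketch {
    int nx;                  // number of frames (time axis)
    int ny;                  // number of mel bins
    std::vector<float> buf;  // assumed layout: ny rows of nx floats each
};

// Reads the value of one mel bin at one frame under the assumed layout.
inline float mel_at(const mel_input_sketch & m, int mel_bin, int frame) {
    assert(mel_bin >= 0 && mel_bin < m.ny);
    assert(frame >= 0 && frame < m.nx);
    return m.buf[static_cast<std::size_t>(mel_bin) * m.nx + frame];
}
```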
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | ggml_cgraph * | Pointer to the computation graph that produces audio embeddings for the LLM |
Usage Examples
// Instantiated internally by clip.cpp for audio models
clip_graph_whisper_enc graph(ctx, mel_spectrogram);
ggml_cgraph * gf = graph.build();
// Output: [n_mmproj_embd, n_frames / (2 * stack_factor)] embeddings
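The sequence length in the output shape can be worked out by hand. The helper below is a sketch with a hypothetical name; it assumes an even n_frames, that the second convolution halves the frame count, and that frame stacking rounds up because an incomplete final group is padded.

```cpp
#include <cassert>

// Hypothetical helper: number of output embeddings for a spectrogram of n_frames.
// Pass stack_factor = 1 for models that do not stack frames.
inline int whisper_enc_n_tokens(int n_frames, int stack_factor) {
    assert(n_frames > 0 && stack_factor > 0);
    // The convolution front end halves the number of frames.
    const int n_pos = n_frames / 2;
    // Assumption: stacking pads the tail, so the division rounds up.
    return (n_pos + stack_factor - 1) / stack_factor;
}
```

Under these assumptions, 3000 input frames give 1500 embeddings without stacking and 188 with a stack factor of 8.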