Principle:Ggml org Llama cpp Multimodal Input Preparation

Aspect	Detail
Principle Name	Multimodal Input Preparation
Domain	Multimodal Inference
Scope	Preprocessing multimodal inputs: image decoding and RGB conversion, audio decoding, resampling, and mono downmix
Related Workflow	Multimodal_Inference

Overview

Description

Before multimodal data can be fed into the encoding pipeline, raw media files must be preprocessed into a standardized internal representation called a bitmap (mtmd_bitmap). For images, this involves decoding the file format (JPEG, PNG, BMP, GIF), extracting the pixel data, and storing it as a raw RGB byte array. For audio, this involves decoding the file format (WAV, MP3, FLAC), resampling to the model's expected sample rate, and converting to mono 32-bit float PCM.

Usage

Input preparation is performed after the multimodal context is initialized and before tokenization. Each media file (image or audio) becomes a single mtmd_bitmap object that is subsequently passed to the tokenization function. The bitmap abstraction provides a unified interface for both vision and audio inputs, despite their fundamentally different data layouts.

Theoretical Basis

Multimodal models require inputs in specific formats that match their training data distribution. The preprocessing pipeline must bridge the gap between arbitrary user-provided media files and the model's expected input format.

Image Preprocessing:

Images arrive in various compressed formats with different color spaces, bit depths, and resolutions. The preprocessing pipeline must:

Decode the compressed format into raw pixel data using a library such as stb_image
Convert to RGB: Regardless of source format (RGBA, grayscale, indexed color), convert to 3-channel RGB
Store as contiguous byte array: The bitmap stores pixels in RGBRGBRGB... order, one byte per channel, for a total size of nx * ny * 3 bytes (nx is the width and ny the height in pixels)
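The conversion and storage steps above can be sketched as follows. This is a minimal illustration, not the mtmd implementation: to_rgb is a hypothetical helper, and it assumes the decoder has already produced interleaved 8-bit pixels with 1 to 4 channels. In practice stb_image can perform the same conversion itself when the caller requests 3 channels through the desired-channels argument of stbi_load_from_memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of mtmd): repack decoded pixels into the
// interleaved RGBRGB... layout, nx * ny * 3 bytes in total.
// `channels` describes `src`: 1 = gray, 2 = gray + alpha, 3 = RGB, 4 = RGBA.
// Alpha is dropped; grayscale is replicated into all three channels.
std::vector<uint8_t> to_rgb(const std::vector<uint8_t> & src,
                            int nx, int ny, int channels) {
    assert(nx >= 0 && ny >= 0);
    assert(channels >= 1 && channels <= 4);
    const size_t n_pixels = static_cast<size_t>(nx) * static_cast<size_t>(ny);
    assert(src.size() == n_pixels * static_cast<size_t>(channels));

    std::vector<uint8_t> rgb(n_pixels * 3);
    for (size_t i = 0; i < n_pixels; ++i) {
        const uint8_t * p = src.data() + i * channels;
        if (channels <= 2) {
            // grayscale source: copy the luminance value into R, G and B
            rgb[i * 3 + 0] = p[0];
            rgb[i * 3 + 1] = p[0];
            rgb[i * 3 + 2] = p[0];
        } else {
            // RGB or RGBA source: keep the first three channels
            rgb[i * 3 + 0] = p[0];
            rgb[i * 3 + 1] = p[1];
            rgb[i * 3 + 2] = p[2];
        }
    }
    return rgb;
}
```

The returned buffer has exactly the size and ordering the bitmap expects, so it could be handed to bitmap construction together with nx and ny.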

The actual resizing, normalization, and patch extraction are handled later by the vision encoder during the encoding step, not during bitmap construction. This separation allows the same bitmap to be reused across different encoding configurations.

Audio Preprocessing:

Audio files require different preprocessing:

Decode the compressed format (WAV, MP3, FLAC) using a library such as miniaudio
Resample to the model's expected sample rate (e.g., 16000 Hz for Whisper-based models)
Convert to mono: Stereo or multi-channel audio is mixed down to a single channel
Store as float32 PCM: Audio samples are stored as 32-bit floating-point values

The target sample rate is queried from the multimodal context via mtmd_get_audio_bitrate(). Despite its name, this function reports a sample rate in Hz (for example 16000), not a bit rate.
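The downmix and resampling steps can be illustrated with a short sketch. Both functions below are hypothetical and are not the miniaudio or mtmd API; the resampler uses plain linear interpolation with no anti-aliasing filter, which is simpler than what a production decoder would typically use. The input is assumed to be interleaved float samples.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: average interleaved frames (L R L R ...) to one channel.
std::vector<float> downmix_to_mono(const std::vector<float> & interleaved,
                                   int n_channels) {
    assert(n_channels > 0);
    assert(interleaved.size() % static_cast<size_t>(n_channels) == 0);
    const size_t n_frames = interleaved.size() / n_channels;
    std::vector<float> mono(n_frames);
    for (size_t f = 0; f < n_frames; ++f) {
        float sum = 0.0f;
        for (int c = 0; c < n_channels; ++c) {
            sum += interleaved[f * n_channels + c];
        }
        mono[f] = sum / static_cast<float>(n_channels);
    }
    return mono;
}

// Hypothetical helper: naive linear-interpolation resampler (illustration only).
std::vector<float> resample_linear(const std::vector<float> & in,
                                   int rate_in, int rate_out) {
    assert(rate_in > 0 && rate_out > 0);
    if (in.empty()) {
        return {};
    }
    const size_t n_out = static_cast<size_t>(
        static_cast<double>(in.size()) * rate_out / rate_in);
    std::vector<float> out(n_out);
    for (size_t i = 0; i < n_out; ++i) {
        // position of output sample i on the input time axis
        const double pos = static_cast<double>(i) * rate_in / rate_out;
        const size_t i0  = static_cast<size_t>(pos);
        const size_t i1  = (i0 + 1 < in.size()) ? i0 + 1 : i0;
        const float  t   = static_cast<float>(pos - static_cast<double>(i0));
        out[i] = in[i0] * (1.0f - t) + in[i1] * t;
    }
    return out;
}
```

A caller would downmix first and then resample to the rate reported by the context, for example resample_linear(downmix_to_mono(samples, 2), 48000, 16000).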

Format Detection:

The mtmd_helper_bitmap_init_from_buf() function automatically detects the input format by examining magic bytes at the start of the buffer:

WAV: Starts with RIFF followed by WAVE at offset 8
MP3: Starts with ID3, or with an MPEG frame sync word (first byte 0xFF, followed by a byte whose top three bits are set, i.e. byte & 0xE0 == 0xE0)
FLAC: Starts with fLaC
Otherwise: Treated as an image and passed to stb_image for decoding
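The checks listed above amount to a few byte comparisons. The sketch below shows one way to write them; sniff_media and media_kind are hypothetical names used for illustration and are not part of the mtmd API, and the real helper may order or bound its checks differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical result type for this sketch.
enum class media_kind { audio_wav, audio_mp3, audio_flac, image };

// Hypothetical classifier: inspect the first bytes of a buffer and decide
// whether it looks like WAV, MP3 or FLAC; anything else is assumed to be
// an image and left for the image decoder to accept or reject.
media_kind sniff_media(const unsigned char * buf, size_t len) {
    assert(buf != nullptr || len == 0);

    // WAV: "RIFF" at offset 0 and "WAVE" at offset 8
    if (len >= 12 && std::memcmp(buf, "RIFF", 4) == 0 &&
                     std::memcmp(buf + 8, "WAVE", 4) == 0) {
        return media_kind::audio_wav;
    }
    // MP3 with an ID3 tag at the start
    if (len >= 3 && std::memcmp(buf, "ID3", 3) == 0) {
        return media_kind::audio_mp3;
    }
    // MP3 without a tag: 11-bit frame sync, 0xFF then top three bits set
    if (len >= 2 && buf[0] == 0xFF && (buf[1] & 0xE0) == 0xE0) {
        return media_kind::audio_mp3;
    }
    // FLAC stream marker
    if (len >= 4 && std::memcmp(buf, "fLaC", 4) == 0) {
        return media_kind::audio_flac;
    }
    return media_kind::image;
}
```

Note that a JPEG file begins with 0xFF 0xD8, and 0xD8 & 0xE0 is 0xC0, so the simplified sync-word test does not mistake a JPEG for an MP3.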

Bitmap Identity:

Each bitmap can optionally be assigned an ID string via mtmd_bitmap_set_id(). This is useful for KV cache tracking in multi-turn conversations where the same image may be referenced multiple times without requiring re-encoding.
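The text above does not say how an ID should be chosen. One plausible scheme, assumed here purely for illustration, is to derive it from a hash of the bitmap's raw bytes so that identical media always receives the same ID. The sketch uses the 64-bit FNV-1a hash; fnv1a_64 and make_bitmap_id are hypothetical helpers, not mtmd functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// 64-bit FNV-1a over a byte buffer (standard offset basis and prime).
uint64_t fnv1a_64(const unsigned char * data, size_t len) {
    assert(data != nullptr || len == 0);
    uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            // FNV prime
    }
    return h;
}

// Hypothetical helper: format the hash as a 16-digit hex string that could
// serve as a bitmap ID. Identical bytes always give an identical ID.
std::string make_bitmap_id(const unsigned char * data, size_t len) {
    char buf[17];
    std::snprintf(buf, sizeof(buf), "%016llx",
                  static_cast<unsigned long long>(fnv1a_64(data, len)));
    return std::string(buf);
}
```

The resulting string could then be passed to mtmd_bitmap_set_id(). A content-derived ID lets a cache recognize a repeated image without comparing pixel buffers, at the cost of a small collision probability.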
