Principle:Ggml org Llama cpp Multimodal Input Preparation
| Aspect | Detail |
|---|---|
| Principle Name | Multimodal Input Preparation |
| Domain | Multimodal Inference |
| Scope | Preprocessing multimodal inputs: image decoding, resizing, audio resampling |
| Related Workflow | Multimodal_Inference |
Overview
Description
Before multimodal data can be fed into the encoding pipeline, raw media files must be preprocessed into a standardized internal representation called a bitmap (mtmd_bitmap). For images, this involves decoding compressed formats (JPEG, PNG, BMP, GIF), extracting pixel data, and storing it in a raw RGB byte array. For audio, this involves decoding compressed formats (WAV, MP3, FLAC), resampling to the model's expected sample rate, and converting to mono float32 PCM.
Usage
Input preparation is performed after the multimodal context is initialized and before tokenization. Each media file (image or audio) becomes a single mtmd_bitmap object that is subsequently passed to the tokenization function. The bitmap abstraction provides a unified interface for both vision and audio inputs, despite their fundamentally different data layouts.
Theoretical Basis
Multimodal models require inputs in specific formats that match their training data distribution. The preprocessing pipeline must bridge the gap between arbitrary user-provided media files and the model's expected input format.
Image Preprocessing:
Images arrive in various compressed formats with different color spaces, bit depths, and resolutions. The preprocessing pipeline must:
- Decode the compressed format into raw pixel data using a library such as stb_image
- Convert to RGB: Regardless of source format (RGBA, grayscale, indexed color), convert to 3-channel RGB
- Store as contiguous byte array: The bitmap stores pixels in RGBRGBRGB... order, occupying nx * ny * 3 bytes in total
The actual resizing, normalization, and patch extraction are handled later by the vision encoder during the encoding step, not during bitmap construction. This separation allows the same bitmap to be reused across different encoding configurations.
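The conversion and layout steps listed above can be sketched as follows. This is a minimal illustration, not the llama.cpp implementation: the helper name to_packed_rgb and its channel convention (1 = gray, 2 = gray + alpha, 3 = RGB, 4 = RGBA) are assumptions made for the example. In practice a decoder such as stb_image can be asked for a 3-channel result directly through its desired-channels argument, which makes a manual pass like this unnecessary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a llama.cpp function): repack decoded pixels
// into tightly packed RGB, i.e. exactly nx * ny * 3 bytes.
// `channels` describes the source layout: 1 = gray, 2 = gray + alpha,
// 3 = RGB, 4 = RGBA. Alpha is simply dropped.
std::vector<uint8_t> to_packed_rgb(const std::vector<uint8_t> & src,
                                   uint32_t nx, uint32_t ny, int channels) {
    assert(channels >= 1 && channels <= 4);
    const size_t n_pixels = (size_t) nx * ny;
    assert(src.size() == n_pixels * (size_t) channels);

    std::vector<uint8_t> rgb(n_pixels * 3);
    for (size_t i = 0; i < n_pixels; ++i) {
        const uint8_t * p = &src[i * (size_t) channels];
        if (channels <= 2) {
            // grayscale source: replicate the luminance into R, G and B
            rgb[i * 3 + 0] = p[0];
            rgb[i * 3 + 1] = p[0];
            rgb[i * 3 + 2] = p[0];
        } else {
            // RGB or RGBA source: copy the first three components
            rgb[i * 3 + 0] = p[0];
            rgb[i * 3 + 1] = p[1];
            rgb[i * 3 + 2] = p[2];
        }
    }
    return rgb;
}
```

The returned buffer has the RGBRGBRGB... layout described above, so its size is always nx * ny * 3 regardless of the source channel count.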
Audio Preprocessing:
Audio files require different preprocessing:
- Decode the compressed format (WAV, MP3, FLAC) using a library such as miniaudio
- Resample to the model's expected sample rate (e.g., 16000 Hz for Whisper-based models)
- Convert to mono: Stereo or multi-channel audio is mixed down to a single channel
- Store as float32 PCM: Audio samples are stored as 32-bit floating-point values
The target sample rate is a property of the model and is queried from the multimodal context via mtmd_get_audio_bitrate(). Despite the "bitrate" in its name, the value is used here as a sample rate in Hz (for example, 16000).
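The mixdown and resampling steps above can be illustrated with a small standalone sketch. The function names mix_to_mono and resample_linear are hypothetical, and linear interpolation is used only because it is short; it applies no anti-aliasing filter, so it is not a substitute for a production resampler. A decoding library such as miniaudio can usually be configured to emit the target format, channel count, and sample rate directly, which is the more likely approach in real code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: average interleaved channels into a single channel.
std::vector<float> mix_to_mono(const std::vector<float> & interleaved, int n_channels) {
    assert(n_channels > 0);
    assert(interleaved.size() % (size_t) n_channels == 0);
    const size_t n_frames = interleaved.size() / (size_t) n_channels;
    std::vector<float> mono(n_frames);
    for (size_t i = 0; i < n_frames; ++i) {
        float sum = 0.0f;
        for (int c = 0; c < n_channels; ++c) {
            sum += interleaved[i * (size_t) n_channels + (size_t) c];
        }
        mono[i] = sum / (float) n_channels;
    }
    return mono;
}

// Hypothetical helper: resample mono float32 PCM by linear interpolation.
// No low-pass filtering is applied, so downsampling may alias.
std::vector<float> resample_linear(const std::vector<float> & in, int rate_in, int rate_out) {
    assert(rate_in > 0 && rate_out > 0);
    if (in.empty()) {
        return {};
    }
    const size_t n_out = (size_t) ((double) in.size() * rate_out / rate_in);
    std::vector<float> out(n_out);
    for (size_t i = 0; i < n_out; ++i) {
        // position of output sample i on the input time axis
        const double pos = (double) i * rate_in / rate_out;
        const size_t i0  = std::min((size_t) pos, in.size() - 1);
        const size_t i1  = std::min(i0 + 1, in.size() - 1);
        const float  t   = (float) (pos - (double) i0);
        out[i] = in[i0] * (1.0f - t) + in[i1] * t;
    }
    return out;
}
```

A stereo 32000 Hz recording would go through mix_to_mono(samples, 2) and then resample_linear(mono, 32000, 16000) to reach a 16000 Hz target, with the target rate taken from the model rather than hard-coded.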
Format Detection:
The mtmd_helper_bitmap_init_from_buf() function automatically detects the input format by examining magic bytes at the start of the buffer:
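- WAV: Starts with "RIFF", followed by "WAVE" at offset 8
- MP3: Starts with "ID3", or begins with an MPEG frame sync (first byte 0xFF, followed by a byte whose top three bits are set, i.e. matching mask 0xE0)
- FLAC: Starts with "fLaC"
- Otherwise: Treated as an image and passed to stb_image for decoding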
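The magic-byte checks described above can be sketched as a standalone function. The names media_kind and detect_media_kind are illustrative and are not llama.cpp identifiers; the length guards are an added assumption so that short buffers are never read out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Illustrative names only; these are not llama.cpp identifiers.
enum class media_kind { wav, mp3, flac, image };

media_kind detect_media_kind(const unsigned char * buf, size_t len) {
    assert(buf != nullptr || len == 0);

    // WAV: "RIFF" at offset 0 and "WAVE" at offset 8 (needs 12 bytes)
    if (len >= 12 && std::memcmp(buf, "RIFF", 4) == 0 &&
                     std::memcmp(buf + 8, "WAVE", 4) == 0) {
        return media_kind::wav;
    }
    // FLAC: stream marker "fLaC"
    if (len >= 4 && std::memcmp(buf, "fLaC", 4) == 0) {
        return media_kind::flac;
    }
    // MP3 with a leading ID3v2 tag
    if (len >= 3 && std::memcmp(buf, "ID3", 3) == 0) {
        return media_kind::mp3;
    }
    // MP3 without a tag: 0xFF followed by a byte with its top three bits set
    if (len >= 2 && buf[0] == 0xFF && (buf[1] & 0xE0) == 0xE0) {
        return media_kind::mp3;
    }
    // anything else is handed to the image decoder
    return media_kind::image;
}
```

Note that a JPEG file also starts with 0xFF, but its second byte is 0xD8, which does not match the 0xE0 mask, so it falls through to the image path as intended.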
Bitmap Identity:
Each bitmap can optionally be assigned an ID string via mtmd_bitmap_set_id(). This is useful for KV cache tracking in multi-turn conversations where the same image may be referenced multiple times without requiring re-encoding.
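One way to obtain a stable ID is to derive it from the media bytes themselves, so that identical inputs map to the same string. The sketch below is an assumption about how a caller might do this, not a description of what llama.cpp does: content_id is a hypothetical helper, and 64-bit FNV-1a is chosen only for brevity (it is not collision-resistant in a cryptographic sense).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: hash the raw media bytes with 64-bit FNV-1a and
// return the result as a 16-character lowercase hex string.
std::string content_id(const unsigned char * data, size_t len) {
    assert(data != nullptr || len == 0);
    uint64_t h = 0xcbf29ce484222325ULL;   // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            // FNV-1a 64-bit prime
    }
    char hex[17];
    std::snprintf(hex, sizeof(hex), "%016llx", (unsigned long long) h);
    return std::string(hex);
}
```

The resulting string could then be passed to mtmd_bitmap_set_id(), letting a caller compare IDs across turns to decide whether an image has already been encoded.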