Workflow: ggml-org/llama.cpp Multimodal Inference

Knowledge Sources	llama.cpp Multimodal Documentation MTMD Tool
Domains	LLMs, Multimodal, Vision, Inference
Last Updated	2026-02-14 22:00 GMT

Overview

End-to-end process for running inference with multimodal inputs (text combined with images or audio) using GGUF language models and CLIP-based projection models.

Description

This workflow enables language models to process and reason about non-text inputs such as images and audio alongside text prompts. It uses a two-model architecture: a main language model for text generation and a multimodal projector (mmproj) that encodes visual or audio inputs into the language model's embedding space. The projector is based on CLIP (Contrastive Language-Image Pre-training) and supports multiple vision architectures including LLaVA, Gemma 3, MiniCPM-V, GraniteVision, MobileVLM, and others. Audio support is available for speech-capable models like MiniCPM-o and Qwen2-Audio.
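
To make the projector's role concrete, the sketch below shows the simplest possible form of the projection step: one linear map from the media encoder's feature width into the language model's embedding width. The function name, the dimensions, and the weights are illustrative assumptions. Real projectors are loaded from the mmproj file and are often multi-layer, so this pictures the idea and is not llama.cpp's implementation.

```python
def project(features, weight):
    """Map encoder features (n_tokens x d_media) into the language model's
    embedding space (n_tokens x d_model) with a single linear layer.

    `weight` has shape d_model x d_media. All names here are hypothetical.
    """
    d_media = len(weight[0])
    projected = []
    for token in features:
        if len(token) != d_media:
            raise ValueError("feature width does not match projector input width")
        # One output value per weight row: dot(row, token).
        projected.append([sum(w * x for w, x in zip(row, token)) for row in weight])
    return projected


# Two media tokens of width 2 are projected to width 3 (toy sizes).
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # d_model=3, d_media=2
embeddings = project([[2.0, 3.0], [0.5, -0.5]], weight)
print(embeddings)
```

The point is only that each media token leaves the projector with the same width as a text token embedding, which is what lets the language model consume both in one sequence.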

Usage

Use this workflow when you need a language model to understand and respond to visual content (images, screenshots, diagrams) or audio content alongside text instructions. Typical tasks include image captioning, visual question answering, document understanding, and audio transcription or analysis.

Execution Steps

Step 1: Obtain Compatible Models

Acquire both the main language model (in GGUF format) and its corresponding multimodal projector (mmproj) file. The mmproj must be specifically trained or converted for the target language model architecture.

Key considerations:

The language model and projector must be from the same model family
Many vision-language models provide the mmproj as a separate GGUF file
Supported architectures include LLaVA 1.5/1.6, Gemma 3, MiniCPM-V, GraniteVision, MobileVLM, Qwen2-VL
Some models support both image and audio input, while others support only images
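
A quick sanity check before loading can catch a wrong download early. The sketch below only confirms that both files start with the GGUF magic bytes and reads the version field that follows, assuming the little-endian header layout. It cannot tell whether the projector matches the language model's family, and the helper names are hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF version number, or None if the file is not GGUF.

    Assumes a little-endian header: 4 magic bytes, then a uint32 version.
    """
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version


def check_pair(model_path, mmproj_path):
    """Return a list of problems found; an empty list means both look like GGUF."""
    problems = []
    for label, path in (("model", model_path), ("mmproj", mmproj_path)):
        if read_gguf_version(path) is None:
            problems.append(f"{label} file is not a GGUF file: {path}")
    return problems
```

Calling `check_pair("model.gguf", "mmproj.gguf")` with your own paths (the file names here are placeholders) returns an empty list when both headers look valid. Family compatibility still has to be confirmed from the model's own documentation.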

Step 2: Load Language Model

Load the main GGUF language model with standard model loading parameters. Configure GPU offloading and context size as appropriate for the combined text and media input length.

Key considerations:

Context size must be large enough for both text tokens and projected media embeddings
Image embeddings can consume significant context, often several hundred tokens per image depending on the projector and the image resolution
GPU offloading accelerates both language and vision processing
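
The context-size consideration comes down to simple arithmetic, sketched below. The numbers are illustrative assumptions: a square 336-pixel input cut into 14-pixel patches with no tiling or pooling, two images, a 200-token prompt, and room for 512 generated tokens. The actual per-image token count is set by the projector in use.

```python
def tokens_per_image(image_size, patch_size):
    """Patch-grid token count for a square image, with no tiling or pooling."""
    side = image_size // patch_size
    return side * side


def required_context(n_text_tokens, n_images, per_image_tokens, n_generate):
    """Smallest context size that fits the prompt, the media, and the reply."""
    return n_text_tokens + n_images * per_image_tokens + n_generate


per_image = tokens_per_image(336, 14)  # 24 x 24 patch grid
needed = required_context(
    n_text_tokens=200,   # assumed prompt length
    n_images=2,          # assumed number of images
    per_image_tokens=per_image,
    n_generate=512,      # assumed reply budget
)
print(per_image, needed)
```

If `needed` exceeds the configured context size, either raise the context size or reduce the number of images per request.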

Step 3: Load Multimodal Projector

Initialize the multimodal context by loading the projector (mmproj) model using the dedicated multimodal initialization API. The projector handles preprocessing, encoding, and projection of media inputs into the language model's embedding space.

Key considerations:

The projector can optionally be kept on CPU to save GPU memory
Different projector architectures handle different image resolutions and formats
The projector determines the number of embedding tokens per image or audio segment
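
For command-line use, loading the language model and the projector together can be sketched as assembling an argument list. This assumes the `llama-mtmd-cli` tool and the flag names `-m`, `--mmproj`, `--image`, `-p`, `-c`, `-ngl`, and `--no-mmproj-offload`; check them against the tool's `--help` output for your build. The file names and the helper function are hypothetical.

```python
import shlex


def build_mtmd_command(model, mmproj, image, prompt,
                       n_ctx=4096, n_gpu_layers=99, mmproj_on_gpu=True):
    """Assemble an argument list for a single-image run (flag names assumed)."""
    argv = [
        "llama-mtmd-cli",
        "-m", model,              # main language model (GGUF)
        "--mmproj", mmproj,       # multimodal projector (GGUF)
        "--image", image,
        "-p", prompt,
        "-c", str(n_ctx),         # context size for text plus media tokens
        "-ngl", str(n_gpu_layers),
    ]
    if not mmproj_on_gpu:
        # Keep the projector on CPU to save GPU memory.
        argv.append("--no-mmproj-offload")
    return argv


argv = build_mtmd_command(
    "model.gguf", "mmproj.gguf", "photo.jpg",
    "Describe this image.", mmproj_on_gpu=False,
)
print(shlex.join(argv))
```

Building the list first and passing it to `subprocess.run(argv)` avoids shell-quoting problems with prompts that contain spaces or quotes.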

Step 4: Prepare Multimodal Input

Load and preprocess the media files (images or audio) using the multimodal preprocessing pipeline. Images are decoded, resized, and normalized according to the projector's requirements. Audio files are converted to the expected sample rate and format.

Key considerations:

Supported image formats include PNG, JPEG, and other common formats
Images may be automatically resized or tiled based on the model's requirements
Audio is typically resampled to 16 kHz mono
Multiple images or mixed media types may be supported depending on the model
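
The audio conversion can be pictured with the sketch below, which downmixes interleaved channel frames to mono and resamples to 16 kHz. It uses plain linear interpolation with no anti-aliasing filter, chosen only to keep the example short. The multimodal preprocessing pipeline performs its own conversion, so the function names and the method here are assumptions for illustration.

```python
def to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample by linear interpolation between neighbouring input samples."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        if lo >= len(samples) - 1:
            # Past the last interpolation interval: hold the final sample.
            out.append(samples[-1])
            continue
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


# One second of 48 kHz stereo with a constant signal on the left channel.
stereo = [(1.0, 0.0)] * 48_000
mono_16k = resample_linear(to_mono(stereo), src_rate=48_000)
print(len(mono_16k), mono_16k[0])
```

For real recordings, a resampler with a proper low-pass filter is the safer choice, because downsampling without one can alias high frequencies into the speech band.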

Step 5: Encode and Generate

Encode the media inputs through the projector to produce embedding vectors, then insert these embeddings into the token stream at the positions marked for media. Finally, run the language model's generation loop to produce text output that draws on the multimodal context.

Key considerations:

Media embeddings replace special image/audio placeholder tokens in the input sequence
The language model generates text conditioned on both the text prompt and media embeddings
Interactive mode allows follow-up questions about the same media content
Chat templates handle the placement of media tokens within conversation format
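
The placeholder substitution can be sketched as splitting the prompt into an ordered list of text and media chunks. The marker string `<__media__>` is assumed here as the placeholder, and the chunk representation and function name are hypothetical. In the real pipeline the media chunks would carry projector embeddings instead of file names.

```python
MEDIA_MARKER = "<__media__>"  # assumed placeholder string


def split_prompt(prompt, media_files, marker=MEDIA_MARKER):
    """Split a prompt into ("text", str) and ("media", path) chunks, in order.

    Each marker in the prompt is paired with the next media file.
    """
    parts = prompt.split(marker)
    n_markers = len(parts) - 1
    if n_markers != len(media_files):
        raise ValueError(
            f"prompt has {n_markers} marker(s) but {len(media_files)} media file(s)"
        )
    chunks = []
    for i, text in enumerate(parts):
        if text:
            chunks.append(("text", text))
        if i < len(media_files):
            chunks.append(("media", media_files[i]))
    return chunks


chunks = split_prompt(
    "Compare <__media__> with <__media__>.",
    ["a.png", "b.png"],  # hypothetical file names
)
for kind, value in chunks:
    print(kind, repr(value))
```

Text chunks are tokenized as usual, while each media chunk is replaced by the embeddings the projector produced for that file. The resulting sequence is what the language model decodes before generation starts.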
