Workflow:Ggml org Llama cpp Multimodal Inference
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Multimodal, Vision, Inference |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
End-to-end process for running inference with multimodal inputs (text combined with images or audio) using GGUF language models and CLIP-based projection models.
Description
This workflow enables language models to process and reason about non-text inputs such as images and audio alongside text prompts. It uses a two-model architecture: a main language model for text generation and a multimodal projector (mmproj) that encodes visual or audio inputs into the language model's embedding space. The projector is based on CLIP (Contrastive Language-Image Pre-training) and supports multiple vision architectures, including LLaVA, Gemma 3, MiniCPM-V, GraniteVision, and MobileVLM. Audio support is available for speech-capable models such as MiniCPM-o and Qwen2-Audio.
Usage
Execute this workflow when you need a language model to understand and respond to visual content (images, screenshots, diagrams) or audio content alongside text instructions. This is appropriate for image captioning, visual question answering, document understanding, and audio transcription or analysis tasks.
Execution Steps
Step 1: Obtain Compatible Models
Acquire both the main language model (in GGUF format) and its corresponding multimodal projector (mmproj) file. The mmproj must be specifically trained or converted for the target language model architecture.
Key considerations:
- The language model and projector must be from the same model family
- Many vision-language models provide the mmproj as a separate GGUF file
- Supported architectures include LLaVA 1.5/1.6, Gemma 3, MiniCPM-V, GraniteVision, MobileVLM, Qwen2-VL
- Some models support both image and audio input, while others support only images
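Before loading, it can help to confirm that both downloaded files are at least valid GGUF containers. The following is a minimal sketch, not part of llama.cpp: it relies only on the fact that a GGUF file starts with the four bytes `GGUF` followed by a little-endian 32-bit version number. The function names are hypothetical, and the check says nothing about whether the two files belong to the same model family.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # The magic is followed by a little-endian uint32 format version.
    (version,) = struct.unpack("<I", header[4:8])
    return version


def check_model_pair(model_path, mmproj_path):
    """Confirm both files are GGUF; returns (model_version, mmproj_version)."""
    return read_gguf_version(model_path), read_gguf_version(mmproj_path)
```

A failed check usually points to a truncated download or a file in another format. Family compatibility still has to be confirmed from the model's own documentation.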
Step 2: Load Language Model
Load the main GGUF language model with standard model loading parameters. Configure GPU offloading and context size as appropriate for the combined text and media input length.
Key considerations:
- Context size must be large enough for both text tokens and projected media embeddings
- Image embeddings can consume significant context (often hundreds of tokens per image; LLaVA 1.5, for example, uses 576)
- GPU offloading accelerates both language and vision processing
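The context-size consideration above comes down to simple arithmetic, sketched below. The function names are hypothetical and the numbers are illustrative. The figure of 576 tokens per image is the 24 x 24 patch grid of the CLIP ViT-L/14 encoder at 336 px used by LLaVA 1.5; other projectors produce different counts.

```python
def required_context(text_tokens, n_images, tokens_per_image, max_new_tokens):
    """Total context needed: prompt text + projected media embeddings + generated text."""
    return text_tokens + n_images * tokens_per_image + max_new_tokens


def fits_context(n_ctx, text_tokens, n_images, tokens_per_image, max_new_tokens):
    """True if the request fits inside a context window of n_ctx tokens."""
    needed = required_context(text_tokens, n_images, tokens_per_image, max_new_tokens)
    return needed <= n_ctx


# Illustrative request: 120 prompt tokens, two images, up to 256 generated tokens.
needed = required_context(text_tokens=120, n_images=2,
                          tokens_per_image=576, max_new_tokens=256)
print(needed)  # 1528
```

With these numbers a 2048-token context is sufficient, while a 1024-token context would be exhausted before generation starts.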
Step 3: Load Multimodal Projector
Initialize the multimodal context by loading the projector (mmproj) model using the dedicated multimodal initialization API. The projector handles preprocessing, encoding, and projection of media inputs into the language model's embedding space.
Key considerations:
- The projector can optionally be kept on CPU to save GPU memory
- Different projector architectures handle different image resolutions and formats
- The projector determines the number of embedding tokens per image or audio segment
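As one way to exercise Steps 1 through 3 without writing against the C API, the sketch below assembles a command line for the `llama-mtmd-cli` tool. The binary name and the flags (`-m`, `--mmproj`, `--image`, `-p`, `-c`, `-ngl`, `--no-mmproj-offload`) are assumptions based on recent llama.cpp builds and should be verified with `--help`. The helper function and file names are hypothetical. The sketch only builds and prints the argument list; it does not run anything.

```python
import shlex


def build_mtmd_command(model, mmproj, image, prompt,
                       n_ctx=4096, gpu_layers=99, mmproj_on_cpu=False,
                       binary="llama-mtmd-cli"):
    """Assemble an argument list; flag names are assumptions, check `--help`."""
    argv = [
        binary,
        "-m", model,           # main GGUF language model
        "--mmproj", mmproj,    # multimodal projector GGUF
        "--image", image,      # media file to attach
        "-p", prompt,          # text prompt
        "-c", str(n_ctx),      # context size (text + media embeddings)
        "-ngl", str(gpu_layers),  # layers of the language model to offload
    ]
    if mmproj_on_cpu:
        # Assumed flag for keeping the projector on the CPU to save GPU memory.
        argv.append("--no-mmproj-offload")
    return argv


cmd = build_mtmd_command("model.gguf", "mmproj.gguf", "photo.png",
                         "Describe this image.", mmproj_on_cpu=True)
print(shlex.join(cmd))
```

Building the list separately from running it keeps quoting explicit and makes the projector-placement choice a single boolean.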
Step 4: Prepare Multimodal Input
Load and preprocess the media files (images or audio) using the multimodal preprocessing pipeline. Images are decoded, resized, and normalized according to the projector's requirements. Audio files are converted to the expected sample rate and format.
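The normalization part of image preprocessing can be illustrated in a few lines. This is a sketch under stated assumptions rather than the projector's actual code: it uses the per-channel mean and standard deviation from the original CLIP preprocessing, whereas a given projector may specify different values, and it skips decoding and resizing. The function names are hypothetical.

```python
# Per-channel constants from the original CLIP preprocessing (assumed here;
# a specific projector may use different values).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Map an 8-bit (R, G, B) pixel to normalized floats: (c / 255 - mean) / std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


def normalize_image(rows, mean=CLIP_MEAN, std=CLIP_STD):
    """Normalize an image given as a list of rows of (R, G, B) tuples."""
    return [[normalize_pixel(p, mean, std) for p in row] for row in rows]
```

After this step each channel is roughly zero-centered, which is the input range the vision encoder was trained on.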
Key considerations:
- Supported image formats include PNG, JPEG, and other common formats
- Images may be automatically resized or tiled based on the model's requirements
- Audio is typically resampled to 16 kHz mono
- Multiple images or mixed media types may be supported depending on the model
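The audio conversion mentioned above (downmix to mono, resample to 16 kHz) can be sketched in pure Python. This is an illustration only: it uses plain linear interpolation with no low-pass filter, so it will alias when downsampling, and a real pipeline would use a proper resampler. The function names are hypothetical.

```python
def to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample a mono signal by linear interpolation between neighbouring samples."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional position in the source
        lo = min(int(pos), last)           # sample at or before pos
        hi = min(lo + 1, last)             # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


stereo = [(1.0, 3.0), (2.0, 4.0)]
print(to_mono(stereo))  # [2.0, 3.0]
```

One second of 48 kHz audio, for example, comes out as 16000 samples.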
Step 5: Encode and Generate
Pass the media inputs through the projector to produce embedding vectors, then insert those embeddings into the token stream at the positions reserved for media. Finally, run the language model's generation loop to produce text output conditioned on the multimodal context.
Key considerations:
- Media embeddings replace special image/audio placeholder tokens in the input sequence
- The language model generates text conditioned on both the text prompt and media embeddings
- Interactive mode allows follow-up questions about the same media content
- Chat templates handle the placement of media tokens within the conversation format
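The placeholder replacement described above can be pictured as splitting the prompt at each media marker and interleaving text chunks with media chunks. The sketch below is a simplified model of that idea, not the library's implementation. The marker string `<__media__>` is assumed to match the default marker of llama.cpp's mtmd library, and the function name and chunk representation are hypothetical.

```python
MEDIA_MARKER = "<__media__>"  # assumed default placeholder; verify for your build


def interleave_chunks(prompt, media, marker=MEDIA_MARKER):
    """Split the prompt at each marker and pair every marker with one media item.

    Returns an ordered list of ("text", str) and ("media", item) chunks.
    """
    parts = prompt.split(marker)
    if len(parts) - 1 != len(media):
        raise ValueError(
            f"prompt has {len(parts) - 1} marker(s) but {len(media)} media item(s)"
        )
    chunks = []
    for i, text in enumerate(parts):
        if text:
            chunks.append(("text", text))
        if i < len(media):
            chunks.append(("media", media[i]))
    return chunks


chunks = interleave_chunks(
    "Compare <__media__> with <__media__>.", ["a.png", "b.png"]
)
for kind, value in chunks:
    print(kind, repr(value))
```

In a real run, each text chunk would be tokenized and each media chunk replaced by its projected embeddings before the combined sequence is evaluated by the language model.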