Workflow:Ggml org Llama cpp Multimodal Inference

From Leeroopedia
Knowledge Sources
Domains: LLMs, Multimodal, Vision, Inference
Last Updated: 2026-02-14 22:00 GMT

Overview

End-to-end process for running inference with multimodal inputs (text combined with images or audio) using GGUF language models and CLIP-based projection models.

Description

This workflow enables language models to process and reason about non-text inputs such as images and audio alongside text prompts. It uses a two-model architecture: a main language model for text generation and a multimodal projector (mmproj) that encodes visual or audio inputs into the language model's embedding space. For images, the projector pairs a CLIP-style (Contrastive Language-Image Pre-training) vision encoder with a projection layer, and projectors exist for multiple vision architectures including LLaVA, Gemma 3, MiniCPM-V, GraniteVision, MobileVLM, and others. Audio support is available for speech-capable models such as MiniCPM-o and Qwen2-Audio.

Usage

Use this workflow when you need a language model to understand and respond to visual content (images, screenshots, diagrams) or audio content alongside text instructions. Typical tasks are image captioning, visual question answering, document understanding, and audio transcription or analysis.

Execution Steps

Step 1: Obtain Compatible Models

Acquire both the main language model (in GGUF format) and its corresponding multimodal projector (mmproj) file. The mmproj must be specifically trained or converted for the target language model architecture.

Key considerations:

  • The language model and projector must be from the same model family
  • Many vision-language models provide the mmproj as a separate GGUF file
  • Supported architectures include LLaVA 1.5/1.6, Gemma 3, MiniCPM-V, GraniteVision, MobileVLM, Qwen2-VL
  • Some models support both image and audio input, while others support only images
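
Before loading anything, it can help to confirm that both files are at least valid GGUF containers. The following is a minimal sketch, assuming only the documented GGUF header layout (the four magic bytes `GGUF` followed by a little-endian 32-bit version). The function names are hypothetical, and the check does not verify that the two files belong to the same model family.

```python
import struct
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_version(path):
    """Return the GGUF version number, or None if the file is not GGUF."""
    with Path(path).open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # The magic is followed by a little-endian uint32 format version.
    (version,) = struct.unpack("<I", header[4:8])
    return version


def check_model_pair(model_path, mmproj_path):
    """Return a list of problems; an empty list means both files look like GGUF."""
    problems = []
    for label, path in (("language model", model_path), ("mmproj", mmproj_path)):
        if not Path(path).is_file():
            problems.append(f"{label}: file not found: {path}")
        elif read_gguf_version(path) is None:
            problems.append(f"{label}: not a GGUF file: {path}")
    return problems
```

An empty result only rules out missing or mislabeled files; a projector from a different model family would still pass this check.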

Step 2: Load Language Model

Load the main GGUF language model with standard model loading parameters. Configure GPU offloading and context size as appropriate for the combined text and media input length.

Key considerations:

  • Context size must be large enough for both text tokens and projected media embeddings
  • Image embeddings can consume significant context, often hundreds of tokens per image (for example, 576 for LLaVA 1.5 at 336×336 resolution)
  • GPU offloading accelerates both language and vision processing
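
The context-size consideration above comes down to simple arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the per-image token count assumes a plain ViT encoder with no pooling or tiling, which does not hold for every projector.

```python
def vit_patch_tokens(image_size, patch_size):
    """Tokens produced by a plain ViT encoder with no pooling or tiling."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def required_context(prompt_tokens, n_images, tokens_per_image, max_new_tokens):
    """Minimum context size that fits the prompt, all images, and the reply."""
    return prompt_tokens + n_images * tokens_per_image + max_new_tokens


# Example: a 336-pixel input with 14-pixel patches (the LLaVA 1.5 vision tower).
per_image = vit_patch_tokens(336, 14)
needed = required_context(prompt_tokens=64, n_images=2,
                          tokens_per_image=per_image, max_new_tokens=256)
print(per_image, needed)  # prints: 576 1472
```

In practice the configured context size should sit comfortably above this minimum, since chat templates add their own tokens.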

Step 3: Load Multimodal Projector

Initialize the multimodal context by loading the projector (mmproj) model using the dedicated multimodal initialization API. The projector handles preprocessing, encoding, and projection of media inputs into the language model's embedding space.

Key considerations:

  • The projector can optionally be kept on CPU to save GPU memory, at the cost of slower media encoding
  • Different projector architectures handle different image resolutions and formats
  • The projector determines the number of embedding tokens per image or audio segment
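
For a quick trial of Steps 2 and 3 without writing against the API, llama.cpp ships a command-line tool, `llama-mtmd-cli`, that performs both loads. The sketch below only assembles an argument list. The flags shown (`-m`, `--mmproj`, `-c`, `-ngl`, `--no-mmproj-offload`, `--image`, `--audio`, `-p`) are given as understood from the llama.cpp documentation and should be checked against `llama-mtmd-cli --help` for the installed build; the function name, defaults, and file names are hypothetical.

```python
import shlex


def build_mtmd_command(model, mmproj, prompt, image=None, audio=None,
                       ctx_size=4096, gpu_layers=99, mmproj_on_cpu=False):
    """Assemble an argument list for llama.cpp's llama-mtmd-cli tool."""
    cmd = ["llama-mtmd-cli",
           "-m", model,            # main GGUF language model
           "--mmproj", mmproj,     # multimodal projector GGUF
           "-c", str(ctx_size),    # context size in tokens
           "-ngl", str(gpu_layers)]  # layers to offload to the GPU
    if mmproj_on_cpu:
        # Keep the projector on CPU to save GPU memory.
        cmd.append("--no-mmproj-offload")
    if image is not None:
        cmd += ["--image", image]
    if audio is not None:
        cmd += ["--audio", audio]
    cmd += ["-p", prompt]
    return cmd


# Hypothetical file names, for illustration only.
example = build_mtmd_command("model.gguf", "mmproj.gguf",
                             "Describe the image.", image="cat.png",
                             mmproj_on_cpu=True)
print(shlex.join(example))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting problems with prompts that contain spaces or quotes.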

Step 4: Prepare Multimodal Input

Load and preprocess the media files (images or audio) using the multimodal preprocessing pipeline. Images are decoded, resized, and normalized according to the projector's requirements. Audio files are converted to the expected sample rate and format.
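
As a rough illustration of the resize and normalize stages for images, the sketch below uses the per-channel statistics published with OpenAI's CLIP. Other projectors may use different values and different resizing rules, so both the constants and the helper names here are assumptions rather than what any particular mmproj does.

```python
# Per-channel statistics published with OpenAI's CLIP; other projectors may
# use different values (an assumption to check against the mmproj metadata).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def fit_within(width, height, target):
    """Scale (width, height) so the longer side equals target, keeping aspect ratio."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Map an 8-bit RGB triple to zero-centred floats, channel by channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))
```

A 1280×720 frame, for instance, would be scaled to 336×189 before padding or tiling brings it to the encoder's input shape.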

Key considerations:

  • Common image formats such as PNG and JPEG are supported
  • Images may be automatically resized or tiled based on the model's requirements
  • Audio is typically resampled to 16 kHz mono
  • Multiple images or mixed media types may be supported depending on the model
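
The audio conversion can be sketched in a few lines. This is a simplified illustration with hypothetical helper names: it downmixes by averaging channels and resamples by linear interpolation, with no low-pass filter, so downsampling this way can alias. A production pipeline would use a proper resampling filter.

```python
def to_mono(frames):
    """Average the channels of each frame: [(l, r), ...] -> [m, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample a mono signal by linear interpolation between neighbours."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, int(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

For example, a 48 kHz stereo recording would first pass through `to_mono` and then `resample_linear(mono, 48000)`, which keeps one output sample for every three input samples.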

Step 5: Encode and Generate

Encode the media inputs through the projector to produce embedding vectors, and insert these embeddings into the token stream at the appropriate positions. Then run the language model's generation loop to produce text output conditioned on the multimodal context.

Key considerations:

  • Media embeddings replace special image/audio placeholder tokens in the input sequence
  • The language model generates text conditioned on both the text prompt and media embeddings
  • Interactive mode allows follow-up questions about the same media content
  • Chat templates handle the placement of media tokens within the conversation format
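
The placeholder replacement described above can be pictured as splitting the tokenized prompt into alternating text and media chunks. The sketch below is a conceptual illustration, not the library's implementation: the function names and the placeholder ID are hypothetical, and counting one context position per embedding vector is a simplification that may not hold for every model.

```python
def split_on_media(token_ids, placeholder_id, media_embeddings):
    """Split a token sequence into text and media chunks, in order.

    Each occurrence of placeholder_id is replaced by the next entry of
    media_embeddings (a list of embedding vectors for one image or clip).
    """
    media = iter(media_embeddings)
    chunks, text = [], []
    for tok in token_ids:
        if tok != placeholder_id:
            text.append(tok)
            continue
        if text:
            chunks.append(("text", text))
            text = []
        try:
            chunks.append(("media", next(media)))
        except StopIteration:
            raise ValueError("more placeholders than media inputs") from None
    if text:
        chunks.append(("text", text))
    if next(media, None) is not None:
        raise ValueError("more media inputs than placeholders")
    return chunks


def sequence_length(chunks):
    """Positions consumed in the context: one per token or embedding vector."""
    return sum(len(payload) for _, payload in chunks)
```

Text chunks are then evaluated as ordinary tokens and media chunks as embedding batches, in the same order, before generation starts.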
