Principle:Ggml org Llama cpp Multimodal Model Acquisition
| Aspect | Detail |
|---|---|
| Principle Name | Multimodal Model Acquisition |
| Domain | Multimodal Inference |
| Scope | Obtaining compatible multimodal model files for llama.cpp |
| Related Workflow | Multimodal_Inference |
Overview
Description
Multimodal inference in llama.cpp requires two separate model files that work in tandem: a text/language model in GGUF format and a multimodal projector (mmproj) in GGUF format. The text model handles language understanding and generation, while the projector bridges the gap between a vision or audio encoder and the text model's embedding space. Both files must be obtained and matched correctly for multimodal inference to function.
Usage
Before any multimodal inference pipeline can begin, the user must acquire two compatible GGUF files:
- A text model GGUF (e.g., `model.gguf`) that serves as the language backbone
- A multimodal projector GGUF (e.g., `mmproj.gguf`) that was trained or fine-tuned alongside the text model to project vision or audio features into the text model's embedding space
These files are typically hosted on model repositories such as Hugging Face. Both must originate from the same model family to ensure compatibility.
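After downloading, a quick sanity check can confirm that both files are actually GGUF before they are handed to an inference pipeline. The sketch below relies only on the fact that a GGUF file starts with the four magic bytes `GGUF` followed by a 32-bit version field. The helper names `read_gguf_version` and `check_pair` are hypothetical, and the version is read as little-endian, which is an assumption that does not hold for big-endian GGUF files. The check says nothing about whether the two files belong to the same model family.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path):
    """Return the GGUF version number, or None if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Assumption: the uint32 version after the magic is little-endian.
    (version,) = struct.unpack("<I", header[4:8])
    return version


def check_pair(text_model_path, mmproj_path):
    """Raise ValueError unless both files are GGUF; return their versions."""
    versions = {}
    for label, path in (("text model", text_model_path), ("mmproj", mmproj_path)):
        version = read_gguf_version(path)
        if version is None:
            raise ValueError(f"{label} file {path!s} is not a GGUF file")
        versions[label] = version
    return versions
```

A call such as `check_pair("model.gguf", "mmproj.gguf")` would then either return the two version numbers or fail early with a message naming the offending file.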
Theoretical Basis
Multimodal large language models (MLLMs) extend standard text-only LLMs by incorporating additional sensory modalities such as vision and audio. The dominant architecture pattern involves three components:
1. Modality-Specific Encoder: A pre-trained encoder (e.g., a Vision Transformer for images, or a Whisper-style encoder for audio) that converts raw sensory input into high-dimensional feature representations. This encoder is typically not distributed separately in llama.cpp; its weights are embedded within the mmproj GGUF file.
2. Cross-Modal Projector: A learned projection layer (or small network) that maps encoder output features into the same dimensional space as the text model's token embeddings. In llama.cpp, the projector weights are stored in the mmproj file together with the encoder weights. The projector must be specifically trained for the target text model, as embedding dimensions and feature distributions vary between architectures.
3. Text/Language Model: The base LLM that receives a sequence of token embeddings (some from text, some projected from other modalities) and generates text output through autoregressive decoding.
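The data flow through the three components can be illustrated with a toy example. The sketch below is not llama.cpp code: it assumes the projector is a single linear layer (real projectors are often small multi-layer networks), represents embeddings as plain Python lists, and uses the hypothetical function names `project` and `build_input_sequence`. It shows how encoder features of one width are mapped to the text model's embedding width and then placed in the same sequence as text token embeddings.

```python
def project(features, weight, bias):
    """Apply a linear projector to each encoder feature vector.

    features: list of vectors, each of length d_enc (encoder width)
    weight:   d_txt rows, each of length d_enc
    bias:     vector of length d_txt (text embedding width)
    """
    projected = []
    for vec in features:
        out = [
            sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)
        ]
        projected.append(out)
    return projected


def build_input_sequence(text_before, image_features, text_after, weight, bias):
    """Concatenate text embeddings with projected image embeddings."""
    image_embeddings = project(image_features, weight, bias)
    sequence = text_before + image_embeddings + text_after
    d_txt = len(bias)
    # Every position must have the width the text model expects.
    if any(len(e) != d_txt for e in sequence):
        raise ValueError("all embeddings must match the text model's dimension")
    return sequence


# Toy sizes: encoder width 3, text embedding width 2.
weight = [[1, 0, 0], [0, 1, 1]]
bias = [0.5, -1]
features = [[1, 2, 3], [4, 5, 6]]
sequence = build_input_sequence([[0.0, 0.0]], features, [[9.0, 9.0]], weight, bias)
```

The text model only ever sees `sequence`, a list of vectors of its own embedding width, which is why a projector with a different output width cannot be substituted.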
The separation of text model and projector into distinct GGUF files provides several benefits:
- Quantization flexibility: The text model can be quantized independently (e.g., Q4_K_M, Q5_K_S) while the projector typically remains at higher precision (e.g., F16) to preserve cross-modal alignment quality.
- Model composability: Different quantization levels of the same text model can share a single projector file.
- Storage efficiency: Users who only need text inference can skip downloading the projector entirely.
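The composability point can be sketched as a small helper that pairs every quantized text model in a download directory with the one projector found there. The function name `pair_with_projector` and the file names are hypothetical, and the rule that projector files contain `mmproj` in their name is an assumed naming heuristic, not something llama.cpp enforces.

```python
def pair_with_projector(filenames):
    """Pair every text-model GGUF with the single projector in the listing.

    Assumption: projector files are recognizable by "mmproj" in their name.
    """
    gguf = [n for n in filenames if n.lower().endswith(".gguf")]
    projectors = [n for n in gguf if "mmproj" in n.lower()]
    text_models = sorted(n for n in gguf if "mmproj" not in n.lower())
    if len(projectors) != 1:
        raise ValueError(f"expected exactly one projector, found {len(projectors)}")
    # Each quantization level reuses the same projector file.
    return [(model, projectors[0]) for model in text_models]


# Hypothetical directory listing for one model family.
listing = ["model-Q4_K_M.gguf", "model-Q5_K_S.gguf", "mmproj-F16.gguf", "README.md"]
pairs = pair_with_projector(listing)
```

Here `pairs` contains one entry per quantized text model, each pointing at the same F16 projector, so adding another quantization level costs one extra text model file and no extra projector.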
The pairing requirement means that an mmproj trained for one model family (e.g., LLaVA 1.6 based on Vicuna-7B) will not work with a different model family (e.g., Phi-3-Vision). The model provider must have trained both components together or ensured dimensional compatibility.
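A pre-flight check for the pairing requirement might look like the sketch below. It assumes the caller has already obtained the model family and the relevant dimensions from each file's metadata; the classes `TextModelInfo` and `ProjectorInfo`, their field names, and the function `compatibility_problems` are hypothetical and do not correspond to real GGUF metadata keys. The dimension values are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextModelInfo:
    family: str
    embedding_dim: int  # width of the text model's token embeddings


@dataclass(frozen=True)
class ProjectorInfo:
    family: str
    projection_dim: int  # width of the projector's output vectors


def compatibility_problems(text_model, projector):
    """Return a list of human-readable reasons the pair should not be used."""
    problems = []
    if text_model.family != projector.family:
        problems.append(
            f"family mismatch: {text_model.family} vs {projector.family}"
        )
    if text_model.embedding_dim != projector.projection_dim:
        problems.append(
            f"dimension mismatch: text model expects {text_model.embedding_dim}, "
            f"projector outputs {projector.projection_dim}"
        )
    return problems


# Illustrative values only.
text_model = TextModelInfo("llava-1.6-vicuna-7b", 4096)
matching = ProjectorInfo("llava-1.6-vicuna-7b", 4096)
foreign = ProjectorInfo("phi-3-vision", 3072)
```

With these inputs, `compatibility_problems(text_model, matching)` returns an empty list, while the `foreign` projector is rejected on both counts. Matching dimensions are necessary but not sufficient: two unrelated models can share an embedding width and still be incompatible, which is why the family check is kept separate.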