Implementation:Ollama Ollama Mtmd LLaVA
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, VisionEncoder |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Multimodal graph builder for LLaVA-style vision models, also serving as the default adapter for Granite and GLM-Edge variants.
Description
Implements clip_graph_llava::build(), which constructs a ggml computation graph for the LLaVA vision encoder. The builder supports optional class-embedding concatenation, learned position embeddings, pre-layer normalization, and deep feature stacking (extracting activations from multiple intermediate layers, as used by Granite vision). It also supports several projector backends: LLaVA MLP, MLP with normalization, LDP/LDPv2 (MobileVLM), the MiniCPM-V resampler, and the GLM-Edge adapter. Each projector maps ViT features into the language model's embedding space.
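To make the last two steps concrete, the following is a minimal sketch of deep feature stacking followed by an MLP projector, written with plain std::vector rather than ggml tensors. It assumes the common LLaVA-1.5 projector layout (linear, GELU, linear) and uses the exact erf form of GELU. The names Linear, stack_features, and mlp_project are hypothetical and do not appear in llava.cpp; the real builder expresses these steps as ggml graph nodes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense layer with a row-major weight matrix
// (out_dim rows, in_dim columns). Not a type from llava.cpp.
struct Linear {
    std::size_t in_dim;
    std::size_t out_dim;
    std::vector<float> weight; // size out_dim * in_dim
    std::vector<float> bias;   // size out_dim
};

// y = W x + b for one patch feature vector.
std::vector<float> apply_linear(const Linear & layer, const std::vector<float> & x) {
    assert(x.size() == layer.in_dim);
    assert(layer.weight.size() == layer.in_dim * layer.out_dim);
    assert(layer.bias.size() == layer.out_dim);
    std::vector<float> y(layer.out_dim);
    for (std::size_t o = 0; o < layer.out_dim; ++o) {
        float acc = layer.bias[o];
        for (std::size_t i = 0; i < layer.in_dim; ++i) {
            acc += layer.weight[o * layer.in_dim + i] * x[i];
        }
        y[o] = acc;
    }
    return y;
}

// Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
float gelu(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// Deep feature stacking for one patch: concatenate the activations taken
// from each selected encoder layer along the feature axis.
std::vector<float> stack_features(const std::vector<std::vector<float>> & per_layer) {
    std::vector<float> stacked;
    for (const std::vector<float> & feats : per_layer) {
        stacked.insert(stacked.end(), feats.begin(), feats.end());
    }
    return stacked;
}

// Two-layer MLP projector: linear, GELU, linear.
std::vector<float> mlp_project(const Linear & fc1, const Linear & fc2,
                               const std::vector<float> & patch) {
    std::vector<float> hidden = apply_linear(fc1, patch);
    for (float & v : hidden) {
        v = gelu(v);
    }
    return apply_linear(fc2, hidden);
}
```

With stacking enabled, the projector's input width in this sketch equals the sum of the stacked layer widths, so fc1.in_dim has to match stack_features(...).size().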
Usage
Automatically selected by the CLIP system when the loaded model uses a LLaVA-compatible projector type (MLP, MLP_NORM, LDP, LDPV2, MiniCPM-V, GLM-Edge).
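The selection rule can be pictured as a simple dispatch on projector type. The sketch below is illustrative only: the ProjectorKind enum and uses_llava_builder function are hypothetical names, not identifiers from the Ollama or llama.cpp sources, and the list of kinds simply mirrors the one given on this page.

```cpp
// Hypothetical enum; names are illustrative and not taken from the real sources.
enum class ProjectorKind { Mlp, MlpNorm, Ldp, LdpV2, MiniCpmV, GlmEdge, Other };

// Returns true for the projector kinds this page lists as handled by the
// LLaVA graph builder, false for anything else.
constexpr bool uses_llava_builder(ProjectorKind kind) {
    switch (kind) {
        case ProjectorKind::Mlp:
        case ProjectorKind::MlpNorm:
        case ProjectorKind::Ldp:
        case ProjectorKind::LdpV2:
        case ProjectorKind::MiniCpmV:
        case ProjectorKind::GlmEdge:
            return true;
        case ProjectorKind::Other:
            return false;
    }
    return false; // unreachable for valid enum values
}
```

A caller would check uses_llava_builder(kind) once after loading the model and construct the matching graph builder for each encode call.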
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/models/llava.cpp
- Lines: 1-374
Signature
struct clip_graph_llava : public clip_graph {
    ggml_cgraph * build() override;
};
Import
#include "models.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | clip_model & | Yes | Loaded CLIP model with weights and hyperparameters |
| img | clip_image_f32 & | Yes | Preprocessed float image tensor |
| n_patches | int | Yes | Number of image patches after patch embedding |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | ggml_cgraph * | Computation graph producing embeddings for the language model |
Usage Examples
// The graph builder is instantiated internally by clip.cpp
// during clip_image_encode / clip_image_batch_encode
clip_graph_llava graph(ctx, img);
ggml_cgraph * gf = graph.build();
// gf is then evaluated by ggml backends to produce embeddings