Implementation:Ggml org Llama cpp CLIP Model
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Vision |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Defines the data structures that hold CLIP model hyperparameters, per-layer weights, and the complete set of model weight tensors used by the multimodal vision subsystem.
Description
The `clip_hparams` struct stores the encoder configuration: image and patch sizes, embedding dimensions, head and layer counts, normalization epsilon, RoPE theta, feature layers, window attention patterns, and audio preprocessing parameters. The `clip_layer` struct holds per-layer tensor pointers for the attention weights (Q/K/V/output), the FFN weights (up/down/gate), and the associated norms. The `clip_model` struct aggregates the full model: embeddings, position embeddings, all layers, projection matrices, and architecture-specific tensors for LLaVA, MiniCPM-V, Qwen2VL, InternVL, and others.
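To make the role of these fields concrete, the sketch below derives a few quantities that a ViT-style encoder typically needs from them. It is only an illustration: `vit_dims` is a hypothetical stand-in for a handful of `clip_hparams` fields, the helper names are invented here, and the code assumes a square image whose side is an exact multiple of the patch size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a few clip_hparams fields (not the real struct).
struct vit_dims {
    int32_t image_size; // side length of the square input image, in pixels
    int32_t patch_size; // side length of one square patch, in pixels
    int32_t n_embd;     // embedding width
    int32_t n_head;     // number of attention heads
};

// Number of patches along one side of the image.
// Assumption of this sketch: image_size is a multiple of patch_size.
inline int32_t patches_per_side(const vit_dims & d) {
    assert(d.patch_size > 0 && d.image_size % d.patch_size == 0);
    return d.image_size / d.patch_size;
}

// Total number of patch tokens, before any class token or patch merging.
inline int32_t n_patches(const vit_dims & d) {
    const int32_t side = patches_per_side(d);
    return side * side;
}

// Width of a single attention head.
inline int32_t head_dim(const vit_dims & d) {
    assert(d.n_head > 0 && d.n_embd % d.n_head == 0);
    return d.n_embd / d.n_head;
}
```

With the values used in the usage example on this page (image size 224, patch size 14, embedding width 768, 12 heads), these helpers give 16 patches per side, 256 patch tokens, and a head width of 64.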
Usage
Use this header when implementing model loading, graph construction, or inference for CLIP vision encoders. It defines the complete in-memory representation that all multimodal model operations depend on.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/mtmd/clip-model.h
- Lines: 1-390
Signature
enum ffn_op_type { FFN_GELU, FFN_GELU_ERF, FFN_SILU, FFN_GELU_QUICK, FFN_RELU_SQR };
enum norm_type { NORM_TYPE_NORMAL, NORM_TYPE_RMS };
enum patch_merge_type { PATCH_MERGE_FLAT, PATCH_MERGE_SPATIAL_UNPAD };
struct clip_hparams {
    int32_t image_size, patch_size, n_embd, n_ff, projection_dim;
    int32_t n_head, n_layer;
    float image_mean[3], image_std[3];
    ffn_op_type ffn_op;
    float eps, rope_theta;
    // ... audio and architecture-specific fields
};
struct clip_layer { /* per-layer attention and FFN tensor pointers */ };
struct mobilenetv5_block { /* mobile-optimized block tensors */ };
struct clip_model { /* full model with embeddings, layers, projections */ };
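The `ffn_op_type` values name the activation applied inside each feed-forward block. As a rough illustration of what each choice could mean numerically, the following scalar sketch maps each value to its usual textbook formula. The `ffn_op` enum and `apply_ffn_op` function are hypothetical and exist only for this example; it also assumes that `FFN_GELU` denotes the tanh approximation of GELU and `FFN_GELU_ERF` the exact erf form. The real code applies ggml tensor operations, not scalar functions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical mirror of the ffn_op_type enum listed above.
enum class ffn_op { gelu, gelu_erf, silu, gelu_quick, relu_sqr };

// Logistic sigmoid, shared by SiLU and quick GELU.
inline float sigmoid_f(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

// Scalar reference formulas for each activation (illustrative only).
inline float apply_ffn_op(ffn_op op, float x) {
    switch (op) {
        case ffn_op::gelu: {
            // tanh approximation of GELU (assumed meaning of FFN_GELU)
            const float c = 0.7978845608f; // sqrt(2 / pi)
            return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
        }
        case ffn_op::gelu_erf:
            // exact GELU: x * Phi(x), written with the error function
            return 0.5f * x * (1.0f + std::erf(x * 0.7071067812f));
        case ffn_op::silu:
            return x * sigmoid_f(x);
        case ffn_op::gelu_quick:
            // sigmoid-based GELU approximation
            return x * sigmoid_f(1.702f * x);
        case ffn_op::relu_sqr: {
            // squared ReLU: max(0, x)^2
            const float r = x > 0.0f ? x : 0.0f;
            return r * r;
        }
    }
    assert(false && "unknown ffn_op");
    return 0.0f;
}
```

Keeping the activation as an enum value in the hyperparameters lets one graph-building routine serve several architectures, with the choice made per model at load time.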
Import
#include "ggml.h"
#include "clip.h"
#include "clip-impl.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| GGUF metadata | key-value pairs | Yes | Model hyperparameters read from the GGUF file (image_size, patch_size, n_embd, etc.) |
| GGUF tensors | ggml_tensor* | Yes | Model weight tensors loaded from the GGUF file |
Outputs
| Name | Type | Description |
|---|---|---|
| clip_hparams | struct | Populated hyperparameter configuration for the vision encoder |
| clip_model | struct | Complete model with all layer weights and projection tensors loaded |
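The contract above can be pictured as a small loader that copies required metadata values into a hyperparameter struct and fails when one is absent. The sketch below is a self-contained stand-in, not the actual loading code: `kv_map` replaces real GGUF metadata access, `hparams_subset` replaces `clip_hparams`, and the key strings are placeholders taken from the table, not the real GGUF key names.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Stand-in for GGUF key-value metadata; a real loader would query the file.
using kv_map = std::map<std::string, int32_t>;

// Small subset of hyperparameters used in this sketch (not the real struct).
struct hparams_subset {
    int32_t image_size = 0;
    int32_t patch_size = 0;
    int32_t n_embd     = 0;
};

// Copy one required key into dst; return false if the key is absent.
inline bool read_required(const kv_map & kv, const std::string & key, int32_t & dst) {
    const auto it = kv.find(key);
    if (it == kv.end()) {
        return false;
    }
    dst = it->second;
    return true;
}

// Fill `out` from the metadata. Returns false as soon as a required key is
// missing, in which case `out` may be only partially filled.
inline bool load_hparams(const kv_map & kv, hparams_subset & out) {
    return read_required(kv, "image_size", out.image_size)
        && read_required(kv, "patch_size", out.patch_size)
        && read_required(kv, "n_embd",     out.n_embd);
}
```

Treating every listed key as required matches the "Required: Yes" column; a loader that supports optional keys would need a separate path that leaves the default value in place.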
Usage Examples
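#include "clip-model.h"
// Sketch: fill in a few hyperparameters by hand. In practice these values
// are read from GGUF metadata at load time.
static void example_hparams() {
    // Value-initialize so that fields not set below are not left indeterminate
    clip_hparams hparams{};
    hparams.image_size = 224;
    hparams.patch_size = 14;
    hparams.n_embd     = 768;
    hparams.n_head     = 12;
    hparams.n_layer    = 12;
    hparams.ffn_op     = FFN_GELU;
    // Check the window attention pattern
    if (hparams.n_wa_pattern > 0) {
        // Use windowed attention for the specified layers
    }
}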
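The `n_wa_pattern` check above only says that some layers use windowed attention. One plausible reading, sketched below, is that every `n_wa_pattern`-th layer uses full attention and the remaining layers use windowed attention. This interpretation is an assumption of the example, as is the helper name `full_attention_layers`; consult the graph-building code for the authoritative rule.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: marks which layers use full (global) attention.
// Assumption: when n_wa_pattern > 0, layers n_wa_pattern, 2 * n_wa_pattern, ...
// (counting from 1) use full attention and all other layers are windowed.
// When n_wa_pattern <= 0, windowing is disabled and every layer is full.
inline std::vector<bool> full_attention_layers(int n_layer, int n_wa_pattern) {
    const std::size_t n = n_layer > 0 ? static_cast<std::size_t>(n_layer) : 0;
    std::vector<bool> full(n, true);
    if (n_wa_pattern <= 0) {
        return full;
    }
    for (std::size_t il = 0; il < n; ++il) {
        full[il] = ((il + 1) % static_cast<std::size_t>(n_wa_pattern)) == 0;
    }
    return full;
}
```

Under this assumption, 8 layers with a pattern of 4 would run layers 4 and 8 (indices 3 and 7) with full attention and the other six with windowed attention.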