Implementation:Ggml org Llama cpp CLIP Model

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Multimodal, Vision
Last Updated	2026-02-15 00:00 GMT

Overview

Defines the data structures for CLIP model hyperparameters, per-layer weights, and the complete set of model weight tensors used by the multimodal vision subsystem.

Description

The `clip_hparams` struct stores the vision encoder configuration: image and patch sizes, embedding dimensions, head counts, layer counts, normalization epsilon, RoPE theta, feature layers, window attention patterns, and audio preprocessing parameters. The `clip_layer` struct holds per-layer tensor pointers for the attention projections (Q/K/V/output), the FFN weights (up/down/gate), and the associated normalization tensors. The `clip_model` struct aggregates the full model: embeddings, position embeddings, all layers, projection matrices, and architecture-specific tensors for LLaVA, MiniCPM-V, Qwen2VL, InternVL, and others.
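To make the role of these hyperparameters concrete, the following is a minimal sketch of two quantities commonly derived from them for a ViT-style encoder: the patch count and the per-head dimension. The `demo_hparams` struct and the `demo_*` helpers are hypothetical stand-ins written for this page, not part of `clip-model.h`, and the sketch assumes a square image that divides evenly into patches.

```cpp
#include <cstdint>

// Hypothetical stand-in for a few clip_hparams fields named in the text;
// the real struct in clip-model.h has many more members.
struct demo_hparams {
    int32_t image_size;
    int32_t patch_size;
    int32_t n_embd;
    int32_t n_head;
};

// Number of patches along one side of a square image.
// Assumes image_size is a multiple of patch_size.
inline int32_t demo_patches_per_side(const demo_hparams & hp) {
    return hp.image_size / hp.patch_size;
}

// Total number of patches (tokens before any class token or merging).
inline int32_t demo_n_patches(const demo_hparams & hp) {
    const int32_t side = demo_patches_per_side(hp);
    return side * side;
}

// Width of each attention head, assuming n_embd is split evenly across heads.
inline int32_t demo_head_dim(const demo_hparams & hp) {
    return hp.n_embd / hp.n_head;
}
```

With `image_size = 224`, `patch_size = 14`, `n_embd = 768`, and `n_head = 12`, the helpers give 16 patches per side, 256 patches in total, and a head dimension of 64.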

Usage

Use this header when implementing model loading, graph construction, or inference for CLIP vision encoders. It defines the complete in-memory representation that all multimodal model operations depend on.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: tools/mtmd/clip-model.h
Lines: 1-390

Signature

enum ffn_op_type { FFN_GELU, FFN_GELU_ERF, FFN_SILU, FFN_GELU_QUICK, FFN_RELU_SQR };
enum norm_type { NORM_TYPE_NORMAL, NORM_TYPE_RMS };
enum patch_merge_type { PATCH_MERGE_FLAT, PATCH_MERGE_SPATIAL_UNPAD };

struct clip_hparams {
    int32_t image_size, patch_size, n_embd, n_ff, projection_dim;
    int32_t n_head, n_layer;
    float image_mean[3], image_std[3];
    ffn_op_type ffn_op;
    float eps, rope_theta;
    // ... audio and architecture-specific fields
};

struct clip_layer { /* per-layer attention and FFN tensor pointers */ };
struct mobilenetv5_block { /* mobile-optimized block tensors */ };
struct clip_model { /* full model with embeddings, layers, projections */ };
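As an illustration of what the `ffn_op_type` values select, the sketch below applies each activation to a single scalar. It redeclares the enum so it compiles on its own, and it assumes the names carry their conventional meanings (tanh-approximated GELU, erf-based GELU, SiLU, sigmoid-approximated "quick" GELU, and squared ReLU). The `demo_apply_ffn_op` function is hypothetical; the actual graph code operates on tensors through ggml operators rather than on scalars.

```cpp
#include <cmath>

// Redeclared from the Signature section so the sketch is standalone.
enum ffn_op_type { FFN_GELU, FFN_GELU_ERF, FFN_SILU, FFN_GELU_QUICK, FFN_RELU_SQR };

// Hypothetical scalar reference for each activation (assumed semantics).
inline float demo_apply_ffn_op(ffn_op_type op, float x) {
    switch (op) {
        case FFN_GELU: {
            // tanh approximation of GELU
            const float k = 0.7978845608f; // sqrt(2/pi)
            return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
        }
        case FFN_GELU_ERF:
            // exact GELU via the error function
            return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
        case FFN_SILU:
            // x * sigmoid(x)
            return x / (1.0f + std::exp(-x));
        case FFN_GELU_QUICK:
            // x * sigmoid(1.702 * x)
            return x / (1.0f + std::exp(-1.702f * x));
        case FFN_RELU_SQR: {
            // max(x, 0)^2
            const float r = x > 0.0f ? x : 0.0f;
            return r * r;
        }
    }
    return x; // unreachable for valid enum values
}
```

Keeping the activation as an enum in the hyperparameters lets one graph-building routine serve several architectures that differ only in this choice.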

Import

#include "ggml.h"
#include "clip.h"
#include "clip-impl.h"

I/O Contract

Inputs

Name	Type	Required	Description
GGUF metadata	key-value pairs	Yes	Model hyperparameters read from GGUF file (image_size, patch_size, n_embd, etc.)
GGUF tensors	ggml_tensor*	Yes	Model weight tensors loaded from GGUF file

Outputs

Name	Type	Description
clip_hparams	struct	Populated hyperparameter configuration for the vision encoder
clip_model	struct	Complete model with all layer weights and projection tensors loaded
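The contract above can be sketched as a small loader that copies required key-value pairs into a hyperparameter struct and fails loudly when one is missing. Everything here is a hypothetical illustration: `demo_metadata` is a plain `std::map` standing in for GGUF metadata, the key strings simply reuse the field names from the table rather than the real GGUF key names, and `demo_kv_hparams` mirrors only three fields.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for GGUF key-value metadata.
using demo_metadata = std::map<std::string, int32_t>;

// Hypothetical subset of the hyperparameters listed in the table above.
struct demo_kv_hparams {
    int32_t image_size = 0;
    int32_t patch_size = 0;
    int32_t n_embd     = 0;
};

// Look up a required key, throwing if it is absent.
inline int32_t demo_require(const demo_metadata & kv, const std::string & key) {
    const auto it = kv.find(key);
    if (it == kv.end()) {
        throw std::runtime_error("missing required key: " + key);
    }
    return it->second;
}

// Populate the struct from metadata; every field is treated as required.
inline demo_kv_hparams demo_load_hparams(const demo_metadata & kv) {
    demo_kv_hparams hp;
    hp.image_size = demo_require(kv, "image_size");
    hp.patch_size = demo_require(kv, "patch_size");
    hp.n_embd     = demo_require(kv, "n_embd");
    return hp;
}
```

Treating the metadata as required mirrors the "Yes" in the Required column: a file that lacks a core hyperparameter is rejected rather than loaded with silent defaults.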

Usage Examples

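#include "clip-model.h"

// Set hyperparameters by hand for illustration; during model loading
// these values are read from GGUF metadata instead.
// (make_example_hparams is an example helper, not part of the header.)
static clip_hparams make_example_hparams() {
    clip_hparams hparams;
    hparams.image_size = 224;
    hparams.patch_size = 14;
    hparams.n_embd = 768;
    hparams.n_head = 12;
    hparams.n_layer = 12;
    hparams.ffn_op = FFN_GELU;
    return hparams;
}

// Check window attention pattern
// (configure_attention is likewise an example helper.)
static void configure_attention(const clip_hparams & hparams) {
    if (hparams.n_wa_pattern > 0) {
        // Use windowed attention for specified layers
    }
}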
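One way such a pattern value could pick out layers is sketched below. This is an assumption made for illustration, not a statement of how the library interprets `n_wa_pattern`: the sketch treats every `n_wa_pattern`-th layer (counting from one) as a full-attention layer and all other layers as windowed, with a non-positive value disabling windowing entirely. The function name `demo_layer_uses_window_attn` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical rule (assumed, not taken from clip-model.h):
//   n_wa_pattern <= 0  -> no layer uses windowed attention
//   otherwise          -> layer il (0-based) uses full attention when
//                         (il + 1) is a multiple of n_wa_pattern,
//                         and windowed attention in all other cases
inline bool demo_layer_uses_window_attn(int32_t n_wa_pattern, int32_t il) {
    if (n_wa_pattern <= 0) {
        return false;
    }
    return (il + 1) % n_wa_pattern != 0;
}
```

Under this assumed rule, a pattern of 4 makes layers 0, 1, and 2 windowed and layer 3 full, repeating every four layers.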

Related Pages

Principle:Ggml_org_Llama_cpp_Multimodal
