Implementation:Ollama Ollama Clip

Knowledge Sources	Ollama
Domains	Multimodal, CLIP
Last Updated	2025-02-15 00:00 GMT

Overview

Main implementation file for the CLIP multimodal encoder system, handling model loading from GGUF, image preprocessing, computation graph evaluation, and the public C API.

Description

At 3603 lines, this is the largest and most critical file in the multimodal subsystem. It covers four areas. First, the clip_model_loader reads hyperparameters and weight tensors from GGUF files and initializes the ggml backends. Second, the image preprocessing code handles resizing, normalization, padding, and tiling for UHD/high-resolution models (LLaVA-UHD style). Third, the graph evaluation pipeline dispatches to the appropriate clip_graph_* subclass, allocates tensors, sets input data, and runs inference via the ggml backends. Fourth, the file provides the complete public C API for creating and freeing contexts, preprocessing images, running encoding, and querying model properties.
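
To make the normalization step of preprocessing concrete, here is a minimal sketch of per-channel mean/std normalization on interleaved 8-bit RGB data. The helper name normalize_rgb_u8 and its parameters are hypothetical and do not appear in clip.cpp; the actual file reads the mean and std values from the model and may order the operations differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of clip.cpp): converts interleaved 8-bit RGB
// pixels to floats and applies (x / 255 - mean[c]) / stddev[c] per channel.
std::vector<float> normalize_rgb_u8(const std::vector<std::uint8_t> & rgb,
                                    const float mean[3],
                                    const float stddev[3]) {
    assert(rgb.size() % 3 == 0); // expects whole RGB triplets
    std::vector<float> out(rgb.size());
    for (std::size_t i = 0; i < rgb.size(); ++i) {
        const std::size_t c = i % 3; // channel index: 0 = R, 1 = G, 2 = B
        out[i] = (static_cast<float>(rgb[i]) / 255.0f - mean[c]) / stddev[c];
    }
    return out;
}
```

With a mean and std of 0.5 on every channel, this maps the byte range 0..255 onto roughly -1..1, which is a common input range for vision encoders.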

Usage

This is a core backend file that application code does not call directly. Instead, the mtmd library calls into these functions for model loading and image encoding.

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/clip.cpp
Lines: 1-3603

Signature

// Model loading
struct clip_init_result clip_init(const char * fname, struct clip_context_params ctx_params);
void clip_free(struct clip_ctx * ctx);

// Image preprocessing
bool clip_image_preprocess(struct clip_ctx * ctx, const struct clip_image_u8 * img,
                           struct clip_image_f32_batch * res_imgs);

// Encoding
bool clip_image_encode(struct clip_ctx * ctx, int n_threads,
                       struct clip_image_f32 * img, float * vec);
bool clip_image_batch_encode(struct clip_ctx * ctx, int n_threads,
                             const struct clip_image_f32_batch * imgs, float * vec);

// Model queries
int32_t clip_get_image_size(const struct clip_ctx * ctx);
int32_t clip_get_patch_size(const struct clip_ctx * ctx);
int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_mmproj_embd(const struct clip_ctx * ctx);
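
As a rough illustration of how the image size and patch size relate to the number of output tokens, the sketch below computes the patch count for a plain ViT that splits a square image into non-overlapping square patches. The function estimate_vit_patches is hypothetical. The value returned by clip_n_output_tokens depends on the model and may differ, for example when the projector merges or downsamples patches.

```cpp
#include <cassert>

// Hypothetical estimate, not the logic used by clip_n_output_tokens:
// a plain ViT cuts a square image into (image_size / patch_size)^2 patches.
int estimate_vit_patches(int image_size, int patch_size) {
    assert(patch_size > 0 && image_size % patch_size == 0);
    const int per_side = image_size / patch_size; // patches along one edge
    return per_side * per_side;
}
```

For example, a 224-pixel image with 14-pixel patches gives 16 patches per side.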

Import

#include "clip.h"
#include "clip-impl.h"
#include "clip-model.h"
#include "clip-graph.h"

I/O Contract

Inputs

Name	Type	Required	Description
fname	const char *	Yes	Path to GGUF model file for the CLIP encoder
ctx_params	clip_context_params	Yes	Configuration including GPU usage, flash attention, warmup
img	const clip_image_u8 *	Yes	Raw RGB image to preprocess and encode
n_threads	int	Yes	Number of CPU threads for inference

Outputs

Name	Type	Description
ctx	struct clip_ctx *	Initialized CLIP context with loaded model, returned inside clip_init_result
vec	float *	Output embedding vector from the encoder
res_imgs	clip_image_f32_batch *	Preprocessed float image(s) ready for encoding
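
The caller owns the vec buffer, so it has to be large enough before encoding. The sketch below assumes the encoder writes one embedding of n_embd floats per output token, laid out contiguously; that layout is an assumption here, and make_embedding_buffer is a hypothetical helper. In real code the two counts would come from clip_n_output_tokens and clip_n_mmproj_embd.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: allocates a zero-initialized buffer holding
// n_tokens embeddings of n_embd floats each (assumed contiguous layout).
std::vector<float> make_embedding_buffer(int n_tokens, int n_embd) {
    assert(n_tokens > 0 && n_embd > 0);
    const std::size_t n_floats =
        static_cast<std::size_t>(n_tokens) * static_cast<std::size_t>(n_embd);
    return std::vector<float>(n_floats);
}
```

Computing the product in std::size_t avoids int overflow for large token counts or embedding widths.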

Usage Examples

// Initialize CLIP from a GGUF file
clip_context_params params = {true, CLIP_FLASH_ATTN_TYPE_AUTO, -1, -1, true};
clip_init_result result = clip_init("mmproj.gguf", params);
clip_ctx * ctx = result.ctx_v;  // vision context
if (ctx == nullptr) {
    return 1;  // model failed to load
}

// Preprocess an image (rgb_data: caller-provided 224x224 RGB bytes)
clip_image_u8 * img_u8 = clip_image_u8_init();
clip_build_img_from_pixels(rgb_data, 224, 224, img_u8);
clip_image_f32_batch * batch = clip_image_f32_batch_init();
bool ok = clip_image_preprocess(ctx, img_u8, batch);

// Encode to embeddings (4 CPU threads); skipped if preprocessing failed
std::vector<float> embd(clip_embd_nbytes(ctx) / sizeof(float));
ok = ok && clip_image_batch_encode(ctx, 4, batch, embd.data());

// Release the images and the context
clip_image_f32_batch_free(batch);
clip_image_u8_free(img_u8);
clip_free(ctx);

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
