Implementation:Ollama Ollama Clip

From Leeroopedia
Knowledge Sources
Domains: Multimodal, CLIP
Last Updated: 2025-02-15 00:00 GMT

Overview

Main implementation file for the CLIP multimodal encoder system, handling model loading from GGUF, image preprocessing, computation graph evaluation, and the public C API.

Description

At 3603 lines, this is the largest file in the multimodal subsystem, and every image passes through it on the way to the language model. It has four responsibilities:

  • Model loading: clip_model_loader reads hyperparameters and weight tensors from GGUF files and initializes the ggml backends.
  • Image preprocessing: resizing, normalization, padding, and tiling for UHD/high-resolution models (LLaVA-UHD style).
  • Graph evaluation: the pipeline dispatches to the appropriate clip_graph_* subclass, allocates tensors, sets input data, and runs inference via the ggml backends.
  • Public C API: functions for creating and freeing contexts, preprocessing images, running encoding, and querying model properties.
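The normalization step mentioned above can be illustrated with a minimal, self-contained sketch. It is not the code from clip.cpp: the ImageU8 and ImageF32 types and the normalize_image function are hypothetical stand-ins, and the sketch assumes interleaved RGB input and a per-channel mean and standard deviation supplied by the caller.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the library's u8 and f32 image containers.
struct ImageU8  { int nx = 0; int ny = 0; std::vector<std::uint8_t> buf; };
struct ImageF32 { int nx = 0; int ny = 0; std::vector<float> buf; };

// Scale each channel to [0, 1], then subtract the mean and divide by the std.
ImageF32 normalize_image(const ImageU8 & src,
                         const std::array<float, 3> & mean,
                         const std::array<float, 3> & stddev) {
    // Expect interleaved RGB: 3 bytes per pixel.
    assert(src.buf.size() == static_cast<std::size_t>(src.nx) * src.ny * 3);
    ImageF32 dst;
    dst.nx = src.nx;
    dst.ny = src.ny;
    dst.buf.resize(src.buf.size());
    for (std::size_t i = 0; i < src.buf.size(); ++i) {
        const std::size_t c = i % 3; // channel index cycles 0, 1, 2
        dst.buf[i] = (static_cast<float>(src.buf[i]) / 255.0f - mean[c]) / stddev[c];
    }
    return dst;
}
```

With a mean and standard deviation of 0.5 on every channel, a byte value of 0 maps to -1.0 and a byte value of 255 maps to 1.0. The real values used by a given model are read from its GGUF metadata and are not assumed here.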

Usage

This is a core backend file and is not meant to be called directly by application code. The mtmd library calls into these functions for model loading and image encoding.

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/tools/mtmd/clip.cpp
  • Lines: 1-3603

Signature

// Model loading
struct clip_init_result clip_init(const char * fname, struct clip_context_params ctx_params);
void clip_free(struct clip_ctx * ctx);

// Image preprocessing
bool clip_image_preprocess(struct clip_ctx * ctx, const struct clip_image_u8 * img,
                           struct clip_image_f32_batch * res_imgs);

// Encoding
bool clip_image_encode(struct clip_ctx * ctx, int n_threads,
                       struct clip_image_f32 * img, float * vec);
bool clip_image_batch_encode(struct clip_ctx * ctx, int n_threads,
                             const struct clip_image_f32_batch * imgs, float * vec);

// Model queries
int32_t clip_get_image_size(const struct clip_ctx * ctx);
int32_t clip_get_patch_size(const struct clip_ctx * ctx);
int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_mmproj_embd(const struct clip_ctx * ctx);
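To show how the image size and patch size reported by the query functions relate to each other, here is a small sketch of the plain vision-transformer patch grid. The patches_per_image helper is hypothetical. The value returned by clip_n_output_tokens may differ, since it can depend on the projector type and on any patch merging the model applies.

```cpp
#include <cassert>
#include <cstdint>

// Number of non-overlapping square patches for a square input image.
// Assumes image_size is an exact multiple of patch_size.
std::int32_t patches_per_image(std::int32_t image_size, std::int32_t patch_size) {
    assert(patch_size > 0 && image_size % patch_size == 0);
    const std::int32_t per_side = image_size / patch_size;
    return per_side * per_side;
}
```

For example, a 224-pixel image with 14-pixel patches gives a 16 by 16 grid, so 256 patches.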

Import

#include "clip.h"
#include "clip-impl.h"
#include "clip-model.h"
#include "clip-graph.h"

I/O Contract

Inputs

Name | Type | Required | Description
fname | const char * | Yes | Path to GGUF model file for the CLIP encoder
ctx_params | clip_context_params | Yes | Configuration including GPU usage, flash attention, warmup
img | clip_image_u8 * | Yes | Raw RGB image to preprocess and encode
n_threads | int | Yes | Number of CPU threads for inference

Outputs

Name | Type | Description
ctx | struct clip_ctx * | Initialized CLIP context with loaded model, returned inside clip_init_result
vec | float * | Output embedding vector from the encoder
res_imgs | clip_image_f32_batch * | Preprocessed float image(s) ready for encoding
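The preprocessed output is a batch because a high-resolution input may be split into several tiles. As a rough illustration only, the sketch below counts tiles with a ceiling division on each axis. This is a simplification and not the LLaVA-UHD slicing strategy used in clip.cpp. TileGrid and tile_grid are hypothetical names.

```cpp
#include <cassert>

// Hypothetical helper type: tile counts along each axis.
struct TileGrid { int cols; int rows; };

// Ceil-divide each dimension by the tile size so every pixel falls in some tile.
TileGrid tile_grid(int width, int height, int tile_size) {
    assert(width > 0 && height > 0 && tile_size > 0);
    TileGrid g;
    g.cols = (width  + tile_size - 1) / tile_size;
    g.rows = (height + tile_size - 1) / tile_size;
    return g;
}
```

Under this scheme a 1000 by 500 image with 336-pixel tiles needs a 3 by 2 grid, with the last column and row only partly filled and typically padded.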

Usage Examples

// Initialize CLIP from a GGUF file
clip_context_params params = {true, CLIP_FLASH_ATTN_TYPE_AUTO, -1, -1, true};
clip_init_result result = clip_init("mmproj.gguf", params);
clip_ctx * ctx = result.ctx_v;  // check for nullptr before use

// Preprocess an image (rgb_data: caller-supplied 224x224 RGB buffer, 3 bytes per pixel)
clip_image_u8 * img_u8 = clip_image_u8_init();
clip_build_img_from_pixels(rgb_data, 224, 224, img_u8);
clip_image_f32_batch * batch = clip_image_f32_batch_init();
bool ok = clip_image_preprocess(ctx, img_u8, batch);

// Encode to embeddings; both calls return false on failure
std::vector<float> embd(clip_embd_nbytes(ctx) / sizeof(float));
ok = ok && clip_image_batch_encode(ctx, 4, batch, embd.data());  // 4 = n_threads

// Release everything allocated above
clip_image_f32_batch_free(batch);
clip_image_u8_free(img_u8);
clip_free(ctx);
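Because the C API pairs every init call with a matching free call, C++ callers may prefer to wrap the handles in std::unique_ptr with a custom deleter. The sketch below shows the pattern with hypothetical stand-ins (demo_ctx, demo_init, demo_free) so that it compiles without clip.h. In real code the deleter would forward to clip_free instead.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins so the sketch compiles without clip.h.
struct demo_ctx { int id; };
static int g_live_contexts = 0; // contexts created but not yet freed

demo_ctx * demo_init() { ++g_live_contexts; return new demo_ctx{1}; }
void demo_free(demo_ctx * ctx) { if (ctx) { --g_live_contexts; delete ctx; } }

// Deleter that forwards to the C-style free function.
struct demo_ctx_deleter {
    void operator()(demo_ctx * ctx) const { demo_free(ctx); }
};
using demo_ctx_ptr = std::unique_ptr<demo_ctx, demo_ctx_deleter>;

// Returns the number of live contexts observed while the owner is in scope.
int live_contexts_inside_scope() {
    demo_ctx_ptr ctx(demo_init());
    assert(ctx != nullptr);
    return g_live_contexts; // ctx is released after the return value is computed
}
```

The owner frees the context on every exit path, including early returns, which removes the need for a manual free call at each one.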
