Implementation:Ollama Ollama Clip
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, CLIP |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Main implementation file for the CLIP multimodal encoder system, handling model loading from GGUF, image preprocessing, computation graph evaluation, and the public C API.
Description
At 3603 lines, this is the largest and most critical file in the multimodal subsystem. It contains four pieces: clip_model_loader, which reads hyperparameters and weight tensors from GGUF files and initializes the ggml backends; image preprocessing (resizing, normalization, padding, and LLaVA-UHD-style tiling for high-resolution models); the graph evaluation pipeline, which dispatches to the appropriate clip_graph_* subclass, allocates tensors, sets input data, and runs inference on the ggml backends; and the public C API for creating and freeing contexts, preprocessing images, running encoding, and querying model properties.
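To make the normalization step of preprocessing concrete, here is a minimal sketch of per-channel normalization on interleaved 8-bit RGB data. The helper name normalize_rgb and its signature are hypothetical and do not appear in clip.cpp. The sketch assumes the common (x / 255 - mean) / std form, with the mean and standard deviation supplied by the caller; in the real file these would presumably come from the model's hyperparameters.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of clip.cpp.
// Converts interleaved 8-bit RGB pixels to floats and applies
// (x / 255 - mean[c]) / stddev[c] for each channel c.
std::vector<float> normalize_rgb(const std::vector<uint8_t> & rgb,
                                 const std::array<float, 3> & mean,
                                 const std::array<float, 3> & stddev) {
    assert(rgb.size() % 3 == 0); // expect whole RGB triples
    std::vector<float> out(rgb.size());
    for (size_t i = 0; i < rgb.size(); ++i) {
        const size_t c = i % 3; // channel index: 0 = R, 1 = G, 2 = B
        assert(stddev[c] > 0.0f);
        out[i] = (rgb[i] / 255.0f - mean[c]) / stddev[c];
    }
    return out;
}
```

With a mean and standard deviation of 0.5 on every channel, this maps the byte range 0..255 onto -1..1, which is a convenient way to sanity-check the arithmetic.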
Usage
This is a core backend file. Application code does not call it directly; the mtmd library calls these functions to load models and encode images.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/clip.cpp
- Lines: 1-3603
Signature
// Model loading
struct clip_init_result clip_init(const char * fname, struct clip_context_params ctx_params);
void clip_free(struct clip_ctx * ctx);
// Image preprocessing
bool clip_image_preprocess(struct clip_ctx * ctx, const struct clip_image_u8 * img,
                           struct clip_image_f32_batch * res_imgs);
// Encoding
bool clip_image_encode(struct clip_ctx * ctx, int n_threads,
                       struct clip_image_f32 * img, float * vec);
bool clip_image_batch_encode(struct clip_ctx * ctx, int n_threads,
                             const struct clip_image_f32_batch * imgs, float * vec);
// Model queries
int32_t clip_get_image_size(const struct clip_ctx * ctx);
int32_t clip_get_patch_size(const struct clip_ctx * ctx);
int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_mmproj_embd(const struct clip_ctx * ctx);
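The query functions return the quantities a caller needs to size an output buffer. As a rough illustration of how those numbers relate, the sketch below computes the patch count for a plain ViT layout and the number of floats needed for a set of embeddings. Both helper names are hypothetical and are not part of clip.cpp. The square-grid formula is an assumption that holds only for models without token merging or tiling; the value actually returned by clip_n_output_tokens depends on the model and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of clip.cpp.
// Number of patches when a square image is split into non-overlapping
// square patches (plain ViT layout, no token merging).
int32_t vit_patch_count(int32_t image_size, int32_t patch_size) {
    assert(patch_size > 0 && image_size % patch_size == 0);
    const int32_t per_side = image_size / patch_size;
    return per_side * per_side;
}

// Hypothetical helper, not part of clip.cpp.
// Number of floats needed to hold n_tokens embeddings of width n_embd.
size_t embedding_float_count(int32_t n_tokens, int32_t n_embd) {
    assert(n_tokens >= 0 && n_embd >= 0);
    return static_cast<size_t>(n_tokens) * static_cast<size_t>(n_embd);
}
```

For example, a 336-pixel image with 14-pixel patches gives 24 patches per side, so 576 patches in total under this assumption.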
Import
#include "clip.h"
#include "clip-impl.h"
#include "clip-model.h"
#include "clip-graph.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| fname | const char * | Yes | Path to GGUF model file for the CLIP encoder |
| ctx_params | clip_context_params | Yes | Configuration including GPU usage, flash attention, warmup |
| img | clip_image_u8 * | Yes | Raw RGB image to preprocess and encode |
| n_threads | int | Yes | Number of CPU threads for inference |
Outputs
| Name | Type | Description |
|---|---|---|
| clip_ctx | struct clip_ctx * | Initialized CLIP context with the loaded model, returned inside clip_init_result |
| vec | float * | Output embedding vector from the encoder |
| res_imgs | clip_image_f32_batch * | Preprocessed float image(s) ready for encoding |
Usage Examples
// Initialize CLIP from a GGUF file
clip_context_params params = {true, CLIP_FLASH_ATTN_TYPE_AUTO, -1, -1, true};
clip_init_result result = clip_init("mmproj.gguf", params);
clip_ctx * ctx = result.ctx_v;
// Check ctx for null before use; loading may fail
// Preprocess an image (rgb_data: caller-provided 224x224 RGB buffer, 3 bytes per pixel)
clip_image_u8 * img_u8 = clip_image_u8_init();
clip_build_img_from_pixels(rgb_data, 224, 224, img_u8);
clip_image_f32_batch * batch = clip_image_f32_batch_init();
bool ok = clip_image_preprocess(ctx, img_u8, batch);
// Encode to embeddings (only if preprocessing succeeded)
std::vector<float> embd(clip_embd_nbytes(ctx) / sizeof(float));
ok = ok && clip_image_batch_encode(ctx, 4, batch, embd.data());
// Release the image buffers, then the context
clip_image_f32_batch_free(batch);
clip_image_u8_free(img_u8);
clip_free(ctx);