Principle: WAInjectBench Image Embedding Initialization
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Representation_Learning |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A visual embedding model initialization step that loads a pre-trained CLIP model for encoding images into fixed-dimensional dense vectors.
Description
CLIP (Contrastive Language-Image Pre-training) models learn a shared embedding space for images and text through contrastive learning on internet-scale image-text pairs. The WAInjectBench project uses ViT-B-32 with LAION-2B pre-trained weights via the OpenCLIP library, producing 512-dimensional image embeddings. These embeddings serve as features for a downstream LogisticRegression classifier for image-based prompt injection detection.
The model initialization returns three components: the model itself, a training-time preprocessing transform (unused here), and the inference-time image preprocessing transform. (The text tokenizer is obtained separately in OpenCLIP and is not needed for image-only encoding.)
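A minimal sketch of this initialization step, assuming the OpenCLIP (`open_clip_torch`) API; the exact pretrained tag `laion2b_s34b_b79k` is an assumption (it is a common LAION-2B checkpoint name), and the helper names here are illustrative, not taken from WAInjectBench:

```python
EMBED_DIM = 512  # ViT-B-32 output width in CLIP's shared embedding space


def load_image_encoder(device: str = "cpu"):
    """Load a ViT-B-32 CLIP model with LAION-2B weights via OpenCLIP."""
    import open_clip  # assumed dependency: pip install open_clip_torch

    # Returns (model, train_preprocess, eval_preprocess); the training-time
    # transform is discarded, the eval transform is kept for inference.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k"
    )
    model.eval().to(device)
    return model, preprocess


def embed_image(model, preprocess, pil_image, device: str = "cpu"):
    """Encode one PIL image into a single 512-dim feature vector."""
    import torch

    x = preprocess(pil_image).unsqueeze(0).to(device)  # (1, 3, H, W)
    with torch.no_grad():
        feats = model.encode_image(x)                  # (1, 512)
    # L2-normalize, as is conventional for CLIP embeddings.
    return feats / feats.norm(dim=-1, keepdim=True)
```

Keeping the heavy imports inside the functions lets the module be imported without pulling in `torch`/`open_clip` until an encoder is actually needed.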
Usage
Use this when you need to convert images into dense vector representations for classification. This is the prerequisite for image feature extraction in the embedding-based image detector training pipeline.
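The downstream step can be sketched end to end with synthetic stand-ins for the CLIP features: random 512-dim vectors play the role of image embeddings, and a scikit-learn `LogisticRegression` is fit on top, as the Description indicates. The cluster means and sample counts below are illustrative, not from WAInjectBench:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
EMBED_DIM = 512  # ViT-B-32 CLIP embedding dimensionality

# Synthetic stand-ins for CLIP image embeddings: clean vs. injected images
# are modeled as two Gaussian clusters with slightly shifted means.
X_clean = rng.normal(0.0, 1.0, size=(100, EMBED_DIM))
X_injected = rng.normal(0.5, 1.0, size=(100, EMBED_DIM))
X = np.vstack([X_clean, X_injected])
y = np.array([0] * 100 + [1] * 100)  # 0 = clean, 1 = injected

# Linear classifier over the fixed embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_acc = clf.score(X, y)
```

In practice the feature matrix would come from batching images through the CLIP encoder; only the classifier is trained, the encoder stays frozen.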
Theoretical Basis
CLIP encodes images using a Vision Transformer (ViT):
The ViT-B-32 architecture splits images into 32x32 patches, processes them through 12 transformer layers at width 768, and linearly projects the resulting CLS token representation into the 512-dim shared image-text embedding space.
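The patch arithmetic above can be checked directly; the 224x224 input resolution is an assumption (it is CLIP's standard ViT-B-32 input size, but is not stated in the text):

```python
IMAGE_SIZE = 224  # assumed standard CLIP input resolution
PATCH_SIZE = 32   # the "-32" in ViT-B-32

patches_per_side = IMAGE_SIZE // PATCH_SIZE  # 224 / 32 = 7
num_patches = patches_per_side ** 2          # 7 * 7 = 49 patch tokens
seq_len = num_patches + 1                    # +1 for the CLS token
```

So the transformer processes a sequence of 50 tokens, and only the CLS token's projection is kept as the image embedding.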