Principle: WAInjectBench Image Embedding Initialization
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Representation_Learning |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A visual embedding model initialization step that loads a pre-trained CLIP model for encoding images into fixed-dimensional dense vectors.
Description
CLIP (Contrastive Language-Image Pre-training) models learn a shared embedding space for images and text through contrastive learning on internet-scale image-text pairs. The WAInjectBench project uses ViT-B-32 with LAION-2B pre-trained weights via the OpenCLIP library, producing 512-dimensional image embeddings. These embeddings serve as features for a downstream LogisticRegression classifier for image-based prompt injection detection.
The model initialization returns three components: the model itself, a training-time preprocessing transform (unused here), and the inference-time image preprocessing transform. (The text tokenizer is obtained separately in OpenCLIP and is not needed for image-only encoding.)
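A minimal sketch of this initialization step, assuming the OpenCLIP (`open_clip_torch`) API; the exact pretrained tag `laion2b_s34b_b79k` is an assumption (it is a common LAION-2B checkpoint name), and the helper names here are illustrative, not taken from WAInjectBench:

```python
EMBED_DIM = 512  # ViT-B-32 output width in CLIP's shared embedding space


def load_image_encoder(device: str = "cpu"):
    """Load a ViT-B-32 CLIP model with LAION-2B weights via OpenCLIP."""
    import open_clip  # assumed dependency: pip install open_clip_torch

    # Returns (model, train_preprocess, eval_preprocess); the training-time
    # transform is discarded, the eval transform is kept for inference.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k"
    )
    model.eval().to(device)
    return model, preprocess


def embed_image(model, preprocess, pil_image, device: str = "cpu"):
    """Encode one PIL image into a single 512-dim feature vector."""
    import torch

    x = preprocess(pil_image).unsqueeze(0).to(device)  # (1, 3, H, W)
    with torch.no_grad():
        feats = model.encode_image(x)                  # (1, 512)
    # L2-normalize, as is conventional for CLIP embeddings.
    return feats / feats.norm(dim=-1, keepdim=True)
```

Keeping the heavy imports inside the functions lets the module be imported without pulling in `torch`/`open_clip` until an encoder is actually needed.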
Usage
Use this when you need to convert images into dense vector representations for classification. This is the prerequisite for image feature extraction in the embedding-based image detector training pipeline.
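The downstream step can be sketched end to end with synthetic stand-ins for the CLIP features: random 512-dim vectors play the role of image embeddings, and a scikit-learn `LogisticRegression` is fit on top, as the Description indicates. The cluster means and sample counts below are illustrative, not from WAInjectBench:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
EMBED_DIM = 512  # ViT-B-32 CLIP embedding dimensionality

# Synthetic stand-ins for CLIP image embeddings: clean vs. injected images
# are modeled as two Gaussian clusters with slightly shifted means.
X_clean = rng.normal(0.0, 1.0, size=(100, EMBED_DIM))
X_injected = rng.normal(0.5, 1.0, size=(100, EMBED_DIM))
X = np.vstack([X_clean, X_injected])
y = np.array([0] * 100 + [1] * 100)  # 0 = clean, 1 = injected

# Linear classifier over the fixed embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_acc = clf.score(X, y)
```

In practice the feature matrix would come from batching images through the CLIP encoder; only the classifier is trained, the encoder stays frozen.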
Theoretical Basis
CLIP encodes images using a Vision Transformer (ViT):
The ViT-B-32 architecture splits images into 32x32 patches, processes them through 12 transformer layers at width 768, and linearly projects the resulting CLS token representation into the 512-dim shared image-text embedding space.
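The patch arithmetic above can be checked directly; the 224x224 input resolution is an assumption (it is CLIP's standard ViT-B-32 input size, but is not stated in the text):

```python
IMAGE_SIZE = 224  # assumed standard CLIP input resolution
PATCH_SIZE = 32   # the "-32" in ViT-B-32

patches_per_side = IMAGE_SIZE // PATCH_SIZE  # 224 / 32 = 7
num_patches = patches_per_side ** 2          # 7 * 7 = 49 patch tokens
seq_len = num_patches + 1                    # +1 for the CLS token
```

So the transformer processes a sequence of 50 tokens, and only the CLS token's projection is kept as the image embedding.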