Heuristic: WAInjectBench L2 Normalize CLIP Embeddings

From Leeroopedia
Knowledge Sources
Domains Optimization, Computer_Vision, Classification
Last Updated 2026-02-14 16:00 GMT

Overview

L2-normalize CLIP image embeddings before feeding them to the LogisticRegression classifier, ensuring consistent magnitude and improving classification quality.

Description

After extracting image embeddings with the OpenCLIP ViT-B-32 model, the training pipeline normalizes each embedding vector to unit length using L2 normalization (`emb / emb.norm(dim=-1, keepdim=True)`). This ensures all embeddings lie on the unit hypersphere, removing magnitude variance and focusing the classifier on directional (angular) differences between benign and malicious image embeddings. This is a standard practice in CLIP-based pipelines, as CLIP was trained with a cosine similarity objective.
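The normalization step can be sketched in isolation. This is a minimal NumPy sketch, not the repository's code: a random batch stands in for the output of `model.encode_image`, and `np.linalg.norm` plays the role of the PyTorch expression `emb / emb.norm(dim=-1, keepdim=True)` quoted above.

```python
import numpy as np

# Stand-in for a batch of CLIP image embeddings; the real ones come from
# model.encode_image, and ViT-B-32 produces 512-dimensional vectors.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 512))

# L2 normalization: divide each row by its Euclidean norm. This is the
# NumPy equivalent of `emb / emb.norm(dim=-1, keepdim=True)` in PyTorch.
norms = np.linalg.norm(emb, axis=-1, keepdims=True)
unit = emb / norms

# Every normalized vector now has length 1 (lies on the unit hypersphere).
print(np.allclose(np.linalg.norm(unit, axis=-1), 1.0))  # True
```

After this step, only the direction of each embedding carries information; all magnitude variance across images is removed.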

Usage

Use this heuristic whenever extracting CLIP embeddings for downstream classification. Without normalization, embedding magnitudes can vary across images, causing the linear classifier to be influenced by irrelevant scale differences.

The Insight (Rule of Thumb)

  • Action: Apply `emb = emb / emb.norm(dim=-1, keepdim=True)` after `model.encode_image()`.
  • Value: Produces 512-dimensional unit vectors for ViT-B-32.
  • Trade-off: Minimal. Normalization is essentially free and generally improves downstream classification with linear models; the magnitude information it discards is typically noise for CLIP embeddings.
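The effect on a linear classifier can be illustrated end to end. The sketch below uses synthetic data as a stand-in for CLIP embeddings (the class signal lives in the direction, while a random per-sample scale mimics irrelevant magnitude variance) and fits scikit-learn's LogisticRegression on the normalized vectors, as the training pipeline does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for CLIP embeddings: two classes separated by
# direction, multiplied by random magnitudes that carry no class signal.
n, d = 200, 64
directions = rng.normal(size=(n, d))
labels = (rng.random(n) > 0.5).astype(int)
directions[labels == 1] += 2.0            # class 1 points in a distinct direction
scales = rng.uniform(0.1, 10.0, size=(n, 1))  # irrelevant scale "noise"
emb = directions * scales

# L2-normalize before fitting, mirroring the training pipeline.
unit = emb / np.linalg.norm(emb, axis=-1, keepdims=True)
clf = LogisticRegression(max_iter=1000).fit(unit, labels)
print(clf.score(unit, labels))  # high accuracy: scale noise is gone
```

Because normalization removes the scale factor entirely, the classifier sees only the directional separation between the two classes.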

Reasoning

CLIP models are trained with a contrastive loss that optimizes cosine similarity between matching text-image pairs. The learned representation is therefore directional: the angle between embeddings carries semantic meaning, while the magnitude does not. Using unnormalized embeddings with a linear classifier (LogisticRegression) would allow the model to exploit magnitude differences that are artifacts of the model's internal scaling rather than meaningful features.
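A small numeric example makes the directional point concrete: two vectors with the same direction but different magnitudes have cosine similarity 1, yet their raw dot products (what an unnormalized linear model effectively consumes) differ wildly.

```python
import numpy as np

# Two embeddings with identical direction but different magnitudes:
# semantically identical under a cosine-similarity objective like CLIP's.
a = np.array([3.0, 4.0])
b = np.array([30.0, 40.0])

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)    # 1.0   -- same direction, maximal cosine similarity
print(a @ b)  # 250.0 -- raw dot product inflated by magnitude
```

After L2 normalization both vectors map to the same unit vector [0.6, 0.8], so a linear classifier cannot latch onto the magnitude difference.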

Code Evidence

L2 normalization during embedding extraction from `train/embedding-i.py:34-35`:

with torch.no_grad():
    emb = model.encode_image(image)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize

Note that the inference-time detector (`detector_image/embedding-i.py:56-58`) does not explicitly normalize; it relies on the classifier having been trained on normalized embeddings and on the CLIP model producing embeddings of similar magnitude at inference:

with torch.no_grad():
    emb = CLIP_MODEL.encode_image(image)
emb = emb.cpu().numpy().flatten()
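One way to remove this train/inference asymmetry is to normalize the flattened embedding before handing it to the classifier. The helper below is a hypothetical addition, not part of the repository; it operates on the NumPy array produced by the detector's `emb.cpu().numpy().flatten()` step.

```python
import numpy as np

def normalize_embedding(emb: np.ndarray) -> np.ndarray:
    """Hypothetical helper: L2-normalize a flattened embedding so inference
    inputs match the unit vectors the classifier was trained on."""
    return emb / np.linalg.norm(emb)

# Stand-in for the detector's flattened CLIP embedding.
emb = np.array([1.0, 2.0, 2.0])
unit = normalize_embedding(emb)
print(unit)  # [0.333... 0.666... 0.666...] -- unit length
```

With this in place, the classifier receives unit vectors in both training and inference, rather than depending on CLIP's output magnitudes staying stable.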
