Heuristic: InjectGuard Embedding Normalization Cosine Equivalence
| Knowledge Sources | |
|---|---|
| Domains | NLP, Embeddings, Optimization |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Enabling L2 normalization on embeddings (normalize_embeddings=True) converts FAISS L2 distance computations into cosine similarity equivalents, simplifying threshold interpretation.
Description
The InjectGuard system enables L2 normalization on the HuggingFaceEmbeddings encoder via encode_kwargs={'normalize_embeddings': True}. This ensures all output vectors have unit L2 norm (||v||₂ = 1). When all vectors are unit-normalized, the squared L2 distance between two vectors is directly related to their cosine similarity:
||a - b||₂² = 2(1 - cos(a, b))
This means FAISS IndexFlatL2 (the default) effectively performs cosine similarity ranking without needing to switch to an inner-product index. Note that FAISS IndexFlatL2 reports squared L2 distances, so the scores returned by similarity_search_with_score are bounded between 0 (identical) and 4 (opposite), with a typical useful range of 0-2 (corresponding to non-negative cosine similarity).
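The identity above is easy to check numerically. A minimal sketch using numpy only; the random vectors stand in for embedding output, and the unit normalization mimics what normalize_embeddings=True produces:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random vectors, L2-normalized as normalize_embeddings=True would do.
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

sq_l2 = np.sum((a - b) ** 2)   # what a squared-L2 index would report
cos = float(a @ b)             # cosine similarity of unit vectors

# ||a - b||_2^2 == 2 * (1 - cos(a, b)) for unit vectors
assert np.isclose(sq_l2, 2 * (1 - cos))

# Boundary cases: 0 (identical) up to 4 (exactly opposite)
assert np.isclose(np.sum((a - a) ** 2), 0.0)
assert np.isclose(np.sum((a - (-a)) ** 2), 4.0)
```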
Usage
Use this heuristic when interpreting sim_score values returned by the detection system, or when tuning the sim_k threshold. Understanding that the L2 distances are cosine-equivalent makes threshold values more intuitive. It also means that switching FAISS to IndexFlatIP (inner product) would give equivalent ranking, if needed for compatibility with other systems.
The Insight (Rule of Thumb)
- Action: Always set normalize_embeddings=True in encode_kwargs when using L2-based FAISS indices for semantic similarity.
- Value: Squared distances become bounded and interpretable: 0 = identical, ~2 = unrelated (orthogonal), 4 = maximally dissimilar.
- Trade-off: Normalization adds a trivial compute cost (vector division) but eliminates the need for a separate cosine similarity index. Magnitude information in the embeddings is discarded.
- Compatibility: This approach is compatible with any FAISS index type. For IndexFlatL2 (the default in LangChain), it is the recommended configuration.
Reasoning
Without normalization, L2 distances are unbounded and hard to interpret — two semantically similar texts could have very different L2 distances depending on their embedding magnitudes. Normalization eliminates this confound.
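To see the confound concretely, compare two vector pairs with identical cosine similarity but different magnitudes (illustrative numbers, not real embeddings):

```python
import numpy as np

# The same direction pair at two scales: cosine similarity is identical...
a1, b1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
a2, b2 = 10 * a1, 10 * b1

cos1 = a1 @ b1 / (np.linalg.norm(a1) * np.linalg.norm(b1))
cos2 = a2 @ b2 / (np.linalg.norm(a2) * np.linalg.norm(b2))
assert np.isclose(cos1, cos2)    # both 0.0 (orthogonal)

# ...but the L2 distances differ by the scale factor.
d1 = np.linalg.norm(a1 - b1)     # sqrt(2)
d2 = np.linalg.norm(a2 - b2)     # 10 * sqrt(2)
assert np.isclose(d2, 10 * d1)
```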
Code evidence from vertor_similarity_detection.py:10-12:
```python
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cuda:2'},
    encode_kwargs={'normalize_embeddings': True},
)
```
The normalize_embeddings=True flag tells sentence-transformers to L2-normalize every output vector. Combined with FAISS IndexFlatL2 (which reports squared L2 distances), the sim_score in sim_search is a squared L2 distance between unit vectors and can be directly converted to cosine similarity: cosine_sim = 1 - (sim_score / 2).
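Assuming FAISS IndexFlatL2 scores are squared L2 distances on unit-normalized embeddings (as FAISS documents for its L2 metric), the conversion can be wrapped in a small helper; sim_score_to_cosine is a name invented here for illustration, not part of InjectGuard:

```python
def sim_score_to_cosine(sim_score: float) -> float:
    """Convert a squared L2 distance between unit vectors, as reported by
    FAISS IndexFlatL2 (range [0, 4]), to cosine similarity (range [-1, 1])."""
    return 1.0 - sim_score / 2.0

# Spot checks at the boundary values:
assert sim_score_to_cosine(0.0) == 1.0    # identical vectors
assert sim_score_to_cosine(2.0) == 0.0    # orthogonal (unrelated)
assert sim_score_to_cosine(4.0) == -1.0   # exactly opposite
```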