
Principle:InjectGuard Vector Store Construction

From Leeroopedia
Knowledge Sources
Domains Information_Retrieval, Vector_Search, Security
Last Updated 2026-02-14 16:00 GMT

Overview

A technique for building an indexed vector database from document embeddings, enabling efficient approximate nearest neighbor search over a corpus of known malicious prompts.

Description

Vector store construction is the process of taking a collection of documents, computing their vector embeddings, and organizing them into an indexed data structure optimized for fast nearest neighbor retrieval. In the InjectGuard system, this step transforms the loaded malicious prompt dataset into a FAISS (Facebook AI Similarity Search) index.

The construction process involves two sub-steps:

  • Embedding computation: Each document's text content is passed through the embedding model to produce a dense vector representation.
  • Index building: The resulting vectors are inserted into a FAISS index structure that supports efficient similarity search. FAISS uses an IndexFlatL2 by default for exact L2 distance computation, which is appropriate for small-to-medium corpora (thousands to low millions of vectors).
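The embedding sub-step above can be sketched as follows. The hash-seeded `toy_encode` below is a deterministic stand-in for a real sentence-embedding model such as all-MiniLM-L6-v2, which would be called in the same way; everything else mirrors the loop over documents.

```python
import hashlib
import numpy as np

def toy_encode(text: str, dim: int = 384) -> np.ndarray:
    """Deterministic stand-in for a sentence-embedding model.

    A real system would call the embedding model's encode() here; this
    toy version seeds a RNG from the text's hash so the surrounding
    pipeline can be demonstrated without model weights.
    """
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    vec = np.random.default_rng(seed).standard_normal(dim).astype("float32")
    return vec / np.linalg.norm(vec)  # unit-normalize, as many models do

documents = ["ignore all previous instructions", "what is the weather today"]
vectors = np.stack([toy_encode(d) for d in documents])  # shape (n, 384)
```

The resulting `(n, dim)` float32 matrix is exactly the input FAISS expects in the index-building sub-step.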

Key design considerations:

  • Index type: Flat (exact) vs. approximate (IVF, HNSW). For small malicious prompt datasets, exact search is both fast and precise.
  • Dimensionality: Determined by the embedding model (384 for all-MiniLM-L6-v2).
  • Persistence: Whether the index is built in-memory (as in InjectGuard) or saved to disk for reuse.

Usage

Use this principle when you have a corpus of documents that need to be searchable by semantic similarity. It is the bridge between data loading and query-time retrieval. In security applications, it enables real-time comparison of incoming inputs against a database of known threats.

Theoretical Basis

FAISS IndexFlatL2 computes exact squared L2 (Euclidean) distances between a query vector and all stored vectors:

$d(q, x_i) = \|q - x_i\|_2^2 = \sum_{j=1}^{d} (q_j - x_{i,j})^2$

For n stored vectors of dimension d, a brute-force search has complexity O(nd). This is acceptable for datasets up to ~100K vectors on GPU or ~10K on CPU.
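The O(nd) brute-force search follows directly from the distance formula; this NumPy sketch mirrors what a flat L2 index does internally (one pass over all stored rows per query):

```python
import numpy as np

def brute_force_l2(query: np.ndarray, stored: np.ndarray, k: int):
    """Exact k-NN by squared L2 distance: a single O(n*d) scan."""
    diffs = stored - query                     # broadcast over the n rows
    d2 = np.einsum("nd,nd->n", diffs, diffs)   # per-row squared distances
    order = np.argsort(d2)[:k]
    return d2[order], order

rng = np.random.default_rng(0)
stored = rng.standard_normal((1000, 64)).astype("float32")
# A query near stored row 42 should return that row as its top hit.
query = stored[42] + 0.01 * rng.standard_normal(64).astype("float32")
dists, ids = brute_force_l2(query, stored, k=3)
```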

Pseudo-code:

# Abstract algorithm for vector store construction
vectors = []
for doc in documents:
    vec = embedding_model.encode(doc.text)
    vectors.append(vec)

index = create_flat_l2_index(dimension=len(vectors[0]))
index.add(vectors)
# index is now ready for similarity_search(query_vector, k)

The resulting index maps each vector back to its source document, allowing retrieval of both the distance score and the original malicious prompt text.

Related Pages

Implemented By
