Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Workflow:Avdvg InjectGuard Vector Similarity Detection Pipeline

From Leeroopedia
Knowledge Sources
Domains AI_Safety, Prompt_Injection_Detection, Vector_Search
Last Updated 2026-02-14 16:00 GMT

Overview

End-to-end process for detecting prompt injection attacks against Large Language Models using vector similarity search over a known-malicious prompt database.

Description

This workflow implements a prompt injection detection system that compares incoming user prompts against a curated database of known malicious prompts (jailbreak attacks, hijacking attempts, prompt leakage). It uses sentence-transformer embeddings to convert prompts into vectors, indexes them with FAISS for fast nearest-neighbor search, and flags inputs whose similarity to known attacks exceeds a configurable threshold. The pipeline also includes an evaluation harness that measures detection accuracy, precision, recall, and F1 score against labeled test data.

Key characteristics:

  • Uses the all-MiniLM-L6-v2 sentence-transformer model for embedding generation
  • Employs FAISS vector store for efficient similarity search
  • Configurable similarity threshold (sim_k) to balance precision vs. recall
  • Outputs per-sample detection results and aggregate metrics

Usage

Execute this workflow when you need to screen user-submitted prompts for potential injection attacks before passing them to a Large Language Model. This is applicable when you have a curated CSV dataset of known malicious prompts and want a lightweight, embedding-based detection layer that does not require fine-tuning a classification model. The approach is most suitable for environments where known attack patterns are well-represented in the malicious dataset and fast inference is preferred over model-based classification.

Execution Steps

Step 1: Embedding Model Initialization

Initialize the sentence-transformer embedding model that will convert text prompts into dense vector representations. The model is loaded with GPU acceleration and configured to produce normalized embeddings, which ensures that cosine similarity can be computed directly from L2 distance in the resulting vector space.

Key considerations:

  • The default model is all-MiniLM-L6-v2, a lightweight but effective sentence encoder
  • Embeddings are normalized at encoding time so similarity scores are on a consistent scale
  • GPU device assignment must match available hardware

Step 2: Malicious Dataset Loading

Load the curated dataset of known malicious prompts from a CSV file. Each row contains an identifier and the text of a known attack prompt (e.g., jailbreak attempts, prompt hijacking phrases). The CSV loader parses each row into a document object suitable for vector indexing.

Key considerations:

  • CSV format requires columns: id, text
  • The dataset should be comprehensive and regularly updated with newly discovered attack patterns
  • Each document becomes a searchable entry in the vector store

Step 3: Vector Store Construction

Build a FAISS vector index from the loaded malicious documents. Each document is embedded using the initialized sentence-transformer model and inserted into a FAISS index that supports efficient approximate nearest-neighbor search. This index serves as the reference database against which incoming prompts are compared.

Key considerations:

  • FAISS provides sub-linear search time for large document collections
  • The index is built in-memory and can be persisted to disk for reuse
  • Index construction time scales with the number of malicious samples and embedding dimensionality

Step 4: Similarity Search Detection

For each incoming prompt, perform a nearest-neighbor search against the malicious vector store. The search returns the closest matching malicious prompt and its similarity score. If the score falls below the configured threshold (sim_k), the input is flagged as a potential injection attack.

What happens:

  • Input text is embedded using the same sentence-transformer model
  • FAISS retrieves the single nearest neighbor (k=1) and its distance score
  • A lower distance score indicates higher similarity to a known attack
  • Detection decision: flag as malicious (1) if score < sim_k, otherwise benign (0)
  • Result includes the detection label, similarity score, and the matched source document

Step 5: Evaluation and Metrics

Run the detection pipeline over a labeled test dataset to measure performance. Each test sample includes a ground-truth label indicating whether it is a genuine attack or benign input. The pipeline compares predicted labels against ground truth and computes standard classification metrics.

Metrics computed:

  • Accuracy: Overall correctness of predictions
  • Precision: Proportion of flagged prompts that are truly malicious
  • Recall: Proportion of actual attacks that are correctly detected
  • F1 Score: Harmonic mean of precision and recall
  • All results are logged for analysis and threshold tuning

Execution Diagram

GitHub URL

Workflow Repository