Workflow:WAInjectBench Image Prompt Injection Detection

From Leeroopedia
Knowledge Sources
Domains Prompt_Injection, Security, Computer_Vision, Benchmarking
Last Updated 2026-02-14 16:00 GMT

Overview

End-to-end process for evaluating image-based prompt injection detectors against the WAInjectBench benchmark dataset.

Description

This workflow runs a selected image-based prompt injection detector against a structured dataset of benign and malicious image samples. The benchmark supports six detector strategies: a CLIP embedding-based binary classifier, a GPT-4o vision zero-shot classifier, a mutation-testing approach (JailGuard), a LLaVA vision-language model in both zero-shot prompt and fine-tuned modes, and a union ensemble of all detectors. Each detector is loaded dynamically via a plugin architecture. Unlike the text pipeline, which processes JSONL files, the image pipeline processes image directories in which each subfolder contains image files. Results are expressed as true positive rate (TPR) and false positive rate (FPR) and serialized to JSONL.

Usage

Execute this workflow when you need to benchmark an image-based prompt injection detector on the WAInjectBench dataset. You have a directory containing benign and malicious image subfolders organized by attack type, and you want to measure how well a specific detector distinguishes images containing injected prompts from benign images.

Execution Steps

Step 1: Environment Setup

Prepare the runtime environment by installing dependencies from the provided Conda specification. Ensure GPU availability and configure CUDA_VISIBLE_DEVICES. For GPT-4o-based detection, set the OPENAI_API_KEY environment variable. For JailGuard, clone the JailGuard repository and configure MiniGPT-4. For the fine-tuned LLaVA detector, download the fine-tuned checkpoint and set its path in the detector configuration.

Key considerations:

  • GPT-4o vision detector requires a valid OpenAI API key
  • JailGuard depends on MiniGPT-4 and spaCy
  • LLaVA fine-tuned mode requires a pre-trained LoRA checkpoint
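The prerequisite checks above can be sketched as a small preflight helper. This is a minimal illustration, not part of the framework: the function name and the exact environment-variable requirements per detector are assumptions based on the workflow description.

```python
import os

def check_environment(detector: str) -> list[str]:
    """Return a list of missing prerequisites for the chosen detector.

    Detector names follow the workflow text; which variables each
    detector needs is an assumption, not verified against the repo.
    """
    missing = []
    # GPT-4o vision detection calls the OpenAI API, so a key must be set
    if detector == "gpt-4o-prompt" and not os.environ.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY environment variable")
    # Local vision models need a GPU selection via CUDA_VISIBLE_DEVICES
    if detector in ("llava-1.5-7b-prompt", "llava-1.5-7b-ft", "jailguard"):
        if not os.environ.get("CUDA_VISIBLE_DEVICES"):
            missing.append("CUDA_VISIBLE_DEVICES (GPU selection)")
    return missing
```

Running such a check before Step 3 fails fast instead of partway through an evaluation run.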

Step 2: Dataset Preparation

Organize image data into the expected directory structure. The data directory must contain two subdirectories: benign/ and malicious/, each containing named subfolders. Each subfolder holds individual image files (not JSONL). The benchmark covers 2 benign categories and 7 malicious attack types.

Expected structure:

  • data/image/benign/{category}/ — image files with benign web page screenshots
  • data/image/malicious/{attack_type}/ — image files containing embedded prompt injections
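A quick way to verify the layout before running a detector is to walk the two required splits and list their subfolders. The directory names follow the structure above; the helper name is hypothetical.

```python
from pathlib import Path

def validate_image_dataset(root: str) -> dict[str, list[str]]:
    """Check the {root}/benign/ and {root}/malicious/ layout and return
    the subfolder (category / attack type) names found under each split."""
    layout = {}
    for split in ("benign", "malicious"):
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing required directory: {split_dir}")
        # Each subfolder holds the image files for one category or attack type
        layout[split] = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    return layout
```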

Step 3: Detector Selection and Loading

Choose a detector from the available options and invoke the evaluation framework via CLI. The framework dynamically imports the corresponding detector module from the detector_image/ package. LLaVA detectors have special routing: both the zero-shot prompt variant and the fine-tuned variant are handled by the same llava module, distinguished by the detector name passed at runtime.

Available detectors:

  • gpt-4o-prompt — Zero-shot visual classification via OpenAI GPT-4o vision API
  • llava-1.5-7b-prompt — LLaVA zero-shot prompt-based classification
  • llava-1.5-7b-ft — LLaVA fine-tuned binary classifier using LoRA checkpoint
  • jailguard — Mutation testing with MiniGPT-4 divergence analysis
  • embedding-i — CLIP image embedding + LogisticRegression classifier
  • ensemble — Union aggregation of all individual detector results
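The plugin-style dynamic loading described above might look like the following sketch. The mapping of detector names to module paths inside detector_image/ is an assumption; only the routing of both LLaVA variants to one shared module is taken from the workflow text.

```python
import importlib

# Map CLI detector names to modules in the detector_image/ package.
# Both LLaVA variants route to the same module; the detector name passed
# at runtime selects prompt vs. fine-tuned mode. Module names here are
# assumptions, not the repository's actual layout.
DETECTOR_MODULES = {
    "gpt-4o-prompt": "detector_image.gpt4o",
    "llava-1.5-7b-prompt": "detector_image.llava",
    "llava-1.5-7b-ft": "detector_image.llava",
    "jailguard": "detector_image.jailguard",
    "embedding-i": "detector_image.embedding",
    "ensemble": "detector_image.ensemble",
}

def load_detector(name: str):
    """Dynamically import the detector module for the given CLI name."""
    try:
        module_path = DETECTOR_MODULES[name]
    except KeyError:
        raise ValueError(f"unknown detector: {name!r}")
    return importlib.import_module(module_path)
```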

Step 4: Per-folder Detection Execution

The framework iterates over all subfolders within benign and malicious directories. For each subfolder, it invokes the detector's detect() function, which processes all images in that folder and returns file IDs of samples classified as injections. The total sample count is determined by counting all files in the subfolder.

What happens:

  • Each image subfolder is passed to the detector as a directory path
  • The detector returns a list of integer file IDs for flagged images
  • For LLaVA variants, the detector name is passed as an additional parameter to select prompt vs. fine-tuned mode
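The per-folder loop above can be sketched as follows. The detect() signature, the record keys, and the llava-name check are assumptions inferred from the workflow description, not the framework's actual API.

```python
from pathlib import Path

def evaluate_folder(detector, folder: str, detector_name: str) -> dict:
    """Run one detector over a single image subfolder.

    detect() is assumed to return integer file IDs of flagged images;
    the denominator is the file count in the folder.
    """
    path = Path(folder)
    total = sum(1 for p in path.iterdir() if p.is_file())
    # LLaVA variants take the detector name as an extra argument to
    # choose between zero-shot prompt and fine-tuned mode
    if detector_name.startswith("llava"):
        detected = detector.detect(str(path), detector_name)
    else:
        detected = detector.detect(str(path))
    return {"folder": path.name, "detected": sorted(detected), "total": total}
```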

Step 5: Metric Computation

For each subfolder, the framework computes either TPR or FPR depending on whether it came from the malicious or benign parent directory. TPR measures the proportion of malicious images correctly flagged. FPR measures the proportion of benign images incorrectly flagged.

Key metrics:

  • TPR = detected_count / total_images for malicious folders
  • FPR = detected_count / total_images for benign folders
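Both rates are the same ratio; only the interpretation differs by parent directory. A minimal sketch (function name is hypothetical):

```python
def folder_rate(detected_ids: list[int], total: int) -> float:
    """detected/total for one subfolder: read as TPR when the folder
    came from malicious/, and as FPR when it came from benign/."""
    return len(detected_ids) / total if total else 0.0
```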

Step 6: Results Serialization

All per-folder results are collected and written to a single JSONL output file named after the detector. Each record contains the subfolder name, the computed rate, the list of detected file IDs, and the total image count.
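The serialization step might be implemented along these lines; the record key names and output-path convention (one <detector>.jsonl per detector, one line per subfolder) follow the text above, but the exact field names are assumptions.

```python
import json

def write_results(records: list[dict], detector_name: str, out_dir: str) -> str:
    """Write per-folder result records to <out_dir>/<detector_name>.jsonl,
    one JSON object per line."""
    path = f"{out_dir}/{detector_name}.jsonl"
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path
```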

Step 7: Ensemble Aggregation (Optional)

If the ensemble detector is selected, it reads all existing per-detector JSONL result files from the output directory, unions detected IDs across detectors for each data folder, and recomputes TPR and FPR. This maximizes recall by combining all detector signals.

Key considerations:

  • Ensemble requires all individual detector results to already exist
  • Uses the same union aggregation strategy as the text ensemble
  • Results saved as ensemble.jsonl
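The union aggregation can be sketched as below. The input shape (detector name mapped to per-folder detected IDs) is an assumption about how the per-detector JSONL files would be loaded; the union itself is the strategy the text describes.

```python
def union_ensemble(per_detector: dict[str, dict[str, list[int]]]) -> dict[str, list[int]]:
    """Union detected file IDs across detectors for each data folder.

    per_detector maps detector name -> {folder: detected_ids}. Each
    folder's rate is then recomputed as len(union) / total, exactly as
    for a single detector, which can only raise recall, never lower it.
    """
    folders: dict[str, set[int]] = {}
    for results in per_detector.values():
        for folder, ids in results.items():
            folders.setdefault(folder, set()).update(ids)
    return {folder: sorted(ids) for folder, ids in folders.items()}
```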

Execution Diagram

GitHub URL

Workflow Repository