Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Workflow:Norrrrrrr lyn WAInjectBench Text Prompt Injection Detection

From Leeroopedia
Knowledge Sources
Domains Prompt_Injection, Security, NLP, Benchmarking
Last Updated 2026-02-14 16:00 GMT

Overview

End-to-end process for evaluating text-based prompt injection detectors against the WAInjectBench benchmark dataset.

Description

This workflow runs a selected text-based prompt injection detector against a structured dataset of benign and malicious text samples. The benchmark supports six detector strategies: a fine-tuned classifier (DataSentinel), an embedding-based binary classifier, a known-answer detection canary method, a zero-shot LLM classifier (PromptArmor via GPT-4o), a purpose-built guard model (Meta PromptGuard), and a union ensemble of all detectors. Each detector is loaded dynamically via a plugin architecture and exposes a uniform detect() interface. Results are expressed as True Positive Rate (TPR) and False Positive Rate (FPR) and serialized to JSONL.

Usage

Execute this workflow when you need to benchmark a text-based prompt injection detector's accuracy on the WAInjectBench dataset. You have a directory containing benign and malicious JSONL text files organized into subfolders, and you want to measure how well a specific detector (or ensemble) distinguishes injected prompts from benign inputs.

Execution Steps

Step 1: Environment Setup

Prepare the runtime environment by installing dependencies from the provided Conda specification. Ensure GPU availability and set the CUDA_VISIBLE_DEVICES environment variable. For detectors that require external API keys (e.g., PromptArmor uses OpenAI GPT-4o), configure the OPENAI_API_KEY environment variable. For DataSentinel, clone the Open-Prompt-Injection repository and download the pretrained model checkpoint.

Key considerations:

  • The Conda environment is defined in environment.yml
  • PromptArmor requires a valid OpenAI API key
  • DataSentinel requires an external model checkpoint and repository clone

Step 2: Dataset Preparation

Organize text data into the expected directory structure. The data directory must contain two subdirectories: benign/ and malicious/, each containing JSONL files. Each JSONL record includes a text field. The benchmark covers 4 benign categories and 8 malicious attack types.

Expected structure:

  • data/text/benign/ — JSONL files with benign web agent instructions
  • data/text/malicious/ — JSONL files with injected prompts across 6 attack types

Step 3: Detector Selection and Loading

Choose a detector from the available options and invoke the evaluation framework via CLI. The framework dynamically imports the corresponding detector module from the detector_text/ package using Python's importlib. Each detector module exposes a detect(file_path) function that accepts a JSONL file path and returns a list of detected sample IDs.

Available detectors:

  • kad — Known-Answer Detection using a canary token approach with Mistral-7B
  • promptarmor — Zero-shot classification via GPT-4o
  • embedding-t — Sentence embedding + LogisticRegression classifier
  • promptguard — Meta Llama Prompt Guard 2 (86M parameters)
  • datasentinel — Fine-tuned Mistral-based classifier
  • ensemble — Union aggregation of all individual detector results

Step 4: Per-file Detection Execution

The framework iterates over all JSONL files in both the benign and malicious subdirectories. For each file, it invokes the detector's detect() function, which processes all records and returns IDs of samples classified as injections. The total number of samples per file is counted by reading the JSONL line count.

What happens:

  • Each JSONL file is passed to the detector independently
  • The detector returns a list of integer IDs for flagged samples
  • The framework tracks which folder (benign or malicious) each file came from

Step 5: Metric Computation

For each file, the framework computes either TPR or FPR depending on the source folder. For malicious files, TPR is calculated as the ratio of detected IDs to total samples (measuring recall of actual injections). For benign files, FPR is calculated as the ratio of false detections to total samples (measuring false alarm rate).

Key metrics:

  • TPR (True Positive Rate) = detected_count / total_malicious for malicious files
  • FPR (False Positive Rate) = detected_count / total_benign for benign files

Step 6: Results Serialization

All per-file results are collected and written to a single JSONL output file named after the detector (e.g., kad.jsonl). Each result record contains the data file name, the computed rate (TPR or FPR), the list of detected IDs, and the total sample count.

Output format per record:

  • data_name, tpr or fpr, detect_ids list, total_num

Step 7: Ensemble Aggregation (Optional)

If the ensemble detector is selected, instead of running a new detection pass, it reads all existing per-detector JSONL result files from the output directory. It unions the detected IDs across all detectors for each data file, then recomputes TPR and FPR on the combined set. This maximizes recall at the potential cost of increased false positives.

Key considerations:

  • Ensemble requires all individual detector results to already exist in the result directory
  • Union strategy ensures any sample flagged by any detector is included
  • Ensemble results are saved as ensemble.jsonl

Execution Diagram

GitHub URL

Workflow Repository