Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Liu00222 Open Prompt Injection Prompt Injection Experiment

From Leeroopedia
Knowledge Sources
Domains LLM_Security, Prompt_Injection, Benchmarking
Last Updated 2026-02-14 15:00 GMT

Overview

End-to-end process for running prompt injection attack experiments against LLM-integrated applications, evaluating both attack effectiveness and defense robustness using standardized metrics.

Description

This workflow implements the complete experimental pipeline for evaluating prompt injection attacks on LLM-integrated applications. It orchestrates the setup of a target NLP task (e.g., sentiment analysis, spam detection), an LLM backend (GPT, PaLM2, Llama, etc.), an injection attack strategy (naive, escape-char, ignore, fake-completion, or combined), and an optional defense mechanism (paraphrasing, retokenization, delimiters, sandwich, instructional, LLM-based, known-answer, perplexity-based). The pipeline produces four standardized metrics: PNA-T (target task accuracy post-attack), PNA-I (injected task baseline accuracy), ASV (attack success value), and MR (match rate between injected baseline and attack responses). Results are saved incrementally as NumPy archives for resumability.

Usage

Execute this workflow when you need to benchmark the effectiveness of a prompt injection attack strategy against a specific LLM and task combination, optionally with a defense mechanism enabled. You should have JSON configuration files for the model and tasks, and API keys configured for cloud-hosted models (GPT, PaLM2) or local model weights for open-source models (Llama, Flan-T5, Vicuna, DeepSeek).

Execution Steps

Step 1: Environment and Configuration Setup

Install the Conda environment from the provided specification file, which includes PyTorch, HuggingFace Transformers, and other dependencies. Prepare JSON configuration files for the target model (specifying provider, model name, API keys, and generation parameters) and for both the target and injected tasks (specifying dataset name, data paths, system prompt paths, and label mappings). Configure the output save path and select the attack strategy and defense mechanism via command-line arguments.

Key considerations:

  • Model configs live in configs/model_configs/ and task configs in configs/task_configs/
  • API keys for cloud models must be populated in the model config JSON
  • Local models require sufficient GPU memory (e.g., 7B models need ~16GB VRAM with quantization)

Step 2: Task and Model Initialization

Load the target task dataset using the factory function, which reads the task config, loads the appropriate dataset class (SST-2, SMS Spam, HSOL, JFLEG, Gigaword, MRPC, RTE, Math500, or Compromise), and prepares system prompts and ground-truth labels. Load the LLM backend via the model factory, which instantiates the appropriate wrapper (GPT, PaLM2, Flan, Llama, Llama3, Vicuna, DeepSeek, InternLM, or QLoRA). Then load the injected task dataset and create the attacker with the selected strategy.

Key considerations:

  • The data_num parameter controls how many samples are used from each dataset
  • Tasks loaded with for_injection=True use injection-specific system prompts
  • The attacker factory supports strategies: naive, escape, ignore, fake_completion, combine

Step 3: Application Assembly with Defense

Wrap the target task and model into an Application object, optionally applying a defense mechanism. The Application class constructs the instruction prompt (modifying it for instructional defense), prepares any defense-specific resources (BPE tables for retokenization, surrogate models for perplexity filtering), and provides the query interface that routes through pre-detection, preprocessing, prompt construction, model querying, and response post-processing.

Key considerations:

  • Seven defense options are available: paraphrasing, retokenization, delimiters, sandwich, instructional, LLM-based detection, known-answer detection
  • Perplexity-based defense requires loading a separate surrogate model (Vicuna-7B)
  • The no defense option sends prompts directly without modification

Step 4: Target Task Baseline Evaluation

Run the clean (unattacked) target task through the application to establish baseline performance. For each data sample, the application constructs the full prompt (instruction + data), queries the model, and records the response. Results are cached as a NumPy archive file so subsequent runs can skip this phase if the baseline already exists.

Key considerations:

  • Responses are saved incrementally to target_task_responses.npz
  • A 1-second sleep is inserted every 2 queries to respect API rate limits
  • If the file already exists, it is loaded instead of re-running queries

Step 5: Injected Task Baseline Evaluation

When no defense is applied, run the injected task data through the model directly (without the target task application wrapper) to establish how well the model performs the injected task in isolation. This provides the PNA-I baseline. When a defense is active, this step is skipped since the defense may alter injected task behavior.

Key considerations:

  • This step only runs when defense is set to no
  • The injected task instruction and data are concatenated and sent directly to the model
  • Results are cached in injected_task_responses.npz

Step 6: Attack Execution

For each target task sample, apply the selected attack strategy to inject the attacker's payload into the data prompt. The attacker modifies the clean data by appending, prepending, or wrapping injected instructions and data according to the strategy (e.g., CombineAttacker adds a fake task completion, an "ignore previous instructions" directive, and the injected task instruction with data). The modified prompt is then sent through the target application (with any active defense) and responses are recorded.

Key considerations:

  • The CombineAttacker generates task-specific fake completion text (e.g., "Answer: negative sentiment." for sentiment analysis)
  • Attack responses are cached in attack_responses.npz for resumability
  • Each attack query goes through the full defense pipeline if one is active

Step 7: Metrics Evaluation

Create an Evaluator that computes four standardized metrics from the collected responses. PNA-T measures how well the model still performs the target task after attack. PNA-I measures the model's baseline performance on the injected task. ASV (Attack Success Value) measures how successfully the injected task was completed during the attack. MR (Match Rate) compares attack responses against injected task baselines to measure response similarity. Different evaluation functions are used per task type (exact match for classification, GLEU for grammar correction, ROUGE for summarization).

Key considerations:

  • Each metric uses task-specific evaluation helpers (eval_sst2, eval_spam, eval_hsol, etc.)
  • JFLEG tasks use GLEU scoring which requires temporary file creation
  • Metrics are printed to stdout and the experiment ends with an [END] marker

Execution Diagram

GitHub URL

Workflow Repository