Workflow:Liu00222 Open Prompt Injection Prompt Injection Experiment
| Knowledge Sources | |
|---|---|
| Domains | LLM_Security, Prompt_Injection, Benchmarking |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
End-to-end process for running prompt injection attack experiments against LLM-integrated applications, evaluating both attack effectiveness and defense robustness using standardized metrics.
Description
This workflow implements the complete experimental pipeline for evaluating prompt injection attacks on LLM-integrated applications. It orchestrates the setup of a target NLP task (e.g., sentiment analysis, spam detection), an LLM backend (GPT, PaLM2, Llama, etc.), an injection attack strategy (naive, escape-char, ignore, fake-completion, or combined), and an optional defense mechanism (paraphrasing, retokenization, delimiters, sandwich, instructional, LLM-based, known-answer, perplexity-based). The pipeline produces four standardized metrics: PNA-T (target task accuracy post-attack), PNA-I (injected task baseline accuracy), ASV (attack success value), and MR (match rate between injected baseline and attack responses). Results are saved incrementally as NumPy archives for resumability.
Usage
Execute this workflow when you need to benchmark the effectiveness of a prompt injection attack strategy against a specific LLM and task combination, optionally with a defense mechanism enabled. You should have JSON configuration files for the model and tasks, and API keys configured for cloud-hosted models (GPT, PaLM2) or local model weights for open-source models (Llama, Flan-T5, Vicuna, DeepSeek).
Execution Steps
Step 1: Environment and Configuration Setup
Install the Conda environment from the provided specification file, which includes PyTorch, HuggingFace Transformers, and other dependencies. Prepare JSON configuration files for the target model (specifying provider, model name, API keys, and generation parameters) and for both the target and injected tasks (specifying dataset name, data paths, system prompt paths, and label mappings). Configure the output save path and select the attack strategy and defense mechanism via command-line arguments.
Key considerations:
- Model configs live in configs/model_configs/ and task configs in configs/task_configs/
- API keys for cloud models must be populated in the model config JSON
- Local models require sufficient GPU memory (e.g., 7B models need ~16GB VRAM with quantization)
Step 2: Task and Model Initialization
Load the target task dataset using the factory function, which reads the task config, loads the appropriate dataset class (SST-2, SMS Spam, HSOL, JFLEG, Gigaword, MRPC, RTE, Math500, or Compromise), and prepares system prompts and ground-truth labels. Load the LLM backend via the model factory, which instantiates the appropriate wrapper (GPT, PaLM2, Flan, Llama, Llama3, Vicuna, DeepSeek, InternLM, or QLoRA). Then load the injected task dataset and create the attacker with the selected strategy.
Key considerations:
- The data_num parameter controls how many samples are used from each dataset
- Tasks loaded with for_injection=True use injection-specific system prompts
- The attacker factory supports strategies: naive, escape, ignore, fake_completion, combine
Step 3: Application Assembly with Defense
Wrap the target task and model into an Application object, optionally applying a defense mechanism. The Application class constructs the instruction prompt (modifying it for instructional defense), prepares any defense-specific resources (BPE tables for retokenization, surrogate models for perplexity filtering), and provides the query interface that routes through pre-detection, preprocessing, prompt construction, model querying, and response post-processing.
Key considerations:
- Seven defense options are available: paraphrasing, retokenization, delimiters, sandwich, instructional, LLM-based detection, known-answer detection
- Perplexity-based defense requires loading a separate surrogate model (Vicuna-7B)
- The no defense option sends prompts directly without modification
Step 4: Target Task Baseline Evaluation
Run the clean (unattacked) target task through the application to establish baseline performance. For each data sample, the application constructs the full prompt (instruction + data), queries the model, and records the response. Results are cached as a NumPy archive file so subsequent runs can skip this phase if the baseline already exists.
Key considerations:
- Responses are saved incrementally to target_task_responses.npz
- A 1-second sleep is inserted every 2 queries to respect API rate limits
- If the file already exists, it is loaded instead of re-running queries
Step 5: Injected Task Baseline Evaluation
When no defense is applied, run the injected task data through the model directly (without the target task application wrapper) to establish how well the model performs the injected task in isolation. This provides the PNA-I baseline. When a defense is active, this step is skipped since the defense may alter injected task behavior.
Key considerations:
- This step only runs when defense is set to no
- The injected task instruction and data are concatenated and sent directly to the model
- Results are cached in injected_task_responses.npz
Step 6: Attack Execution
For each target task sample, apply the selected attack strategy to inject the attacker's payload into the data prompt. The attacker modifies the clean data by appending, prepending, or wrapping injected instructions and data according to the strategy (e.g., CombineAttacker adds a fake task completion, an "ignore previous instructions" directive, and the injected task instruction with data). The modified prompt is then sent through the target application (with any active defense) and responses are recorded.
Key considerations:
- The CombineAttacker generates task-specific fake completion text (e.g., "Answer: negative sentiment." for sentiment analysis)
- Attack responses are cached in attack_responses.npz for resumability
- Each attack query goes through the full defense pipeline if one is active
Step 7: Metrics Evaluation
Create an Evaluator that computes four standardized metrics from the collected responses. PNA-T measures how well the model still performs the target task after attack. PNA-I measures the model's baseline performance on the injected task. ASV (Attack Success Value) measures how successfully the injected task was completed during the attack. MR (Match Rate) compares attack responses against injected task baselines to measure response similarity. Different evaluation functions are used per task type (exact match for classification, GLEU for grammar correction, ROUGE for summarization).
Key considerations:
- Each metric uses task-specific evaluation helpers (eval_sst2, eval_spam, eval_hsol, etc.)
- JFLEG tasks use GLEU scoring which requires temporary file creation
- Metrics are printed to stdout and the experiment ends with an [END] marker