Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Workflow:Promptfoo Promptfoo Red Team Security Scan

From Leeroopedia
Knowledge Sources
Domains AI_Security, Red_Teaming, Vulnerability_Assessment
Last Updated 2026-02-14 08:00 GMT

Overview

End-to-end process for automatically discovering security vulnerabilities in LLM applications through adversarial testing with dynamic attack generation, multi-strategy delivery, and automated grading.

Description

This workflow implements Promptfoo's automated red teaming pipeline. It combines plugin-based test generation (50+ vulnerability types including prompt injection, jailbreaking, PII disclosure, and harmful content) with strategy-based attack delivery (encoding transforms, multi-turn conversations, iterative refinement) to systematically probe an LLM application for security weaknesses. The pipeline extracts the target system's purpose, generates tailored adversarial inputs, applies attack strategies, executes them against the target, grades responses for vulnerability, and produces a structured security report with risk scores.

Usage

Execute this workflow when you need to:

  • Assess an LLM application's resistance to prompt injection and jailbreaking attacks
  • Validate compliance with security frameworks (OWASP LLM Top 10, NIST AI RMF, EU AI Act, MITRE ATLAS)
  • Test guardrail effectiveness against adversarial inputs
  • Generate a security audit report for stakeholder or compliance review
  • Detect model drift in safety behavior after model upgrades

Input state: A red team configuration specifying the target endpoint (HTTP API, model, or custom provider), the system purpose, selected plugins, and attack strategies.

Output state: A vulnerability report showing discovered weaknesses, severity levels, attack success rates, and remediation guidance.

Execution Steps

Step 1: Target Configuration

Define the target LLM application and its security testing parameters. The target can be an HTTP endpoint, a direct model reference, or a custom provider (Python, JavaScript, etc.). The configuration specifies the system purpose (what the application does and what it should refuse), which guides the generation of contextually relevant attacks.

Key considerations:

  • The purpose field is critical; it tells the generator what behaviors to test and what constitutes a vulnerability
  • Targets can be tested via HTTP with request/response transforms
  • The injectVar field specifies which variable receives adversarial content
  • Multiple targets can be tested in a single run for comparative security analysis

Step 2: Plugin Selection

Select the vulnerability categories to test from 50+ available plugins. Each plugin focuses on a specific vulnerability type such as prompt injection, PII disclosure, harmful content generation, excessive agency, or hallucination. Plugins can be selected individually, by compliance framework (OWASP, NIST, EU AI Act), or as curated collections.

Key considerations:

  • Plugins are organized into categories: security, privacy, compliance, ethics, and operational
  • Each plugin can be configured with numTests to control coverage depth
  • Custom severity levels can override defaults for organization-specific risk models
  • Plugin-level graderExamples provide concrete pass/fail examples to improve grading accuracy

Step 3: Attack Generation

Generate adversarial test cases for each selected plugin. The generator first extracts the system purpose and relevant entities (names, organizations, topics) from the target configuration. It then uses a specialized LLM to produce contextually relevant attack prompts tailored to each vulnerability type. The generation process produces numTests (default: 5) test cases per plugin.

Key considerations:

  • Generation uses a configurable provider (default: GPT-5) for creating attacks
  • Remote generation is available via Promptfoo's API for higher-quality adversarial inputs
  • testGenerationInstructions can guide the generator toward specific attack patterns
  • Entity extraction ensures attacks reference realistic details relevant to the target domain

Step 4: Strategy Application

Transform the generated test cases using attack delivery strategies. Strategies wrap base test cases with evasion techniques: encoding (Base64, ROT13, leetspeak), multi-turn conversation (GOAT, Crescendo, Hydra), iterative refinement (tree search, iterative jailbreaks), template-based jailbreaks, and composite techniques. Each strategy produces additional test variants.

Key considerations:

  • Default strategies include jailbreak and prompt-injection
  • Multi-turn strategies (GOAT, Crescendo) use conversational escalation over multiple exchanges
  • The layer strategy composes multiple strategies sequentially for deeper penetration
  • Image, audio, and video strategies test multimodal attack surfaces
  • Strategies can be configured with parameters (e.g., number of turns, encoding depth)

Step 5: Test Execution

Execute all generated and strategy-transformed test cases against the target. Each test case is sent to the target endpoint, and the response is collected. Execution follows the same concurrent evaluation engine as standard evaluations, with rate limiting, caching, and progress tracking.

Key considerations:

  • Execution respects rate limits configured for the target provider
  • Multi-turn strategies maintain conversation state across exchanges
  • HTTP targets support custom headers, authentication, and response extraction
  • The evaluation engine handles retries for transient failures

Step 6: Vulnerability Grading

Grade each response to determine whether the attack succeeded. The grading system uses an LLM judge to evaluate whether the target's response constitutes a vulnerability. Graders consider the attack intent, the response content, and the system purpose to produce a binary pass/fail with confidence scoring. Custom graderGuidance and graderExamples can be provided to calibrate the judge.

Key considerations:

  • Grading is automated using a separate LLM (configurable, default: GPT-5)
  • Custom grader examples reduce false positives by showing what constitutes a real failure
  • Risk scores aggregate vulnerability results into severity levels (critical, high, medium, low)
  • The grading model can be changed to use a more capable model for higher accuracy

Step 7: Report Generation

Produce a structured vulnerability report with findings organized by category, severity, and attack type. The report includes vulnerability descriptions, example attack/response pairs, risk scores, compliance framework mappings, and remediation recommendations. Reports are viewable in the web UI or exportable as JSON for integration with security tooling.

Key considerations:

  • Reports map findings to compliance frameworks (OWASP LLM Top 10, NIST AI RMF, EU AI Act)
  • Severity levels follow a standardized scale with configurable thresholds
  • The web UI provides interactive filtering, sorting, and drill-down into individual findings
  • Reports can be shared via generated URLs for team review
  • CI/CD integration supports fail-on-threshold for security gates

Execution Diagram

GitHub URL

Workflow Repository