Principle:Promptfoo Promptfoo Vulnerability Grading

Knowledge Sources	Promptfoo Red Team Grading
Domains	Security_Testing, Vulnerability_Assessment
Last Updated	2026-02-14 08:00 GMT

Overview

Vulnerability grading is an LLM-as-judge mechanism that evaluates a target system's responses to adversarial attacks and decides whether each response exposes a vulnerability.

Description

Vulnerability Grading uses a grading LLM to assess whether a target system's response to an adversarial attack indicates a vulnerability. Unlike standard assertion grading (which checks for expected outputs), vulnerability grading looks for undesirable behaviors: compliance with harmful requests, information leakage, or failure to refuse inappropriate queries.

Each plugin defines a rubric template. The template is rendered with the attack context (purpose, entities, prompt, response), and the rendered rubric is then evaluated by the grading LLM. The grader also performs refusal detection, so that a response in which the target simply refuses the request is not mistaken for a genuine vulnerability.
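The two steps described above, rendering a rubric from the attack context and checking for a refusal, can be sketched as follows. This is an illustrative sketch, not Promptfoo's implementation: the `AttackContext` shape, the `renderRubric` and `looksLikeRefusal` helpers, the `{{ name }}` placeholder syntax, and the refusal prefix list are all assumptions made for the example.

```typescript
// Hypothetical context shape; the field names mirror the variables listed above.
interface AttackContext {
  purpose: string;
  entities: string[];
  prompt: string;
  response: string;
}

// Minimal stand-in for a template engine: substitutes {{ name }} placeholders
// and leaves unknown placeholders untouched.
function renderRubric(template: string, ctx: AttackContext): string {
  const vars: Record<string, string> = {
    purpose: ctx.purpose,
    entities: ctx.entities.join(', '),
    prompt: ctx.prompt,
    response: ctx.response,
  };
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, name: string) =>
    Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : match,
  );
}

// Assumed refusal heuristic: a short list of common refusal openings.
const REFUSAL_PREFIXES: string[] = [
  "i'm sorry",
  'i am sorry',
  'i cannot',
  "i can't",
  'i am unable to',
];

function looksLikeRefusal(response: string): boolean {
  const normalized = response.trim().toLowerCase();
  return REFUSAL_PREFIXES.some((prefix) => normalized.indexOf(prefix) === 0);
}
```

A prefix match is deliberately conservative: it catches boilerplate refusals cheaply and leaves ambiguous responses to the grading LLM.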

Usage

Use this principle when evaluating red team test results. Each plugin's grader is invoked automatically during the evaluation phase to score every adversarial test case.

Theoretical Basis

Pseudo-code Logic:

1. For each red team test result:
   a. Render plugin-specific rubric with test context variables
   b. Check for refusal patterns (unless skip flag set)
   c. Send rubric + target response to grading LLM
   d. Parse LLM judgment: { pass: boolean, score: number, reason: string }, where pass = false indicates a vulnerability
   e. Generate remediation suggestions if vulnerability found
2. Return GradingResult with rubric text and suggestions
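The steps above can be sketched in TypeScript as shown below. The sketch rests on several assumptions: the rubric arrives already rendered (step a), the grading LLM is modeled as a synchronous `judge` callback that returns raw JSON text (a real grading call would be asynchronous), and a detected refusal is treated as a pass without calling the grader. `RedteamCase`, `GraderDeps`, `parseJudgment`, and `gradeCase` are hypothetical names and do not describe the signature of `RedteamGraderBase.getResult`.

```typescript
interface Judgment {
  pass: boolean;
  score: number;
  reason: string;
}

interface GradingResult {
  grade: Judgment;
  rubric: string;
  suggestions: string[];
}

// Hypothetical test-case shape: the rubric is assumed to be rendered already.
interface RedteamCase {
  rubric: string;
  response: string;
  skipRefusalCheck?: boolean;
}

// Injected collaborators, so the sketch stays independent of any provider.
interface GraderDeps {
  isRefusal: (response: string) => boolean;
  judge: (rubric: string, response: string) => string; // raw JSON text
  suggest: (testCase: RedteamCase, grade: Judgment) => string[];
}

// Step d: validate the grader's JSON and default the score from pass.
function parseJudgment(raw: string): Judgment {
  const data = JSON.parse(raw);
  if (typeof data.pass !== 'boolean' || typeof data.reason !== 'string') {
    throw new Error('Malformed judgment: ' + raw);
  }
  const score = typeof data.score === 'number' ? data.score : data.pass ? 1 : 0;
  return { pass: data.pass, score, reason: data.reason };
}

function gradeCase(testCase: RedteamCase, deps: GraderDeps): GradingResult {
  // Step b (assumption): a refusal counts as a pass and skips the grading LLM.
  if (!testCase.skipRefusalCheck && deps.isRefusal(testCase.response)) {
    return {
      grade: { pass: true, score: 1, reason: 'Target refused the request' },
      rubric: testCase.rubric,
      suggestions: [],
    };
  }
  // Steps c and d: ask the grading LLM, then parse its judgment.
  const grade = parseJudgment(deps.judge(testCase.rubric, testCase.response));
  // Step e: remediation suggestions only when a vulnerability was found.
  const suggestions = grade.pass ? [] : deps.suggest(testCase, grade);
  return { grade, rubric: testCase.rubric, suggestions };
}
```

Passing the refusal check, judge, and suggestion generator in as dependencies keeps the control flow testable with stubs and makes the effect of the skip flag easy to see.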

Related Pages

Implemented By

Implementation:Promptfoo_Promptfoo_RedteamGraderBase_getResult
