Principle:Promptfoo Promptfoo Vulnerability Grading
| Knowledge Sources | |
|---|---|
| Domains | Security_Testing, Vulnerability_Assessment |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
An LLM-as-judge grading mechanism that evaluates a target's responses to adversarial attacks and decides whether each response reveals a vulnerability.
Description
Vulnerability Grading uses a grading LLM to assess whether a target system's response to an adversarial attack indicates a vulnerability. Unlike standard assertion grading (which checks for expected outputs), vulnerability grading looks for undesirable behaviors: compliance with harmful requests, information leakage, or failure to refuse inappropriate queries.
Each plugin defines a rubric template that is rendered with the attack context (purpose, entities, prompt, response) and then evaluated by the grading LLM. The grader also performs refusal detection to distinguish genuine vulnerabilities from refused requests.
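The two ideas in this paragraph, rendering a rubric from the attack context and checking for a refusal, can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not Promptfoo's implementation: the `AttackContext` shape, the `renderRubric` and `looksLikeRefusal` helpers, the `{{name}}` placeholder syntax, and the refusal prefix list are all hypothetical.

```typescript
// Hypothetical context shape; the field names mirror the variables named above.
interface AttackContext {
  purpose: string;
  entities: string[];
  prompt: string;
  response: string;
}

// Minimal {{name}} substitution (an assumption for illustration).
// Unknown placeholders are left untouched so mistakes stay visible.
function renderRubric(template: string, ctx: AttackContext): string {
  const vars = new Map<string, string>([
    ["purpose", ctx.purpose],
    ["entities", ctx.entities.join(", ")],
    ["prompt", ctx.prompt],
    ["response", ctx.response],
  ]);
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (match: string, name: string) => vars.get(name) ?? match,
  );
}

// Illustrative refusal prefixes; neither exhaustive nor an official list.
const REFUSAL_PREFIXES = ["i'm sorry", "i cannot", "i can't", "i am unable to"];

// Treats a response as a refusal when it starts with a known refusal phrase.
function looksLikeRefusal(response: string): boolean {
  const normalized = response.trim().toLowerCase();
  return REFUSAL_PREFIXES.some((prefix) => normalized.startsWith(prefix));
}
```

A prefix match is deliberately conservative: a response that opens with an apology but then complies would be misclassified, which is one reason a grader may need a way to skip this check.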
Usage
Use this principle when evaluating red team test results. Each plugin's grader is invoked automatically during the evaluation phase and scores every adversarial test case.
Theoretical Basis
Pseudo-code Logic:
1. For each red team test result:
   a. Render plugin-specific rubric with test context variables
   b. Check for refusal patterns (unless skip flag set)
   c. Send rubric + target response to grading LLM
   d. Parse LLM judgment: { pass: boolean, score: number, reason: string }
   e. Generate remediation suggestions if vulnerability found
2. Return GradingResult with rubric text and suggestions
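The steps above can be sketched in TypeScript for a single test result. This is a hedged illustration, not the library's code: `TestResult`, `PluginGrader`, `Judge`, `parseJudgment`, and `gradeResult` are hypothetical names, the judge is modeled as a synchronous function returning JSON text for brevity (a real LLM call would be asynchronous), and treating a detected refusal as an immediate pass is an assumption.

```typescript
interface Judgment { pass: boolean; score: number; reason: string; }
interface GradingResult { grade: Judgment; rubric: string; suggestions: string[]; }
interface TestResult { prompt: string; response: string; }

// Hypothetical plugin contract: how to build the rubric and what to suggest.
interface PluginGrader {
  renderRubric(test: TestResult): string;
  suggest(test: TestResult): string[];
  skipRefusalCheck?: boolean;
}

// Stand-in for the grading LLM; returns the raw JSON text of its judgment.
type Judge = (rubric: string, response: string) => string;

// Step d: validate the judge's output before trusting it.
function parseJudgment(raw: string): Judgment {
  const data = (JSON.parse(raw) ?? {}) as Partial<Judgment>;
  const { pass, score, reason } = data;
  if (typeof pass !== "boolean" || typeof score !== "number" || typeof reason !== "string") {
    throw new Error("Grader output is not a valid judgment");
  }
  return { pass, score, reason };
}

function gradeResult(
  test: TestResult,
  plugin: PluginGrader,
  judge: Judge,
  isRefusal: (response: string) => boolean,
): GradingResult {
  const rubric = plugin.renderRubric(test); // step a
  // Step b: assumption, a refusal counts as a pass and skips the LLM call.
  if (!plugin.skipRefusalCheck && isRefusal(test.response)) {
    const grade = { pass: true, score: 1, reason: "Target refused the request." };
    return { grade, rubric, suggestions: [] };
  }
  const grade = parseJudgment(judge(rubric, test.response)); // steps c and d
  const suggestions = grade.pass ? [] : plugin.suggest(test); // step e
  return { grade, rubric, suggestions }; // step 2
}
```

Passing the judge and the refusal check in as functions keeps the sketch testable with stubs, so the control flow can be exercised without calling a model.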