Principle: Promptfoo Assertion Grading
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Quality_Assurance |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
A multi-strategy grading mechanism that evaluates LLM outputs against deterministic checks, embedding similarity, and LLM-as-judge rubrics.
Description
Assertion Grading is the process of scoring LLM outputs to determine whether they meet expected quality criteria. This is the critical quality gate in the evaluation pipeline.
Promptfoo supports three categories of assertions:
- Deterministic: Exact match, contains, regex, JSON schema validation, cost/latency thresholds
- Similarity-based: Cosine similarity between output and expected embeddings
- Model-graded: Using another LLM to judge output quality against a rubric (llm-rubric, factuality, answer-relevance)
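As an illustration, a single test can mix one assertion from each category. The values below are placeholders, not output from a real evaluation:

```yaml
tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      # Deterministic: substring check
      - type: contains
        value: "Paris"
      # Similarity-based: cosine similarity against an expected answer
      - type: similar
        value: "The capital of France is Paris."
        threshold: 0.8
      # Model-graded: an LLM judges the output against a rubric
      - type: llm-rubric
        value: "Answers the question accurately and concisely"
```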
The grading system also supports:
- Threshold-based scoring: A test passes if its weighted assertion score meets or exceeds a configurable threshold
- Named metrics: Assertions can contribute to named metrics for aggregation
- Custom functions: JavaScript, Python, or Ruby functions as assertions
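A custom Python assertion is a file exposing a `get_assert(output, context)` function that promptfoo calls with the model output; it can return a bool, a float score, or a GradingResult-style dict. The check below (a required keyword) is purely illustrative:

```python
def get_assert(output, context):
    """Custom assertion sketch: pass if the output mentions a refund.

    Returns a GradingResult-style dict with pass/score/reason, the same
    shape the built-in matchers record for each assertion.
    """
    required = "refund"  # hypothetical criterion for this example
    ok = required in output.lower()
    return {
        "pass": ok,
        "score": 1.0 if ok else 0.0,
        "reason": f"output {'contains' if ok else 'is missing'} '{required}'",
    }

if __name__ == "__main__":
    print(get_assert("We will process your refund within 5 days.", {}))
```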
Usage
Use this principle after evaluation execution to determine pass/fail status for each test case. This is the fifth step in the pipeline and directly determines the quality metrics reported to users.
Theoretical Basis
The grading algorithm processes assertions sequentially within each test case:
Pseudo-code Logic:
1. For each assertion in test.assert:
a. Determine assertion type (deterministic, similarity, model-graded, custom)
b. Execute the appropriate matcher function
c. Record: { pass: boolean, score: number, reason: string }
2. Aggregate component results:
a. Calculate weighted score across all assertions
b. Compare against test.threshold (default: 1.0 = all must pass)
c. Determine overall pass/fail
3. Return GradingResult with component details
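The aggregation step above can be sketched in Python. This is a minimal model of the logic, not promptfoo's internals; the names `ComponentResult` and `grade` are illustrative:

```python
from dataclasses import dataclass, asdict

@dataclass
class ComponentResult:
    passed: bool   # step 1c: per-assertion pass
    score: float   # step 1c: per-assertion score
    reason: str
    weight: float = 1.0

def grade(components, threshold=None):
    """Aggregate component results into an overall GradingResult.

    Computes the weighted mean score (step 2a), then compares it against
    the test threshold if one is set; otherwise every component must pass
    (step 2b-c). Returns the result with component details (step 3).
    """
    total_weight = sum(c.weight for c in components) or 1.0
    score = sum(c.score * c.weight for c in components) / total_weight
    if threshold is not None:
        passed = score >= threshold
    else:
        passed = all(c.passed for c in components)
    return {
        "pass": passed,
        "score": score,
        "componentResults": [asdict(c) for c in components],
    }
```

For example, with one passing and one failing equally weighted assertion, the weighted score is 0.5: the test passes under `threshold=0.5` but fails when no threshold is set, since not all components passed.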