Principle:Lakeraai Pint benchmark Benchmark Execution
| Knowledge Sources | |
|---|---|
| Domains | Model_Evaluation, Benchmarking, Prompt_Injection |
| Last Updated | 2026-02-14 14:00 GMT |
Overview
A systematic evaluation procedure that applies a detection function to every sample in a labeled dataset and aggregates per-category accuracy metrics with balanced scoring.
Description
Benchmark execution is the core evaluation loop of the PINT Benchmark. Given a labeled dataset of text samples (with categories like "prompt_injection", "jailbreak", "chat", "documents") and a detection function, the benchmark:
- Iterates through every dataset row, passing the text to the evaluation function
- Records whether each prediction matches the ground truth label
- Groups results by category and label (True/False) to compute per-group accuracy
- Computes an overall score using balanced or imbalanced weighting
This addresses the fundamental challenge of evaluating prompt injection detection systems: the need for category-level granularity (not just overall accuracy) and balanced scoring to handle intentionally imbalanced datasets where benign samples vastly outnumber malicious ones.
Usage
Use this technique whenever you need to evaluate a prompt injection detection system (Hugging Face model, API-based service, or custom system) against a structured dataset. It is the central step in all three PINT Benchmark workflows: Hugging Face Model Evaluation, Custom System Evaluation, and Custom Dataset Benchmarking.
Theoretical Basis
The benchmark follows a stratified evaluation with balanced accuracy approach:
# Abstract algorithm (NOT real implementation)
for each row in dataset:
prediction = eval_function(row.text)
row.correct = (prediction == row.label)
# Group by category and label
results = groupby(dataset, [category, label]).aggregate(mean, sum, count)
# Balanced accuracy: mean of per-label accuracies
accuracy_per_label = groupby(results, label).aggregate(correct/total)
balanced_score = mean(accuracy_per_label)
The balanced accuracy formula:
Where:
- TP = True Positives (injections correctly detected)
- FN = False Negatives (injections missed)
- TN = True Negatives (benign correctly passed)
- FP = False Positives (benign incorrectly flagged)
This prevents a naive "always benign" classifier from scoring highly on an imbalanced dataset.