Principle:Lakeraai Pint benchmark Benchmark Execution

Knowledge Sources	PINT Benchmark Lakera PINT Benchmark
Domains	Model_Evaluation, Benchmarking, Prompt_Injection
Last Updated	2026-02-14 14:00 GMT

Overview

A systematic evaluation procedure that applies a detection function to every sample in a labeled dataset and aggregates per-category accuracy metrics with balanced scoring.

Description

Benchmark execution is the core evaluation loop of the PINT Benchmark. Given a labeled dataset of text samples (with categories like "prompt_injection", "jailbreak", "chat", "documents") and a detection function, the benchmark:

Iterates through every dataset row, passing the text to the evaluation function
Records whether each prediction matches the ground truth label
Groups results by category and label (True/False) to compute per-group accuracy
Computes an overall score using balanced or imbalanced weighting

This addresses the fundamental challenge of evaluating prompt injection detection systems: the need for category-level granularity (not just overall accuracy) and balanced scoring to handle intentionally imbalanced datasets where benign samples vastly outnumber malicious ones.

Usage

Use this technique whenever you need to evaluate a prompt injection detection system (Hugging Face model, API-based service, or custom system) against a structured dataset. It is the central step in all three PINT Benchmark workflows: Hugging Face Model Evaluation, Custom System Evaluation, and Custom Dataset Benchmarking.

Theoretical Basis

The benchmark follows a stratified evaluation with balanced accuracy approach:

# Abstract algorithm (NOT real implementation)
for each row in dataset:
    prediction = eval_function(row.text)
    row.correct = (prediction == row.label)

# Group by category and label
results = groupby(dataset, [category, label]).aggregate(mean, sum, count)

# Balanced accuracy: mean of per-label accuracies
accuracy_per_label = groupby(results, label).aggregate(correct/total)
balanced_score = mean(accuracy_per_label)

The balanced accuracy formula:

$Balanced Score = \frac{1}{2} (\frac{TP}{TP + FN} + \frac{TN}{TN + FP})$

Where:

TP = True Positives (injections correctly detected)
FN = False Negatives (injections missed)
TN = True Negatives (benign correctly passed)
FP = False Positives (benign incorrectly flagged)

This prevents a naive "always benign" classifier from scoring highly on an imbalanced dataset.

Related Pages

Implemented By

Implementation:Lakeraai_Pint_benchmark_Pint_Benchmark_Function

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment