Principle:Avdvg InjectGuard Evaluation And Metrics

Knowledge Sources	A systematic analysis of performance measures for classification tasks InjectGuard
Domains	Evaluation, Machine_Learning, Security
Last Updated	2026-02-14 16:00 GMT

Overview

A systematic methodology for measuring the effectiveness of a binary classification system using standard metrics: accuracy, precision, recall, and F1-score.

Description

Evaluation and metrics is the process of quantifying how well the detection system performs on a labeled test dataset. In the context of prompt injection detection, this means running every sample in a test set through the detection function, collecting predictions, and computing aggregate performance metrics against ground-truth labels.

The four standard binary classification metrics used are:

Accuracy: Fraction of correct predictions overall. Can be misleading with class imbalance.
Precision: Of all inputs flagged as malicious, what fraction truly are. Measures false positive rate.
Recall: Of all truly malicious inputs, what fraction were detected. Measures false negative rate.
F1-score: Harmonic mean of precision and recall. Balances both error types.

For security applications, recall is often prioritized (missing a real attack is more dangerous than a false alarm), but the threshold parameter allows operators to tune this tradeoff.

Usage

Use this principle whenever validating or benchmarking a detection system. It should be applied on a held-out labeled test set that was not used to build the vector store. It is also useful for comparing different threshold values (sim_k) or different embedding models.

Theoretical Basis

Given predictions $\hat{y}$ and true labels $y$ for a binary classification task:

$Accuracy = \frac{T P + T N}{T P + T N + F P + F N}$

$Precision = \frac{T P}{T P + F P}$

$Recall = \frac{T P}{T P + F N}$

$F_{1} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$

Where:

TP = True Positives (correctly detected malicious inputs)
TN = True Negatives (correctly passed benign inputs)
FP = False Positives (benign inputs incorrectly flagged)
FN = False Negatives (malicious inputs missed)

Pseudo-code:

# Abstract evaluation algorithm
predictions = []
labels = []
for sample in test_dataset:
    pred = detect(sample.text, threshold)
    predictions.append(pred)
    labels.append(sample.label)

accuracy  = count_correct(predictions, labels) / len(labels)
precision = true_positives / (true_positives + false_positives)
recall    = true_positives / (true_positives + false_negatives)
f1        = 2 * precision * recall / (precision + recall)

Related Pages

Implemented By

Implementation:Avdvg_InjectGuard_Metric_And_Main

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment