Principle:Avdvg InjectGuard Evaluation And Metrics
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Machine_Learning, Security |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A systematic methodology for measuring the effectiveness of a binary classification system using standard metrics: accuracy, precision, recall, and F1-score.
Description
Evaluation and metrics is the process of quantifying how well the detection system performs on a labeled test dataset. In the context of prompt injection detection, this means running every sample in a test set through the detection function, collecting predictions, and computing aggregate performance metrics against ground-truth labels.
The four standard binary classification metrics used are:
- Accuracy: Fraction of correct predictions overall. Can be misleading with class imbalance.
- Precision: Of all inputs flagged as malicious, what fraction truly are. Measures false positive rate.
- Recall: Of all truly malicious inputs, what fraction were detected. Measures false negative rate.
- F1-score: Harmonic mean of precision and recall. Balances both error types.
For security applications, recall is often prioritized (missing a real attack is more dangerous than a false alarm), but the threshold parameter allows operators to tune this tradeoff.
Usage
Use this principle whenever validating or benchmarking a detection system. It should be applied on a held-out labeled test set that was not used to build the vector store. It is also useful for comparing different threshold values (sim_k) or different embedding models.
Theoretical Basis
Given predictions and true labels for a binary classification task:
Where:
- TP = True Positives (correctly detected malicious inputs)
- TN = True Negatives (correctly passed benign inputs)
- FP = False Positives (benign inputs incorrectly flagged)
- FN = False Negatives (malicious inputs missed)
Pseudo-code:
# Abstract evaluation algorithm
predictions = []
labels = []
for sample in test_dataset:
pred = detect(sample.text, threshold)
predictions.append(pred)
labels.append(sample.label)
accuracy = count_correct(predictions, labels) / len(labels)
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)