
Principle:Avdvg InjectGuard Evaluation And Metrics

From Leeroopedia
Knowledge Sources
Domains Evaluation, Machine_Learning, Security
Last Updated 2026-02-14 16:00 GMT

Overview

A systematic methodology for measuring the effectiveness of a binary classification system using standard metrics: accuracy, precision, recall, and F1-score.

Description

Evaluation and metrics is the process of quantifying how well the detection system performs on a labeled test dataset. In the context of prompt injection detection, this means running every sample in a test set through the detection function, collecting predictions, and computing aggregate performance metrics against ground-truth labels.

The four standard binary classification metrics used are:

  • Accuracy: Fraction of correct predictions overall. Can be misleading under class imbalance.
  • Precision: Of all inputs flagged as malicious, the fraction that truly are. High precision means few false positives.
  • Recall: Of all truly malicious inputs, the fraction that were detected. High recall means few false negatives.
  • F1-score: Harmonic mean of precision and recall. Balances both error types.

For security applications, recall is often prioritized (missing a real attack is more dangerous than a false alarm), but the threshold parameter allows operators to tune this tradeoff.
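The four metrics above can be sketched as a single helper function. This is a minimal illustration in plain Python; the function name and the example labels are hypothetical, with 1 marking a malicious input and 0 a benign one:

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = malicious)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Hypothetical detector output on an 8-sample test set:
acc, prec, rec, f1 = classification_metrics(
    [1, 1, 1, 0, 0, 0, 0, 1],   # ground truth
    [1, 1, 0, 0, 0, 1, 0, 1],   # predictions (one miss, one false alarm)
)
```

With one false negative and one false positive on this toy set, all four metrics come out to 0.75, which illustrates why F1 sits between precision and recall.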

Usage

Use this principle whenever validating or benchmarking a detection system. It should be applied on a held-out labeled test set that was not used to build the vector store. It is also useful for comparing different threshold values (sim_k) or different embedding models.
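Comparing threshold values can be sketched as a simple sweep over candidate thresholds on the held-out set, picking the one that maximizes F1 (or recall, if that is the priority). The similarity scores, labels, and candidate thresholds below are hypothetical:

```python
# Hypothetical similarity scores from the detector and ground-truth labels
scores = [0.91, 0.85, 0.40, 0.30, 0.78, 0.15, 0.66, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

def f1_at_threshold(scores, labels, threshold):
    """Flag every sample whose score meets the threshold, then compute F1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Sweep candidate thresholds and keep the best (F1, threshold) pair
best_f1, best_t = max((f1_at_threshold(scores, labels, t), t)
                      for t in [0.5, 0.6, 0.7, 0.8])
```

The same loop works for comparing embedding models: fix the threshold, swap the model that produces the scores, and compare the resulting metrics.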

Theoretical Basis

Given predictions ŷ and true labels y for a binary classification task:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 · Precision · Recall / (Precision + Recall)

Where:

  • TP = True Positives (correctly detected malicious inputs)
  • TN = True Negatives (correctly passed benign inputs)
  • FP = False Positives (benign inputs incorrectly flagged)
  • FN = False Negatives (malicious inputs missed)
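Plugging hypothetical counts into these formulas makes the definitions concrete. Assume a 100-sample test set with the confusion-matrix counts below (all values invented for illustration):

```python
# Hypothetical confusion-matrix counts for a 100-sample test set
tp, tn, fp, fn = 45, 40, 10, 5

accuracy  = (tp + tn) / (tp + tn + fp + fn)        # 85/100  = 0.85
precision = tp / (tp + fp)                         # 45/55  ~= 0.818
recall    = tp / (tp + fn)                         # 45/50   = 0.90
f1        = 2 * precision * recall / (precision + recall)
```

Note how the 10 false positives pull precision down more than the 5 false negatives pull recall down, even though accuracy alone would hide that asymmetry.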

Pseudo-code:

# Abstract evaluation algorithm
predictions = []
labels = []
for sample in test_dataset:
    predictions.append(detect(sample.text, threshold))
    labels.append(sample.label)

# Tally the confusion matrix (1 = malicious, 0 = benign)
tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)
fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)

accuracy  = (tp + tn) / len(labels)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall    = tp / (tp + fn) if (tp + fn) else 0.0
f1        = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

Related Pages

Implemented By
