Workflow: Lakeraai PINT Benchmark Custom System Evaluation
| Knowledge Sources | |
|---|---|
| Domains | AI_Security, Benchmarking, Prompt_Injection_Detection |
| Last Updated | 2026-02-14 14:00 GMT |
Overview
End-to-end process for benchmarking a custom (non-Hugging Face) prompt injection detection system against the PINT dataset by writing a custom evaluation function callback.
Description
This workflow covers the procedure for evaluating any prompt injection detection system that is not a standard Hugging Face model. This includes commercial API services (such as Lakera Guard, AWS Bedrock Guardrails, Azure AI Prompt Shield, Google Model Armor), third-party libraries (such as WhyLabs LangKit), or custom in-house detection logic. The user writes a thin adapter function that accepts a single text string and returns a boolean indicating whether an injection was detected. This function is then passed to pint_benchmark() as a callback, and the benchmark handles dataset iteration, scoring, and result generation.
Usage
Execute this workflow when you want to benchmark a prompt injection detection system that does not use the standard Hugging Face text-classification pipeline. This includes API-based commercial services, custom rule-based detectors, ensemble systems, or any detection tool that exposes a programmatic interface returning injection verdicts. The custom eval function pattern is the most flexible integration path into the PINT Benchmark.
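Because the callback only has to map a string to a boolean, an ensemble system can be wrapped into the same shape. The sketch below is one way to do that, assuming each member detector is already a str -> bool function. `make_ensemble`, `min_votes`, and the two toy detectors are hypothetical names for illustration, not part of the PINT Benchmark.

```python
from typing import Callable, Sequence

# A detector is any callable mapping input text to an injection verdict.
Detector = Callable[[str], bool]


def make_ensemble(detectors: Sequence[Detector], min_votes: int = 1) -> Detector:
    """Combine several str -> bool detectors into a single str -> bool callback.

    The input is flagged once at least `min_votes` members flag it:
    min_votes=1 is an "any" vote, min_votes=len(detectors) a unanimous vote.
    """
    if not 1 <= min_votes <= len(detectors):
        raise ValueError("min_votes must be between 1 and len(detectors)")

    def eval_function(text: str) -> bool:
        votes = 0
        for detect in detectors:
            if detect(text):
                votes += 1
                if votes >= min_votes:
                    return True  # stop early: remaining members are not called
        return False

    return eval_function


# Hypothetical toy members, for illustration only.
def keyword_detector(text: str) -> bool:
    return "ignore previous instructions" in text.lower()


def length_detector(text: str) -> bool:
    return len(text) > 2000


# Flag an input if either member flags it.
any_vote = make_ensemble([keyword_detector, length_detector], min_votes=1)
```

Stopping at the first sufficient vote avoids unnecessary calls, which matters when members are paid or rate-limited APIs; remove the early return if you want every member's verdict for analysis.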
Execution Steps
Step 1: Environment Setup
Install the PINT Benchmark dependencies via Poetry and any additional libraries required by the target detection system. For API-based services, configure authentication credentials (API keys, endpoints). For library-based tools, install the relevant package in the benchmark notebook.
Key considerations:
- Use Poetry for the base environment (poetry install)
- Install detection system dependencies via pip in the notebook
- Store API keys securely; do not commit credentials to the repository
- Verify network connectivity for cloud-based detection APIs
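A minimal sketch of keeping credentials out of the notebook, assuming the key has been exported as an environment variable before the notebook starts. The variable name `DETECTOR_API_KEY` and the helper `load_api_key` are placeholders, not something the benchmark defines.

```python
import os


def load_api_key(var_name: str = "DETECTOR_API_KEY") -> str:
    """Return an API key read from the environment, failing fast if it is absent.

    The default variable name is a placeholder; use whatever your provider's
    SDK or your own configuration expects.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it in the shell that launches the notebook"
        )
    return key
```

Failing before the first dataset input is scored is cheaper than discovering a missing key after thousands of failed calls.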
Step 2: Define Evaluation Function
Write a custom Python function that conforms to the PINT Benchmark's callback interface. The function must accept a single str parameter (the input text to classify) and return a bool (True if prompt injection is detected, False otherwise). The function encapsulates all interaction with the detection system, including API calls, response parsing, and threshold application.
Key considerations:
- Function signature must be eval_function(text: str) -> bool
- Handle any necessary response parsing (e.g., confidence thresholds, label mapping)
- Consider rate limiting for API-based services
- The function is called once per dataset input (4,314 times for the full PINT dataset)
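For API-based services, one simple way to respect a provider's rate limit is to enforce a minimum interval between calls. The decorator below is a sketch under that assumption (a fixed requests-per-second budget); `rate_limited` and `min_interval` are illustrative names, and providers with burst or token-bucket quotas may need something more elaborate.

```python
import threading
import time
from functools import wraps
from typing import Callable

EvalFn = Callable[[str], bool]


def rate_limited(min_interval: float) -> Callable[[EvalFn], EvalFn]:
    """Decorator that spaces the start of successive calls at least `min_interval` seconds apart."""

    def decorator(fn: EvalFn) -> EvalFn:
        lock = threading.Lock()
        last_start = float("-inf")  # no call has happened yet

        @wraps(fn)
        def wrapper(text: str) -> bool:
            nonlocal last_start
            with lock:  # keeps the spacing correct if called from several threads
                wait = last_start + min_interval - time.monotonic()
                if wait > 0:
                    time.sleep(wait)
                last_start = time.monotonic()
            return fn(text)

        return wrapper

    return decorator


@rate_limited(min_interval=0.2)  # about 5 requests per second
def eval_function(text: str) -> bool:
    # Placeholder body: call your detection API here.
    return "ignore previous instructions" in text.lower()
```

With min_interval=0.2, the 4,314 calls of a full PINT run cannot finish in less than about 14 minutes (4,313 gaps × 0.2 s ≈ 863 s); because spacing is measured between call start times, API latency below 0.2 s does not add to that floor.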
Pseudocode:
1. Define a function that accepts a text string
2. Call the detection system with the text
3. Parse the response to extract the injection verdict
4. Apply a threshold or label mapping
5. Return the boolean result
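The pseudocode translates into Python roughly as follows. This is a sketch, not a real client: the response shape (`{"label": ..., "score": ...}`), the label set in `INJECTION_LABELS`, and the stand-in `fake_detector` are all assumptions to be replaced with your detection system's actual API and response format.

```python
from typing import Any, Callable, Mapping

# Assumed label vocabulary; map your system's labels onto these.
INJECTION_LABELS = {"INJECTION", "JAILBREAK"}


def make_eval_function(
    call_detector: Callable[[str], Mapping[str, Any]],
    threshold: float = 0.5,
) -> Callable[[str], bool]:
    """Build an eval_function(text: str) -> bool around a detector client.

    Assumes the detector returns {"label": str, "score": float}, where score
    is the confidence of that label.
    """

    def eval_function(text: str) -> bool:
        response = call_detector(text)  # call the detection system
        label = str(response.get("label", "")).upper()  # parse the verdict
        score = float(response.get("score", 0.0))
        # Label mapping plus confidence threshold; the result is a plain bool.
        return label in INJECTION_LABELS and score >= threshold

    return eval_function


def fake_detector(text: str) -> dict:
    """Stand-in for a real API client, used only so the sketch runs."""
    if "ignore previous instructions" in text.lower():
        return {"label": "injection", "score": 0.97}
    return {"label": "safe", "score": 0.99}


eval_function = make_eval_function(fake_detector, threshold=0.8)
```

Keeping the client injectable (`call_detector`) lets you check the threshold and label mapping offline before spending API calls on the full dataset.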
Step 3: Benchmark Execution
Pass the custom evaluation function to pint_benchmark() along with a descriptive model_name string. The benchmark loads the PINT dataset, iterates over all inputs, calls the evaluation function for each, collects predictions, and computes per-category accuracy and the balanced PINT score.
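To make the scoring step concrete, the loop below approximates what the step describes: one callback invocation per input, predictions compared with ground truth, accuracy per category, and an unweighted mean across categories. It is an illustrative re-implementation, not the code inside pint_benchmark(); the row fields `text`, `category`, and `label` are assumed names.

```python
from collections import defaultdict
from typing import Callable, Iterable, Mapping


def score_detector(
    rows: Iterable[Mapping[str, object]],
    eval_function: Callable[[str], bool],
) -> tuple[dict[str, float], float]:
    """Return (per-category accuracy, balanced score).

    Each row is assumed to have "text", "category" and a boolean "label"
    (True = prompt injection). The balanced score is the unweighted mean of
    the per-category accuracies.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for row in rows:
        prediction = eval_function(str(row["text"]))  # one callback call per input
        category = str(row["category"])
        total[category] += 1
        correct[category] += int(prediction == bool(row["label"]))
    if not total:
        raise ValueError("no rows to score")
    accuracy = {category: correct[category] / total[category] for category in total}
    balanced = sum(accuracy.values()) / len(accuracy)
    return accuracy, balanced
```

If the dataset is loaded into a pandas DataFrame, `df.to_dict("records")` yields rows in this shape, provided the column names match.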
Key considerations:
- Provide a descriptive model_name for clear result labeling
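- Runtime grows with the detection system's per-call latency because the eval function runs once per input; slow or rate-limited APIs can make a full run take considerably longer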
- The benchmark uses the default PINT dataset unless a custom dataset or DataFrame is provided
- Add error handling inside the eval function (timeouts, API errors, malformed responses) so that a single failing input does not halt the full run
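Since an unhandled exception on a single input would otherwise abort the whole run, it helps to wrap the callback in a fallback. The sketch assumes that returning a fixed verdict and counting the failure is acceptable; `with_fallback` is a hypothetical helper, and the commented-out `pint_benchmark` call assumes the callback is passed as `eval_function=`, which you should check against the signature in the benchmark notebook.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def with_fallback(
    eval_function: Callable[[str], bool],
    fallback: bool = False,
) -> Callable[[str], bool]:
    """Wrap an eval function so that an exception yields `fallback` instead of aborting.

    The number of failed calls is kept in `wrapper.failures` so it can be
    reported next to the score.
    """

    def wrapper(text: str) -> bool:
        try:
            return bool(eval_function(text))
        except Exception:  # e.g. timeouts, HTTP errors, malformed responses
            wrapper.failures += 1
            logger.warning("eval_function failed; returning fallback=%s", fallback, exc_info=True)
            return fallback

    wrapper.failures = 0
    return wrapper


# Hypothetical usage; the callback keyword of pint_benchmark() is an assumption:
# safe_eval = with_fallback(eval_function)
# pint_benchmark(eval_function=safe_eval, model_name="My Detector v1 (threshold 0.8)")
# print(f"{safe_eval.failures} inputs fell back to the default verdict")
```

The fallback is not neutral: with fallback=False, every failed call on an injection input counts as a miss and every failed call on a benign input counts as correct. Report `failures` alongside the score, or rerun the failed inputs, rather than letting fallbacks silently shift the result.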
Step 4: Results Analysis
Review the benchmark output table showing per-category accuracy breakdown and the overall balanced PINT score. Compare results against the published leaderboard. Analyze category-specific performance to understand the system's strengths (e.g., jailbreak detection) and weaknesses (e.g., false positive rate on hard negatives).
Key considerations:
- The balanced score weights all categories equally
- High hard_negatives accuracy indicates a low false positive rate on benign inputs that superficially resemble injections
- Results should be verified by the Lakera team before official inclusion in the leaderboard
- Save results as markdown for reproducibility
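A minimal sketch for saving results as markdown, assuming per-category accuracy is available as a dict mapping category name to a fraction in [0, 1]. `results_to_markdown` is a hypothetical helper; it computes the balanced score as the unweighted mean of category accuracies, matching the equal weighting described above, and the hard_negatives line relies on every hard negative being benign.

```python
from typing import Mapping


def results_to_markdown(model_name: str, accuracy: Mapping[str, float]) -> str:
    """Render per-category accuracy and the balanced score as a markdown table.

    `accuracy` maps category name to a fraction in [0, 1]. The balanced score
    is the unweighted mean, so every category counts equally.
    """
    if not accuracy:
        raise ValueError("accuracy is empty")
    lines = [f"## {model_name}", "", "| Category | Accuracy |", "|---|---|"]
    for category in sorted(accuracy):
        lines.append(f"| {category} | {accuracy[category]:.2%} |")
    balanced = sum(accuracy.values()) / len(accuracy)
    lines.append(f"| **Balanced score** | {balanced:.2%} |")
    if "hard_negatives" in accuracy:
        # Hard negatives are all benign, so accuracy there is 1 - false positive rate.
        fpr = 1 - accuracy["hard_negatives"]
        lines += ["", f"False positive rate on hard_negatives: {fpr:.2%}"]
    return "\n".join(lines) + "\n"
```

The returned string can be written with, for example, `Path("pint-results.md").write_text(md, encoding="utf-8")` from pathlib and kept next to the notebook, together with the model_name and any threshold used.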