Workflow:Lakeraai Pint benchmark Custom Dataset Benchmarking

Knowledge Sources	PINT Benchmark Pandas DataFrame Hugging Face Datasets
Domains	AI_Security, Benchmarking, Data_Engineering
Last Updated	2026-02-14 14:00 GMT

Overview

End-to-end process for running the PINT Benchmark against a custom dataset by preparing data in the required schema and passing it to the benchmark function as a YAML file or pandas DataFrame.

Description

This workflow covers the procedure for evaluating any prompt injection detection system on a user-provided dataset rather than the default PINT dataset. The PINT Benchmark accepts data in two formats: a YAML file with a specific schema (text, category, label) or a pandas DataFrame with the same columns. This enables users to benchmark against domain-specific inputs, private datasets, or datasets from Hugging Face Hub. The process covers dataset acquisition, schema formatting, loading into the benchmark, and interpreting results on the custom data.

Usage

Execute this workflow when you need to evaluate a prompt injection detection system on data beyond the default PINT dataset. Typical triggers include testing against domain-specific prompts, validating detection on a particular language, assessing performance on proprietary attack patterns, or reproducing evaluations on publicly available datasets from Hugging Face Hub (e.g., lakera/gandalf_ignore_instructions).

Execution Steps

Step 1: Dataset Acquisition

Obtain or create the dataset to be used for benchmarking. This may involve downloading a dataset from Hugging Face Hub using the datasets library, exporting data from an internal system, or manually curating test cases. The dataset should contain representative examples of both injection attempts and benign inputs for meaningful evaluation.

Key considerations:

Install the datasets library if loading from Hugging Face Hub
Ensure the dataset contains both positive (injection) and negative (benign) examples
Consider including multiple categories for granular performance analysis
Use the example-dataset.yaml in benchmark/data/ as a schema reference

Step 2: Dataset Formatting

Transform the raw dataset into the PINT Benchmark's required schema. Each record must have three fields: text (the input string), category (an arbitrary classification tag), and label (a boolean indicating whether the input is a known injection). The dataset can be structured as a YAML file or a pandas DataFrame.

Key considerations:

The text field contains the raw input string to evaluate
The category field can use arbitrary labels for grouping results
The label field must be a boolean: True for injections, False for benign
For YAML format, follow the structure in benchmark/data/example-dataset.yaml
For DataFrame format, ensure columns are named exactly text, category, label

Pseudocode:

Load raw data into working format
Map source fields to PINT schema (text, category, label)
Assign boolean labels based on ground truth
Validate schema compliance
Output as YAML file or pandas DataFrame

Step 3: Dataset Loading

Provide the formatted dataset to the pint_benchmark() function. For YAML files, pass the file path via the path argument. For DataFrames, pass the DataFrame object via the dataframe argument. The dataframe argument bypasses the default dataset loading logic entirely.

Key considerations:

Use path=Path("path/to/dataset.yaml") for YAML files
Use dataframe=df for pandas DataFrame input
The dataframe argument takes precedence over path if both are provided
Validate the dataset loads correctly with a small subset before running the full benchmark

Step 4: Benchmark Execution

Run the benchmark with the custom dataset. Optionally specify a custom eval_function and model_name if evaluating a system other than the default. The benchmark processes all inputs in the custom dataset and computes per-category accuracy using the categories defined in the dataset.

Key considerations:

Custom categories produce custom result columns in the output
The balanced score calculation adapts to whatever categories are present
Combine custom datasets with custom eval functions for full flexibility
Start with the example-dataset.yaml during development for faster iteration

Step 5: Results Review

Examine the benchmark output showing per-category performance on the custom dataset. Since the categories are user-defined, the results table columns reflect the custom category names. Use this to assess detection performance on domain-specific input types.

Key considerations:

Custom category names appear as columns in the results table
Results are not directly comparable to published PINT scores (different dataset)
Save results for comparison across different detection systems on the same custom dataset
Consider contributing valuable test cases back to the PINT Benchmark project

Execution Diagram

GitHub URL

Workflow Repository