Workflow:Lakeraai Pint benchmark Custom Dataset Benchmarking
| Knowledge Sources | |
|---|---|
| Domains | AI_Security, Benchmarking, Data_Engineering |
| Last Updated | 2026-02-14 14:00 GMT |
Overview
End-to-end process for running the PINT Benchmark against a custom dataset by preparing data in the required schema and passing it to the benchmark function as a YAML file or pandas DataFrame.
Description
This workflow covers the procedure for evaluating any prompt injection detection system on a user-provided dataset rather than the default PINT dataset. The PINT Benchmark accepts data in two formats: a YAML file with a specific schema (text, category, label) or a pandas DataFrame with the same columns. This enables users to benchmark against domain-specific inputs, private datasets, or datasets from Hugging Face Hub. The process covers dataset acquisition, schema formatting, loading into the benchmark, and interpreting results on the custom data.
Usage
Execute this workflow when you need to evaluate a prompt injection detection system on data beyond the default PINT dataset. Typical triggers include testing against domain-specific prompts, validating detection on a particular language, assessing performance on proprietary attack patterns, or reproducing evaluations on publicly available datasets from Hugging Face Hub (e.g., lakera/gandalf_ignore_instructions).
Execution Steps
Step 1: Dataset Acquisition
Obtain or create the dataset to be used for benchmarking. This may involve downloading a dataset from Hugging Face Hub using the datasets library, exporting data from an internal system, or manually curating test cases. The dataset should contain representative examples of both injection attempts and benign inputs for meaningful evaluation.
Key considerations:
- Install the datasets library if loading from Hugging Face Hub
- Ensure the dataset contains both positive (injection) and negative (benign) examples
- Consider including multiple categories for granular performance analysis
- Use the example-dataset.yaml in benchmark/data/ as a schema reference
Step 2: Dataset Formatting
Transform the raw dataset into the PINT Benchmark's required schema. Each record must have three fields: text (the input string), category (an arbitrary classification tag), and label (a boolean indicating whether the input is a known injection). The dataset can be structured as a YAML file or a pandas DataFrame.
Key considerations:
- The text field contains the raw input string to evaluate
- The category field can use arbitrary labels for grouping results
- The label field must be a boolean: True for injections, False for benign
- For YAML format, follow the structure in benchmark/data/example-dataset.yaml
- For DataFrame format, ensure columns are named exactly text, category, label
Pseudocode:
Load raw data into working format Map source fields to PINT schema (text, category, label) Assign boolean labels based on ground truth Validate schema compliance Output as YAML file or pandas DataFrame
Step 3: Dataset Loading
Provide the formatted dataset to the pint_benchmark() function. For YAML files, pass the file path via the path argument. For DataFrames, pass the DataFrame object via the dataframe argument. The dataframe argument bypasses the default dataset loading logic entirely.
Key considerations:
- Use path=Path("path/to/dataset.yaml") for YAML files
- Use dataframe=df for pandas DataFrame input
- The dataframe argument takes precedence over path if both are provided
- Validate the dataset loads correctly with a small subset before running the full benchmark
Step 4: Benchmark Execution
Run the benchmark with the custom dataset. Optionally specify a custom eval_function and model_name if evaluating a system other than the default. The benchmark processes all inputs in the custom dataset and computes per-category accuracy using the categories defined in the dataset.
Key considerations:
- Custom categories produce custom result columns in the output
- The balanced score calculation adapts to whatever categories are present
- Combine custom datasets with custom eval functions for full flexibility
- Start with the example-dataset.yaml during development for faster iteration
Step 5: Results Review
Examine the benchmark output showing per-category performance on the custom dataset. Since the categories are user-defined, the results table columns reflect the custom category names. Use this to assess detection performance on domain-specific input types.
Key considerations:
- Custom category names appear as columns in the results table
- Results are not directly comparable to published PINT scores (different dataset)
- Save results for comparison across different detection systems on the same custom dataset
- Consider contributing valuable test cases back to the PINT Benchmark project