Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Lakeraai Pint benchmark Custom Dataset Benchmarking

From Leeroopedia
Knowledge Sources
Domains AI_Security, Benchmarking, Data_Engineering
Last Updated 2026-02-14 14:00 GMT

Overview

End-to-end process for running the PINT Benchmark against a custom dataset by preparing data in the required schema and passing it to the benchmark function as a YAML file or pandas DataFrame.

Description

This workflow covers the procedure for evaluating any prompt injection detection system on a user-provided dataset rather than the default PINT dataset. The PINT Benchmark accepts data in two formats: a YAML file with a specific schema (text, category, label) or a pandas DataFrame with the same columns. This enables users to benchmark against domain-specific inputs, private datasets, or datasets from Hugging Face Hub. The process covers dataset acquisition, schema formatting, loading into the benchmark, and interpreting results on the custom data.

Usage

Execute this workflow when you need to evaluate a prompt injection detection system on data beyond the default PINT dataset. Typical triggers include testing against domain-specific prompts, validating detection on a particular language, assessing performance on proprietary attack patterns, or reproducing evaluations on publicly available datasets from Hugging Face Hub (e.g., lakera/gandalf_ignore_instructions).

Execution Steps

Step 1: Dataset Acquisition

Obtain or create the dataset to be used for benchmarking. This may involve downloading a dataset from Hugging Face Hub using the datasets library, exporting data from an internal system, or manually curating test cases. The dataset should contain representative examples of both injection attempts and benign inputs for meaningful evaluation.

Key considerations:

  • Install the datasets library if loading from Hugging Face Hub
  • Ensure the dataset contains both positive (injection) and negative (benign) examples
  • Consider including multiple categories for granular performance analysis
  • Use the example-dataset.yaml in benchmark/data/ as a schema reference

Step 2: Dataset Formatting

Transform the raw dataset into the PINT Benchmark's required schema. Each record must have three fields: text (the input string), category (an arbitrary classification tag), and label (a boolean indicating whether the input is a known injection). The dataset can be structured as a YAML file or a pandas DataFrame.

Key considerations:

  • The text field contains the raw input string to evaluate
  • The category field can use arbitrary labels for grouping results
  • The label field must be a boolean: True for injections, False for benign
  • For YAML format, follow the structure in benchmark/data/example-dataset.yaml
  • For DataFrame format, ensure columns are named exactly text, category, label

Pseudocode:

Load raw data into working format
Map source fields to PINT schema (text, category, label)
Assign boolean labels based on ground truth
Validate schema compliance
Output as YAML file or pandas DataFrame

Step 3: Dataset Loading

Provide the formatted dataset to the pint_benchmark() function. For YAML files, pass the file path via the path argument. For DataFrames, pass the DataFrame object via the dataframe argument. The dataframe argument bypasses the default dataset loading logic entirely.

Key considerations:

  • Use path=Path("path/to/dataset.yaml") for YAML files
  • Use dataframe=df for pandas DataFrame input
  • The dataframe argument takes precedence over path if both are provided
  • Validate the dataset loads correctly with a small subset before running the full benchmark

Step 4: Benchmark Execution

Run the benchmark with the custom dataset. Optionally specify a custom eval_function and model_name if evaluating a system other than the default. The benchmark processes all inputs in the custom dataset and computes per-category accuracy using the categories defined in the dataset.

Key considerations:

  • Custom categories produce custom result columns in the output
  • The balanced score calculation adapts to whatever categories are present
  • Combine custom datasets with custom eval functions for full flexibility
  • Start with the example-dataset.yaml during development for faster iteration

Step 5: Results Review

Examine the benchmark output showing per-category performance on the custom dataset. Since the categories are user-defined, the results table columns reflect the custom category names. Use this to assess detection performance on domain-specific input types.

Key considerations:

  • Custom category names appear as columns in the results table
  • Results are not directly comparable to published PINT scores (different dataset)
  • Save results for comparison across different detection systems on the same custom dataset
  • Consider contributing valuable test cases back to the PINT Benchmark project

Execution Diagram

GitHub URL

Workflow Repository