Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Lakeraai Pint benchmark Dataset Preparation

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Benchmarking, Prompt_Injection
Last Updated 2026-02-14 14:00 GMT

Overview

A data ingestion pattern that normalizes diverse prompt injection datasets into a standardized schema required by the PINT Benchmark evaluation loop.

Description

The PINT Benchmark requires input data in a specific schema: a pandas DataFrame with three columns — text (the input prompt), category (a grouping label for results aggregation), and label (a boolean indicating whether the text is a prompt injection). This principle covers the process of acquiring, formatting, and loading datasets into this schema.

Datasets can originate from multiple sources:

  • YAML files: The default PINT dataset is distributed as a YAML file with entries containing text, category, and label fields.
  • HuggingFace Hub: Public datasets can be loaded via the datasets library and converted to DataFrames with appropriate column mapping.
  • Custom sources: Any data source that can be loaded into a pandas DataFrame.

The key challenge is schema normalization: different datasets use different column names, label formats, and category structures. The preparation step must map these to the PINT schema.

Usage

Use this pattern when you want to benchmark a detection system against a custom dataset (not the default PINT dataset). This is the first step in the Custom Dataset Benchmarking workflow and requires understanding both the source data format and the PINT schema requirements.

Theoretical Basis

The dataset preparation follows an ETL (Extract, Transform, Load) pattern:

# Abstract algorithm (NOT real implementation)

# 1. EXTRACT: Acquire raw data
raw_data = load_from_source(source)  # YAML, HF Hub, CSV, etc.

# 2. TRANSFORM: Map to PINT schema
dataframe = pd.DataFrame(raw_data)
dataframe["text"] = dataframe[source_text_column]
dataframe["category"] = derive_category(dataframe)  # User-defined mapping
dataframe["label"] = derive_label(dataframe)          # True = injection, False = benign

# 3. LOAD: Pass to benchmark
pint_benchmark(df=dataframe, ...)

Schema requirements:

Column Type Description
text str The input text to evaluate
category str Grouping label (e.g. "prompt_injection", "chat", "documents")
label bool True if text is a prompt injection, False if benign

Practical Guide

YAML Path

If your data is in YAML format matching the PINT schema:

# Load directly — no transformation needed
yaml_data = YAML().load(Path("path/to/dataset.yaml"))
df = pd.DataFrame.from_records(yaml_data)

HuggingFace Hub Path

If loading from HuggingFace:

  1. Load the dataset with load_dataset
  2. Convert to DataFrame with pd.DataFrame(dataset['split'])
  3. Map columns: ensure text, category, and label exist
  4. Set label to boolean True/False

Custom Data Path

For any other source:

  1. Load into a DataFrame using appropriate pandas reader
  2. Rename or create text, category, label columns
  3. Ensure label is boolean (not string "true"/"false")

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment