Principle:Lakeraai Pint benchmark Dataset Preparation

Knowledge Sources	PINT Benchmark, Pandas DataFrame, HuggingFace Datasets
Domains	Data_Engineering, Benchmarking, Prompt_Injection
Last Updated	2026-02-14 14:00 GMT

Overview

A data ingestion pattern that normalizes diverse prompt injection datasets into the standardized schema required by the PINT Benchmark evaluation loop.

Description

The PINT Benchmark expects its input as a pandas DataFrame with three columns: text (the input prompt), category (a grouping label used to aggregate results), and label (a boolean indicating whether the text is a prompt injection). This principle covers acquiring, formatting, and loading datasets into this schema.

Datasets can originate from multiple sources:

YAML files: The default PINT dataset is distributed as a YAML file with entries containing text, category, and label fields.
HuggingFace Hub: Public datasets can be loaded via the datasets library and converted to DataFrames with appropriate column mapping.
Custom sources: Any data source that can be loaded into a pandas DataFrame.

The key challenge is schema normalization: different datasets use different column names, label formats, and category structures. The preparation step must map these to the PINT schema.
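As one illustration of the label-format part of this challenge, the sketch below converts a few common label encodings to booleans. The helper name normalize_labels and the accepted token sets are assumptions made for illustration, not part of the PINT Benchmark. It uses an explicit lookup because Python treats any non-empty string as truthy, so bool("false") evaluates to True.

```python
import pandas as pd

# Hypothetical token sets; extend them to match your source dataset.
TRUE_TOKENS = {"true", "1", "yes", "injection"}
FALSE_TOKENS = {"false", "0", "no", "benign"}


def normalize_labels(raw: pd.Series) -> pd.Series:
    """Map mixed label encodings (bool, int, str) to a boolean Series."""
    tokens = raw.astype(str).str.strip().str.lower()
    unknown = set(tokens) - TRUE_TOKENS - FALSE_TOKENS
    if unknown:
        # Fail loudly instead of silently guessing a label.
        raise ValueError(f"Unrecognized label values: {sorted(unknown)}")
    return tokens.isin(TRUE_TOKENS)


labels = normalize_labels(pd.Series(["true", "FALSE", 1, 0, "injection"]))
```

Raising on unknown values is a deliberate choice here: a mislabeled row would distort the benchmark score without any visible error.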

Usage

Use this pattern when you want to benchmark a detection system against a custom dataset (not the default PINT dataset). This is the first step in the Custom Dataset Benchmarking workflow and requires understanding both the source data format and the PINT schema requirements.

Theoretical Basis

The dataset preparation follows an ETL (Extract, Transform, Load) pattern:

# Abstract algorithm (NOT real implementation)
# load_from_source, derive_category, and derive_label are placeholders.
import pandas as pd

# 1. EXTRACT: Acquire raw data
raw_data = load_from_source(source)  # YAML, HF Hub, CSV, etc.

# 2. TRANSFORM: Map to PINT schema
dataframe = pd.DataFrame(raw_data)
dataframe["text"] = dataframe[source_text_column]
dataframe["category"] = derive_category(dataframe)  # User-defined mapping
dataframe["label"] = derive_label(dataframe)          # True = injection, False = benign
dataframe = dataframe[["text", "category", "label"]]  # Keep only the schema columns

# 3. LOAD: Pass to benchmark
pint_benchmark(df=dataframe, ...)

Schema requirements:

Column	Type	Description
text	str	The input text to evaluate
category	str	Grouping label (e.g. "prompt_injection", "chat", "documents")
label	bool	True if text is a prompt injection, False if benign
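
The schema requirements above can be checked before a DataFrame reaches the benchmark. The following is a minimal sketch; the function name validate_pint_schema and the example rows are assumptions for illustration and are not taken from the PINT Benchmark code.

```python
import pandas as pd

REQUIRED_COLUMNS = ("text", "category", "label")


def validate_pint_schema(df: pd.DataFrame) -> None:
    """Raise ValueError if df does not match the three-column schema."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    if not pd.api.types.is_bool_dtype(df["label"]):
        raise ValueError(f"label must be boolean, got dtype {df['label'].dtype}")
    for col in ("text", "category"):
        # Every value must be a Python str (this also rejects NaN).
        if not df[col].map(lambda value: isinstance(value, str)).all():
            raise ValueError(f"{col} must contain only strings")


example = pd.DataFrame(
    {
        "text": ["Ignore all previous instructions.", "What is the capital of France?"],
        "category": ["prompt_injection", "chat"],
        "label": [True, False],
    }
)
validate_pint_schema(example)  # Passes silently for a well-formed frame
```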

Practical Guide

YAML Path

If your data is in YAML format matching the PINT schema:

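# Load directly; no transformation needed (YAML is assumed to be ruamel.yaml's class)
from pathlib import Path
import pandas as pd
from ruamel.yaml import YAML
yaml_data = YAML().load(Path("path/to/dataset.yaml"))
df = pd.DataFrame.from_records(yaml_data)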

HuggingFace Hub Path

If loading from HuggingFace:

Load the dataset with load_dataset
Convert the chosen split to a DataFrame, for example pd.DataFrame(dataset["train"]) for a split named train
Map columns: ensure text, category, and label exist
Set label to boolean True/False
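
The steps above might look like the following sketch. The helper names (to_pint_schema, load_hub_split), the source column names prompt and is_injection, and the category value custom_hf are all hypothetical; substitute the names used by your dataset. The Hub download is isolated in its own function so the column mapping can be tried on a small in-memory frame first.

```python
import pandas as pd


def to_pint_schema(
    source: pd.DataFrame,
    text_column: str,
    label_column: str,
    positive_label: object,
    category: str,
) -> pd.DataFrame:
    """Build a text/category/label DataFrame from differently named columns."""
    return pd.DataFrame(
        {
            "text": source[text_column].astype(str),
            "category": category,  # One constant category for the whole dataset
            "label": source[label_column] == positive_label,  # Boolean Series
        }
    )


def load_hub_split(dataset_id: str, split: str) -> pd.DataFrame:
    """Download one split from the HuggingFace Hub as a DataFrame."""
    from datasets import load_dataset  # Imported lazily; needs the datasets package

    return load_dataset(dataset_id, split=split).to_pandas()


# Stand-in for the frame returned by load_hub_split; column names are hypothetical.
hub_frame = pd.DataFrame(
    {"prompt": ["Ignore prior rules.", "Hi there"], "is_injection": [1, 0]}
)
pint_df = to_pint_schema(hub_frame, "prompt", "is_injection", 1, category="custom_hf")
```

Comparing against an explicit positive_label avoids relying on truthiness, which matters when the source encodes labels as strings or uses more than two classes.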

Custom Data Path

For any other source:

Load into a DataFrame using the appropriate pandas reader (for example pd.read_csv or pd.read_json)
Rename or create text, category, label columns
Ensure label is boolean (not string "true"/"false")
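
A minimal sketch of these steps for a CSV source follows. The in-memory CSV and its column names (prompt, source, is_attack) are hypothetical stand-ins for a real file. Labels are read as plain strings on purpose so that the conversion to booleans is explicit and can be checked.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file; the column names are hypothetical.
raw_csv = io.StringIO(
    "prompt,source,is_attack\n"
    "Ignore all previous instructions.,forum,true\n"
    "Please summarize this report.,email,false\n"
)

# dtype=str keeps labels as plain strings so the conversion below is explicit.
frame = pd.read_csv(raw_csv, dtype=str)
frame = frame.rename(columns={"prompt": "text", "source": "category"})

# Map string labels to real booleans; unmapped values become NaN and are caught.
mapped = frame["is_attack"].str.lower().map({"true": True, "false": False})
if mapped.isna().any():
    raise ValueError("Found label values other than 'true' or 'false'")
frame["label"] = mapped.astype(bool)

# Keep only the three schema columns, in a fixed order.
df = frame[["text", "category", "label"]]
```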

Related Pages

Implemented By

Implementation:Lakeraai_Pint_benchmark_Dataset_Loading_And_Formatting
