Principle:Lakeraai Pint benchmark Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Benchmarking, Prompt_Injection |
| Last Updated | 2026-02-14 14:00 GMT |
Overview
A data ingestion pattern that normalizes diverse prompt injection datasets into a standardized schema required by the PINT Benchmark evaluation loop.
Description
The PINT Benchmark requires input data in a specific schema: a pandas DataFrame with three columns — text (the input prompt), category (a grouping label for results aggregation), and label (a boolean indicating whether the text is a prompt injection). This principle covers the process of acquiring, formatting, and loading datasets into this schema.
Datasets can originate from multiple sources:
- YAML files: The default PINT dataset is distributed as a YAML file with entries containing
text,category, andlabelfields. - HuggingFace Hub: Public datasets can be loaded via the
datasetslibrary and converted to DataFrames with appropriate column mapping. - Custom sources: Any data source that can be loaded into a pandas DataFrame.
The key challenge is schema normalization: different datasets use different column names, label formats, and category structures. The preparation step must map these to the PINT schema.
Usage
Use this pattern when you want to benchmark a detection system against a custom dataset (not the default PINT dataset). This is the first step in the Custom Dataset Benchmarking workflow and requires understanding both the source data format and the PINT schema requirements.
Theoretical Basis
The dataset preparation follows an ETL (Extract, Transform, Load) pattern:
# Abstract algorithm (NOT real implementation)
# 1. EXTRACT: Acquire raw data
raw_data = load_from_source(source) # YAML, HF Hub, CSV, etc.
# 2. TRANSFORM: Map to PINT schema
dataframe = pd.DataFrame(raw_data)
dataframe["text"] = dataframe[source_text_column]
dataframe["category"] = derive_category(dataframe) # User-defined mapping
dataframe["label"] = derive_label(dataframe) # True = injection, False = benign
# 3. LOAD: Pass to benchmark
pint_benchmark(df=dataframe, ...)
Schema requirements:
| Column | Type | Description |
|---|---|---|
| text | str | The input text to evaluate |
| category | str | Grouping label (e.g. "prompt_injection", "chat", "documents") |
| label | bool | True if text is a prompt injection, False if benign |
Practical Guide
YAML Path
If your data is in YAML format matching the PINT schema:
# Load directly — no transformation needed
yaml_data = YAML().load(Path("path/to/dataset.yaml"))
df = pd.DataFrame.from_records(yaml_data)
HuggingFace Hub Path
If loading from HuggingFace:
- Load the dataset with
load_dataset - Convert to DataFrame with
pd.DataFrame(dataset['split']) - Map columns: ensure
text,category, andlabelexist - Set
labelto boolean True/False
Custom Data Path
For any other source:
- Load into a DataFrame using appropriate pandas reader
- Rename or create
text,category,labelcolumns - Ensure
labelis boolean (not string "true"/"false")