Principle:Norrrrrrr lyn WAInjectBench JSONL Text Dataset Format
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A line-delimited JSON data format that organizes text samples with metadata for streaming-compatible prompt injection detection benchmarks.
Description
JSONL (JSON Lines) is a text format in which each line is a self-contained, valid JSON value; in this dataset every line is a JSON object. For text-based prompt injection detection, each line holds a single text sample: an integer identifier and the text content. Because records are independent lines, files can be streamed line by line, extended by appending, and processed with standard Unix tools. The WAInjectBench benchmark organizes these files into benign/ and malicious/ subdirectories, so the directory structure itself encodes the ground-truth label.
Usage
Use this format whenever preparing or consuming text data for the text prompt injection detection pipeline. Each JSONL file in the data/text/benign/ or data/text/malicious/ directory represents one dataset scenario.
Theoretical Basis
The JSONL schema for text detection is:
# Each line in a .jsonl file:
{"id": int, "text": str}
Directory layout:
data/text/
├── benign/
│   ├── scenario_a.jsonl   # Each line: {"id": 1, "text": "..."}
│   └── scenario_b.jsonl
└── malicious/
    ├── attack_x.jsonl
    └── attack_y.jsonl
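The folder name (benign or malicious) determines the ground-truth label for metric computation: samples under benign/ are used to compute the false positive rate (FPR), and samples under malicious/ are used to compute the true positive rate (TPR). Files are discovered via folder_path.glob("*.jsonl").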
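The labeling rule can be sketched as follows. This is an assumption-laden illustration, not the benchmark's own evaluation code: iter_labeled_samples and detection_rates are hypothetical names, the 0/1 label encoding is chosen here for convenience, detector stands for any callable mapping a text to a boolean, and rates are aggregated over all files in a folder, whereas the benchmark may report them per file.

```python
import json
from pathlib import Path


def iter_labeled_samples(root):
    """Yield (label, file_name, record); label is 0 for benign/, 1 for malicious/."""
    for folder_name, label in (("benign", 0), ("malicious", 1)):
        folder_path = Path(root) / folder_name
        # Same discovery pattern as described above; sorted for a stable order.
        for file_path in sorted(folder_path.glob("*.jsonl")):
            with open(file_path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        yield label, file_path.name, json.loads(line)


def detection_rates(root, detector):
    """Return (fpr, tpr) for a hypothetical detector: text -> bool."""
    flagged = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for label, _file_name, record in iter_labeled_samples(root):
        total[label] += 1
        flagged[label] += bool(detector(record["text"]))
    # FPR: share of benign samples flagged; TPR: share of malicious samples flagged.
    fpr = flagged[0] / total[0] if total[0] else float("nan")
    tpr = flagged[1] / total[1] if total[1] else float("nan")
    return fpr, tpr
```

A call such as detection_rates("data/text", my_detector) would then return one (FPR, TPR) pair for the whole directory tree, where my_detector is whatever classifier is under evaluation.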