Principle:Norrrrrrr lyn WAInjectBench JSONL Text Dataset Format

Knowledge Sources	JSON Lines
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 16:00 GMT

Overview

A line-delimited JSON format that stores one text sample per line, together with its identifier, so that prompt injection detection benchmarks can stream the data.

Description

JSONL (JSON Lines) is a text format in which each line is a valid JSON object. For text-based prompt injection detection, each line represents a single text sample with an identifier and the text content. This format allows line-by-line streaming, appending a record without rewriting the file, and easy integration with line-oriented Unix tools such as grep, head, and wc. The WAInjectBench benchmark organizes these files into benign/ and malicious/ subdirectories, so the directory structure itself encodes the ground-truth label.
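
The line-by-line streaming described above can be sketched as follows. This is a minimal illustration, not the benchmark's own loader: the function name iter_jsonl is hypothetical, and skipping blank lines is an assumption made for robustness.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line without loading the whole file.

    Hypothetical helper for illustration; not part of WAInjectBench.
    """
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # assumption: tolerate blank lines, e.g. at end of file
            yield json.loads(line)
```

Because the function is a generator, a caller can process each sample as it is read, which keeps memory use independent of file size.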

Usage

Use this format whenever preparing or consuming text data for the text prompt injection detection pipeline. Each JSONL file in the data/text/benign/ or data/text/malicious/ directory represents one dataset scenario.
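
When preparing data, a new sample can be appended as one line. The sketch below assumes the two-field record described on this page; the helper name append_sample and the example file path are hypothetical.

```python
import json
from pathlib import Path


def append_sample(path: Path, sample_id: int, text: str) -> None:
    """Append one {"id", "text"} record as a single line (hypothetical helper)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"id": sample_id, "text": text}
    with path.open("a", encoding="utf-8") as handle:
        # json.dumps escapes newlines inside the text, so the record stays on one line.
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, append_sample(Path("data/text/benign/scenario_a.jsonl"), 1, "What is the weather today?") would add one benign sample to that scenario file.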

Theoretical Basis

The JSONL schema for text detection is:

# Each line in a .jsonl file:
{"id": int, "text": str}
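
A record can be checked against this schema with a few type tests. The following is a sketch under stated assumptions: parse_record is a hypothetical name, and tolerating extra keys beyond id and text is an assumption, since the page does not say whether they are permitted.

```python
import json


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it matches {"id": int, "text": str}.

    Hypothetical validator for illustration; extra keys are tolerated (assumption).
    """
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    sample_id = record.get("id")
    text = record.get("text")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(sample_id, int) or isinstance(sample_id, bool):
        raise ValueError("'id' must be an integer")
    if not isinstance(text, str):
        raise ValueError("'text' must be a string")
    return record
```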

Directory layout:

data/text/
├── benign/
│   ├── scenario_a.jsonl    # Each line: {"id": 1, "text": "..."}
│   └── scenario_b.jsonl
└── malicious/
    ├── attack_x.jsonl
    └── attack_y.jsonl

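The folder name (benign or malicious) determines the ground-truth label used for metric computation: the false positive rate (FPR) is computed over benign files and the true positive rate (TPR) over malicious files. Files are discovered via folder_path.glob("*.jsonl"), which matches only .jsonl files directly inside each folder.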
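
Discovery and labeling can be sketched as below. Only the glob("*.jsonl") call and the benign/malicious folder names come from this page; the function names, the 0/1 label encoding, and the handling of an empty prediction list are assumptions for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Assumed encoding: 0 for benign, 1 for malicious.
LABELS = {"benign": 0, "malicious": 1}


def discover_files(root: Path) -> List[Tuple[Path, int]]:
    """Return (file, label) pairs, taking the label from the parent folder name."""
    found = []
    for folder_name, label in LABELS.items():
        folder_path = root / folder_name
        if not folder_path.is_dir():
            continue  # assumption: a missing folder simply contributes no files
        for file_path in sorted(folder_path.glob("*.jsonl")):
            found.append((file_path, label))
    return found


def flagged_rate(predictions: Iterable[bool]) -> float:
    """Fraction of samples flagged as injections.

    Over a benign file this fraction is the FPR; over a malicious file it is the TPR.
    """
    predictions = list(predictions)
    if not predictions:
        return 0.0  # assumption: report 0.0 for an empty file
    return sum(predictions) / len(predictions)
```

The same fraction serves both metrics because every sample in a file shares one ground-truth label: a flagged benign sample is a false positive, and a flagged malicious sample is a true positive.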

Related Pages

Implemented By

Implementation:Norrrrrrr_lyn_WAInjectBench_JSONL_Text_Data_Schema
