Principle:Norrrrrrr lyn WAInjectBench Training Data Loading
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Machine_Learning |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A data loading pattern that parses JSONL training files into structured arrays of features and labels for embedding-based classifier training.
Description
Training Data Loading reads JSONL files that contain labeled samples for training binary classifiers. The format differs slightly between text and image modalities:
- Text variant: Each line is {"text": str, "label": int, "source": str}. Returns parallel lists of texts, labels, and sources.
- Image variant: Each line is {"path": str, "label": int}. Returns parallel lists of image paths and labels.
Labels are binary: 1 for malicious (prompt injection), 0 for benign. The load_jsonl function parses either format line by line and skips empty lines.
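The two variants described above can be sketched as follows. This is a minimal illustration, not the repository's actual load_jsonl: the function names load_text_jsonl and load_image_jsonl, their signatures, and the UTF-8 encoding are assumptions made for this example. Only the record fields and the empty-line skipping come from the description.

```python
import json
from pathlib import Path


def load_text_jsonl(path):
    """Hypothetical text-variant loader: returns (texts, labels, sources)."""
    texts, labels, sources = [], [], []
    with Path(path).open(encoding="utf-8") as f:  # encoding is an assumption
        for line in f:
            line = line.strip()
            if not line:  # skip empty lines
                continue
            record = json.loads(line)
            texts.append(record["text"])
            labels.append(int(record["label"]))  # 1 = malicious, 0 = benign
            sources.append(record["source"])
    return texts, labels, sources


def load_image_jsonl(path):
    """Hypothetical image-variant loader: returns (paths, labels)."""
    paths, labels = [], []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip empty lines
                continue
            record = json.loads(line)
            paths.append(record["path"])
            labels.append(int(record["label"]))
    return paths, labels
```

Both functions return parallel lists, so index i in every list refers to the same sample. A missing key raises KeyError in this sketch rather than being silently skipped, which surfaces malformed records early.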
Usage
Use this pattern when preparing data for embedding-based classifier training. It is the first step in both the text embedding and image embedding training pipelines.
Theoretical Basis
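# Data loading pattern for labeled training data
import json

features, labels = [], []
for line in file:  # file: an open JSONL file handle
    if not line.strip():  # skip empty lines
        continue
    data = json.loads(line)
    features.append(data[feature_key])  # feature_key is "text" or "path"
    labels.append(data["label"])  # 0 (benign) or 1 (malicious)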