Principle:Avdvg InjectGuard Malicious Dataset Loading

Knowledge Sources	Prompt Injection Attacks and Defenses InjectGuard
Domains	Data_Engineering, Security, NLP
Last Updated	2026-02-14 16:00 GMT

Overview

A data ingestion technique that loads a curated collection of known malicious prompts from structured storage into a document representation suitable for downstream vector indexing.

Description

Malicious dataset loading is the process of reading a curated corpus of known prompt injection attacks and jailbreak attempts from a structured file format (CSV) and converting each entry into a document object. This corpus serves as the "attack signature database" for vector similarity detection: incoming user prompts are compared against these known attacks to determine if they are malicious.

The quality and coverage of this dataset directly determines the detection system's recall. Key considerations include:

Dataset format: Each row must contain at minimum an identifier and the malicious text. The InjectGuard system expects CSV files with columns id and text.
Coverage: The dataset should include diverse attack categories (direct injection, indirect injection, jailbreak prompts, role-play attacks, encoding-based evasion).
Document wrapping: Raw CSV rows are wrapped into document objects that carry both the text content and metadata, enabling the vector store to index and retrieve them with provenance.

Usage

Use this principle whenever building a signature-based or similarity-based detection system that relies on a known-attack corpus. It is the data foundation step that must precede vector store construction. The dataset must be prepared offline and updated as new attack patterns emerge.

Theoretical Basis

The malicious dataset loading step follows a standard ETL (Extract-Transform-Load) pattern:

Pseudo-code:

# Abstract algorithm for malicious dataset loading
raw_rows = read_csv(file_path)            # Extract
documents = []
for row in raw_rows:
    doc = Document(                        # Transform
        page_content=row["text"],
        metadata={"id": row["id"], "source": file_path}
    )
    documents.append(doc)
# documents are now ready for vector indexing  # Load (next step)

The transformation step is critical: it converts raw tabular data into a document abstraction that carries both content (for embedding) and metadata (for traceability). This separation allows the vector store to return not just similarity scores but also the identity of the matched malicious prompt.

Related Pages

Implemented By

Implementation:Avdvg_InjectGuard_CSVLoader_Load

Uses Heuristic

Heuristic:Avdvg_InjectGuard_Dataset_Coverage_Recall_Bound

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment