Principle:Avdvg InjectGuard Malicious Dataset Loading
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Security, NLP |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A data ingestion technique that loads a curated collection of known malicious prompts from structured storage into a document representation suitable for downstream vector indexing.
Description
Malicious dataset loading is the process of reading a curated corpus of known prompt injection attacks and jailbreak attempts from a structured file format (CSV) and converting each entry into a document object. This corpus serves as the "attack signature database" for vector similarity detection: incoming user prompts are compared against these known attacks to determine if they are malicious.
The quality and coverage of this dataset directly determines the detection system's recall. Key considerations include:
- Dataset format: Each row must contain at minimum an identifier and the malicious text. The InjectGuard system expects CSV files with columns id and text.
- Coverage: The dataset should include diverse attack categories (direct injection, indirect injection, jailbreak prompts, role-play attacks, encoding-based evasion).
- Document wrapping: Raw CSV rows are wrapped into document objects that carry both the text content and metadata, enabling the vector store to index and retrieve them with provenance.
Usage
Use this principle whenever building a signature-based or similarity-based detection system that relies on a known-attack corpus. It is the data foundation step that must precede vector store construction. The dataset must be prepared offline and updated as new attack patterns emerge.
Theoretical Basis
The malicious dataset loading step follows a standard ETL (Extract-Transform-Load) pattern:
Pseudo-code:
# Abstract algorithm for malicious dataset loading
raw_rows = read_csv(file_path) # Extract
documents = []
for row in raw_rows:
doc = Document( # Transform
page_content=row["text"],
metadata={"id": row["id"], "source": file_path}
)
documents.append(doc)
# documents are now ready for vector indexing # Load (next step)
The transformation step is critical: it converts raw tabular data into a document abstraction that carries both content (for embedding) and metadata (for traceability). This separation allows the vector store to return not just similarity scores but also the identity of the matched malicious prompt.