Principle:Liu00222 Open Prompt Injection Text Segmentation

Knowledge Sources	Open-Prompt-Injection
Domains	NLP, Tokenization, Text_Processing
Last Updated	2026-02-14 15:00 GMT

Overview

A hybrid text segmentation technique that combines sentence boundary detection with embedding-based similarity analysis to split text at both syntactic and semantic boundaries.

Description

Text Segmentation in the PromptLocate pipeline splits input text into granular segments for injection boundary detection. It operates in two stages: (1) Sentence splitting using a custom spaCy pipeline with regex-based sentence boundary detection (splitting on periods, exclamation marks, question marks, and double newlines), and (2) Embedding-based splitting that further divides sentences at word boundaries where consecutive word embeddings have low cosine similarity, indicating a semantic discontinuity. This dual approach captures both natural sentence boundaries and unnatural semantic shifts that injections often introduce.

Usage

Use this principle as the first step in the PromptLocate localization pipeline. The quality of segmentation directly affects the precision of injection boundary detection — finer segments enable more accurate localization.

Theoretical Basis

The segmentation combines syntactic and semantic signals:

Pseudo-code Logic:

# Stage 1: Sentence splitting
sentences = spacy_custom_segmenter(text)  # regex: r'(?:[.!?\n]{2,}|[.!?\n])(?:["\']?)'

# Stage 2: Embedding-based splitting
for each sentence:
    words = tokenize(sentence)
    embeddings = get_word_embeddings(words)
    for i in range(len(words)-1):
        sim = cosine_similarity(embeddings[i], embeddings[i+1])
        if sim < threshold:
            split_here()  # Low similarity = potential boundary

# Stage 3: Clean up
segments = merge_empty_segments(segments)

The cosine similarity threshold (default 0.0) determines how aggressively text is split at semantic boundaries.

Related Pages

Implemented By

Implementation:Liu00222_Open_Prompt_Injection_split_sentence

Uses Heuristic

Heuristic:Liu00222_Open_Prompt_Injection_Cosine_Similarity_Segmentation_Threshold

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment