Principle:Liu00222 Open Prompt Injection Text Segmentation
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization, Text_Processing |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
A hybrid text segmentation technique that combines sentence boundary detection with embedding-based similarity analysis to split text at both syntactic and semantic boundaries.
Description
Text Segmentation in the PromptLocate pipeline splits input text into granular segments for injection boundary detection. It operates in two stages: (1) Sentence splitting using a custom spaCy pipeline with regex-based sentence boundary detection (splitting on periods, exclamation marks, question marks, and double newlines), and (2) Embedding-based splitting that further divides sentences at word boundaries where consecutive word embeddings have low cosine similarity, indicating a semantic discontinuity. This dual approach captures both natural sentence boundaries and unnatural semantic shifts that injections often introduce.
Usage
Use this principle as the first step in the PromptLocate localization pipeline. The quality of segmentation directly affects the precision of injection boundary detection — finer segments enable more accurate localization.
Theoretical Basis
The segmentation combines syntactic and semantic signals:
Pseudo-code Logic:
# Stage 1: Sentence splitting
sentences = spacy_custom_segmenter(text) # regex: r'(?:[.!?\n]{2,}|[.!?\n])(?:["\']?)'
# Stage 2: Embedding-based splitting
for each sentence:
words = tokenize(sentence)
embeddings = get_word_embeddings(words)
for i in range(len(words)-1):
sim = cosine_similarity(embeddings[i], embeddings[i+1])
if sim < threshold:
split_here() # Low similarity = potential boundary
# Stage 3: Clean up
segments = merge_empty_segments(segments)
The cosine similarity threshold (default 0.0) determines how aggressively text is split at semantic boundaries.