Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:Liu00222 Open Prompt Injection Text Segmentation

From Leeroopedia
Knowledge Sources
Domains NLP, Tokenization, Text_Processing
Last Updated 2026-02-14 15:00 GMT

Overview

A hybrid text segmentation technique that combines sentence boundary detection with embedding-based similarity analysis to split text at both syntactic and semantic boundaries.

Description

Text Segmentation in the PromptLocate pipeline splits input text into granular segments for injection boundary detection. It operates in two stages: (1) Sentence splitting using a custom spaCy pipeline with regex-based sentence boundary detection (splitting on periods, exclamation marks, question marks, and double newlines), and (2) Embedding-based splitting that further divides sentences at word boundaries where consecutive word embeddings have low cosine similarity, indicating a semantic discontinuity. This dual approach captures both natural sentence boundaries and unnatural semantic shifts that injections often introduce.

Usage

Use this principle as the first step in the PromptLocate localization pipeline. The quality of segmentation directly affects the precision of injection boundary detection — finer segments enable more accurate localization.

Theoretical Basis

The segmentation combines syntactic and semantic signals:

Pseudo-code Logic:

# Stage 1: Sentence splitting
sentences = spacy_custom_segmenter(text)  # regex: r'(?:[.!?\n]{2,}|[.!?\n])(?:["\']?)'

# Stage 2: Embedding-based splitting
for each sentence:
    words = tokenize(sentence)
    embeddings = get_word_embeddings(words)
    for i in range(len(words)-1):
        sim = cosine_similarity(embeddings[i], embeddings[i+1])
        if sim < threshold:
            split_here()  # Low similarity = potential boundary

# Stage 3: Clean up
segments = merge_empty_segments(segments)

The cosine similarity threshold (default 0.0) determines how aggressively text is split at semantic boundaries.

Related Pages

Implemented By

Uses Heuristic

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment