Heuristic:Liu00222 Open Prompt Injection Cosine Similarity Segmentation Threshold
| Knowledge Sources | |
|---|---|
| Domains | NLP, Security, Optimization |
| Last Updated | 2026-02-14 15:30 GMT |
Overview
Text segmentation for injection localization uses a cosine similarity threshold of 0.0 between consecutive word embeddings to determine segment boundaries.
Description
The PromptLocate defense splits input text into segments before performing binary search injection localization. The segmentation uses a two-stage approach: first, spaCy sentence segmentation with a custom regex-based sentence boundary detector; second, within each sentence, consecutive word embeddings are compared using cosine similarity, and segment boundaries are placed where similarity drops below a threshold. The default threshold of 0.0 means that segments are only split when consecutive word embeddings form an obtuse angle (negative cosine similarity, i.e., more than 90 degrees apart), preserving most sentence-level coherence.
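The word-level split rule can be sketched independently of the PromptLocate code. The snippet below is a toy illustration, not the actual implementation: `split_by_cosine` is a hypothetical helper, and the embeddings are hand-picked 2-D vectors rather than model embeddings.

```python
import numpy as np

def split_by_cosine(words, embeddings, sep_thres=0.0):
    """Split `words` wherever cos(e_i, e_{i+1}) < sep_thres (toy sketch)."""
    segments, current = [], [words[0]]
    for i in range(len(words) - 1):
        a, b = embeddings[i], embeddings[i + 1]
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if cos < sep_thres:
            # Consecutive embeddings are "semantically opposed": start a new segment.
            segments.append(current)
            current = []
        current.append(words[i + 1])
    segments.append(current)
    return segments

# Hand-picked vectors: a sharp direction change between "weather" and "ignore".
words = ["the", "weather", "ignore", "previous"]
embs = np.array([[1.0, 0.1], [0.9, 0.2], [-1.0, 0.0], [-0.8, -0.1]])
print(split_by_cosine(words, embs))  # [['the', 'weather'], ['ignore', 'previous']]
```

With real input-embedding vectors the cosine values are rarely this extreme, which is why a threshold of 0.0 splits only at unusually sharp semantic breaks.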
Usage
Use this heuristic when configuring or debugging PromptLocate segmentation. The threshold is set via the `sep_thres` parameter in the `PromptLocate` constructor (default 0.0). Increasing the threshold creates more, smaller segments (finer granularity but more binary search queries). Decreasing it below 0.0 creates fewer, larger segments (coarser granularity but fewer queries).
The Insight (Rule of Thumb)
- Action: Set `sep_thres=0.0` (default) in the `PromptLocate` constructor for balanced segmentation.
- Value: Threshold of 0.0 splits only where consecutive embeddings have negative cosine similarity. This is the default used in the paper.
- Trade-off: Higher threshold = more segments = more precise localization but more detector queries (higher latency). Lower threshold = fewer segments = coarser localization but fewer queries.
- Embeddings source: Uses the base model's input embedding layer (`base_model.get_input_embeddings()`), not a separate embedding model.
- Empty segment handling: After splitting, empty segments are merged with their preceding segment via `merge_empty_segments()`.
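The body of `merge_empty_segments()` is not shown in the excerpt; a minimal sketch consistent with the described behavior (an empty segment is absorbed into its predecessor, which for strings amounts to dropping it) might look like:

```python
def merge_empty_segments(segments):
    """Sketch only: merge empty segments into the preceding segment.
    The real PromptLocate helper's body is not shown in the excerpt."""
    merged = []
    for seg in segments:
        if seg or not merged:
            # Keep non-empty segments; keep a leading empty segment
            # since it has no predecessor to merge into.
            merged.append(seg)
        # Otherwise the empty segment is absorbed by its predecessor.
    return merged

print(merge_empty_segments(["Hello.", "", "Ignore prior instructions."]))
# ['Hello.', 'Ignore prior instructions.']
```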
Reasoning
From `apps/PromptLocate.py:246-250` (initialization with default threshold):
```python
class PromptLocate:
    def __init__(self, model_config, helper_model_name="gpt2", sep_thres=0.0):
        self.bd = DataSentinelDetector(model_config)
        self.bd.model.tokenizer.pad_token = self.bd.model.tokenizer.eos_token
        self.embedding_layer = self.bd.model.base_model.get_input_embeddings()
        self.sep_thres = sep_thres
```
From `apps/PromptLocate.py:188-197` (cosine similarity computation and splitting):
```python
inputs = tokenizer(words, return_tensors="pt", padding=True, truncation=True).to("cuda")
input_embeds = embedding_layer(inputs["input_ids"])  # [batch_size, seq_len, hidden_dim]
embeddings = input_embeds.mean(dim=1).detach().cpu().numpy()
cos_sim = np.array([
    np.dot(embeddings[i], embeddings[i + 1]) /
    (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1]))
    for i in range(len(embeddings) - 1)
])
split_positions = np.where(cos_sim < thres)[0]
```
The threshold acts as a sensitivity dial: at 0.0, only semantically opposed consecutive words trigger a split, keeping segments relatively large and the binary search efficient.
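A small numeric check makes the sensitivity-dial behavior concrete. The vectors below are hypothetical, chosen so one pair has cosine similarity 0.6 and the other has a negative value:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([0.6, 0.8])   # cos(a, b) = 0.6: no split at sep_thres=0.0,
                           # but a split if sep_thres were raised to 0.7
c = np.array([-1.0, 0.2])  # cos(b, c) < 0: a split even at the default 0.0

print(cos(a, b), cos(b, c))
```

Raising `sep_thres` toward 1.0 would split at the (a, b) boundary too, yielding more, smaller segments and more detector queries during the binary search.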