Heuristic:Lakeraai Pint benchmark Chunking Stride 25 Percent Overlap
| Knowledge Sources | |
|---|---|
| Domains | NLP, Optimization, Prompt_Injection |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
Tokenization overlap strategy using a 25% stride when chunking long inputs to prevent missed prompt injection detections at chunk boundaries.
Description
When evaluating prompts that exceed a model's maximum token length, the input must be split into multiple chunks. Naive chunking (no overlap) risks splitting a prompt injection payload across two chunks, causing both chunks to appear benign individually. The PINT Benchmark addresses this by using a sliding window with a 25% overlap stride: each successive chunk starts 75% of max_length tokens ahead of the previous one, so 25% of each chunk's content overlaps with the next.
This strategy is combined with an any-positive aggregation: if any chunk is classified as injection, the entire input is flagged as positive. Together, the overlap and aggregation ensure that injections embedded deep within long documents or positioned near chunk boundaries are reliably detected.
Usage
Apply this heuristic when evaluating prompt injection detection models on inputs that may exceed the model's context window (typically 512 tokens). It is built into the HuggingFaceModelEvaluation._chunk_input() method and applies automatically to all models evaluated through the PINT Benchmark's HuggingFace evaluation utility.
The Insight (Rule of Thumb)
- Action: Set the
strideparameter to 25% ofmax_lengthwhen calling the tokenizer withreturn_overflowing_tokens=True. - Value:
stride = math.floor(max_length / 4). For a typicalmax_length=512, the stride is 128 tokens. - Trade-off: Increases the number of chunks (and therefore inference calls) compared to non-overlapping chunking. For a 2048-token input with max_length=512 and stride=128: non-overlapping produces 4 chunks, while 25% overlap produces approximately 5 chunks (~25% more inference).
- Aggregation: Use
any()over chunk predictions — if any chunk is flagged as injection, the entire input is positive.
Reasoning
Prompt injection payloads embedded in long documents (e.g., a system prompt followed by thousands of tokens of context, with an injection near the end) may land exactly at the boundary between two chunks. Without overlap, the payload could be split in half across chunks, with neither half containing enough signal for the classifier to detect it.
The 25% overlap value is a practical balance:
- Too little overlap (e.g., 5-10%): Risk of still missing boundary injections, especially for payloads spanning many tokens.
- Too much overlap (e.g., 50%): Doubles the number of chunks, significantly increasing inference time without proportional detection improvement.
- 25% overlap: Provides a safety margin of 128 tokens (at max_length=512) while keeping the additional compute cost moderate.
The comment in the source code explicitly states the rationale: "add overlap of 25% of the max length to make sure we aren't missing prompt injections at the edge of the model's context length when chunking."
Code Evidence
Chunking stride logic from benchmark/utils/evaluate_hugging_face_model.py:77-88:
tokenized = self.tokenizer(
prompt,
max_length=self.max_length,
truncation=True,
return_overflowing_tokens=True,
padding=True,
return_tensors="pt",
add_special_tokens=False,
# add overlap of 25% of the max length to make sure we aren't
# missing prompt injections at the edge of the model's context
# length when chunking
stride=math.floor(self.max_length / 4),
)
Any-positive aggregation from benchmark/utils/evaluate_hugging_face_model.py:135-148:
def evaluate(self, prompt: str) -> bool:
tokenized_chunks = self._chunk_input(prompt=prompt)
evaluations = self._evaluate_chunks(tokenized_chunks=tokenized_chunks)
if self.is_setfit:
return any(evaluation == 1 for evaluation in evaluations)
else:
return any(
[
evaluation["label"] == self.injection_label
for evaluation in evaluations
]
)