# Heuristic: ProtectAI LLM Guard TokenLimit Early Guard
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Security |
| Last Updated | 2026-02-14 12:00 GMT |
## Overview
Pipeline ordering heuristic: place the TokenLimit scanner first in the pipeline to reject oversized prompts before wasting compute on expensive ML-based scanners.
## Description
The TokenLimit scanner uses tiktoken (a fast BPE tokenizer) to count tokens and truncate prompts that exceed a configured limit. It is computationally cheap compared to ML-based scanners like PromptInjection or Toxicity, which require full transformer inference. Placing TokenLimit early in the scanner pipeline acts as a gatekeeper that prevents oversized inputs from consuming expensive compute resources downstream.
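To make the count-and-truncate behavior concrete, here is a minimal stdlib-only sketch. It uses whitespace splitting as a stand-in for tiktoken's `cl100k_base` BPE encoding, so the function name and token counts are illustrative assumptions, not the real TokenLimit implementation:

```python
def count_and_truncate(prompt: str, limit: int = 4096) -> tuple[str, bool, int]:
    """Return (possibly truncated prompt, fits_limit, token_count).

    Whitespace splitting stands in for tiktoken's cl100k_base BPE
    encoding; real token counts will differ.
    """
    tokens = prompt.split()
    if len(tokens) <= limit:
        return prompt, True, len(tokens)
    # Oversized: keep only the first `limit` tokens, mirroring how the
    # real scanner returns the first chunk of a too-long prompt.
    return " ".join(tokens[:limit]), False, len(tokens)
```

The cheap path here is a single linear pass over the input, which is the whole point of running it before any transformer inference.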
## Usage
Use this heuristic when designing scanner pipelines that include both lightweight scanners (TokenLimit, BanSubstrings, Regex, InvisibleText) and heavyweight ML-based scanners (PromptInjection, Toxicity, BanTopics, Anonymize). Order cheap scanners first, especially when fail_fast is enabled.
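The ordering can be sketched as follows. The scanner signature `prompt -> (sanitized_prompt, is_valid, risk_score)` mirrors llm_guard's `scan()` interface, but the pipeline runner and the stub scanners below are hypothetical, stdlib-only stand-ins:

```python
from typing import Callable

# A scanner maps prompt -> (sanitized_prompt, is_valid, risk_score),
# mirroring llm_guard's scan() signature; everything else is a stub.
Scanner = Callable[[str], tuple[str, bool, float]]

def cheap_token_limit(limit: int) -> Scanner:
    def scan(prompt: str) -> tuple[str, bool, float]:
        n = len(prompt.split())  # stand-in for tiktoken BPE counting
        return (prompt, True, -1.0) if n <= limit else (prompt, False, 1.0)
    return scan

def expensive_ml_scanner(calls: list[str]) -> Scanner:
    def scan(prompt: str) -> tuple[str, bool, float]:
        calls.append("ml")  # record that the costly model actually ran
        return prompt, True, 0.0
    return scan

def run_pipeline(scanners: list[Scanner], prompt: str, fail_fast: bool = True) -> bool:
    for scan in scanners:
        prompt, ok, _score = scan(prompt)
        if not ok and fail_fast:
            return False  # rejected before later scanners ever run
    return True
```

With the cheap scanner first and `fail_fast=True`, an oversized prompt short-circuits the pipeline and the expensive scanner is never invoked; with the order reversed, the ML pass runs before the prompt is rejected.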
## The Insight (Rule of Thumb)
- Action: Place TokenLimit, BanSubstrings, Regex, InvisibleText, and Secrets scanners before PromptInjection, Toxicity, BanTopics, Anonymize, and other ML-based scanners in the pipeline configuration.
- Value: The TokenLimit default is 4096 tokens using the `cl100k_base` encoding.
- Trade-off: None. This is a pure optimization with no accuracy impact, since scanner order does not affect individual scanner results.
- Combination: Pairs well with `fail_fast=True` for maximum latency reduction on invalid inputs.
## Reasoning
ML-based scanners run transformer models with O(n^2) attention complexity relative to input length. Processing a 100K-token prompt through PromptInjection (DeBERTa, max_length=512) is wasteful if the prompt will be rejected anyway for exceeding the token limit. The TokenLimit scanner uses tiktoken's O(n) BPE encoding which is orders of magnitude faster than a transformer forward pass.
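The asymptotic gap can be illustrated with a toy cost model. The operation counts below are illustrative assumptions (they ignore the model's `max_length` truncation and constant factors), not benchmarks:

```python
def bpe_cost(n: int) -> int:
    # Linear: BPE-style encoding touches each token roughly once (toy model).
    return n

def attention_cost(n: int, layers: int = 12) -> int:
    # Quadratic: self-attention compares every token pair in every layer
    # (toy model, assuming the transformer attended to the full prompt).
    return layers * n * n

n = 100_000
ratio = attention_cost(n) // bpe_cost(n)
# In this toy model, the quadratic pass over a 100K-token prompt costs
# about 1.2 million times the linear count, so rejecting the prompt at
# the token-counting stage avoids nearly all of that work.
```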
```python
# From llm_guard/input_scanners/token_limit.py:61-80
def scan(self, prompt: str) -> tuple[str, bool, float]:
    if prompt.strip() == "":
        return prompt, True, -1.0

    chunks, num_tokens = self._split_text_on_tokens(text=prompt)
    if num_tokens < self._limit:
        LOGGER.debug("Prompt fits the maximum tokens", ...)
        return prompt, True, -1.0

    LOGGER.warning("Prompt is too big. Splitting into chunks", ...)
    return chunks[0], False, 1.0
```