Principle: ProtectAI LLM Guard Refusal Detection
| Knowledge Sources | |
|---|---|
| Domains | NLP, Quality_Assurance, Text_Classification |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
A binary text-classification technique that detects when a large language model (LLM) refuses to answer a query, indicating that the prompt may have triggered internal safety filters or policy restrictions.
Description
LLMs frequently refuse to answer certain queries by generating formulaic responses such as "I'm sorry, I cannot help with that" or "As an AI language model, I cannot...". Detecting these refusals is important for:
- Quality monitoring: Tracking how often the LLM refuses legitimate requests.
- Prompt refinement: Identifying prompts that need rephrasing to avoid triggering false-positive safety filters.
- Pipeline control: Flagging outputs that did not produce useful content.
The detection uses a fine-tuned classification model (default: distilroberta-base-rejection-v1) trained on examples of LLM refusal patterns. A lightweight alternative (NoRefusalLight) uses substring matching against known refusal phrases.
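As a rough illustration of the lightweight variant, the sketch below checks an output against a small list of refusal phrases. The phrase list and function name are illustrative only and are not the scanner's actual implementation.

```python
# Illustrative sketch of substring-based refusal detection (not the actual
# NoRefusalLight implementation); the phrase list here is an example only.
REFUSAL_PHRASES = [
    "i'm sorry, i cannot",
    "i cannot help with that",
    "as an ai language model",
    "i am unable to assist",
]

def contains_refusal(output: str) -> bool:
    """Return True if the output contains a known refusal phrase."""
    lowered = output.lower()
    return any(phrase in lowered for phrase in REFUSAL_PHRASES)

print(contains_refusal("I'm sorry, I cannot help with that."))  # True
print(contains_refusal("Here is the summary you asked for."))   # False
```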
Usage
Use this principle in output-scanning pipelines to detect when the LLM has refused to answer. It is typically applied after the LLM call to determine whether the response contains useful content or the prompt needs adjustment, as in the sketch below.
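A minimal sketch of such a pipeline step using llm-guard's NoRefusal output scanner, assuming its scan(prompt, output) interface returns a sanitized output, a validity flag, and a risk score; the threshold value and example strings are illustrative.

```python
# Minimal sketch using llm-guard's NoRefusal output scanner; assumes the
# scan(prompt, output) interface returning (sanitized_output, is_valid, risk_score).
from llm_guard.output_scanners import NoRefusal

scanner = NoRefusal(threshold=0.5)  # illustrative threshold

prompt = "Summarize the attached report."
model_output = "I'm sorry, I cannot help with that."

sanitized_output, is_valid, risk_score = scanner.scan(prompt, model_output)

if not is_valid:
    # The model refused: flag the output and consider rephrasing the prompt.
    print(f"Refusal detected (risk score {risk_score:.2f}); prompt may need adjustment.")
else:
    print("Response passed the refusal check.")
```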
Theoretical Basis
```python
# Pseudocode for refusal detection
def detect_refusal(output, classifier, match_type, threshold):
    # Split the output into segments (e.g. the full text or individual sentences).
    segments = match_type.get_inputs(output)
    # Run the rejection classifier on every segment.
    results = classifier(segments)
    for result in results:
        # Invert the score when the predicted label is not REJECTION.
        rejection_score = result["score"] if result["label"] == "REJECTION" else 1 - result["score"]
        if rejection_score > threshold:
            return REFUSAL_DETECTED
    return NO_REFUSAL
```
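For a concrete version of the classification step, the sketch below uses the Hugging Face transformers pipeline with the ProtectAI/distilroberta-base-rejection-v1 model; the full model path and the 0.5 threshold are assumptions based on the default mentioned above, not confirmed library settings.

```python
# Runnable sketch of the classification step; the model path and threshold
# are assumptions based on the default model named in the Description.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="ProtectAI/distilroberta-base-rejection-v1",
)

def is_refusal(output: str, threshold: float = 0.5) -> bool:
    """Return True if the output is classified as a refusal."""
    result = classifier(output, truncation=True)[0]
    score = result["score"] if result["label"] == "REJECTION" else 1 - result["score"]
    return score > threshold

print(is_refusal("As an AI language model, I cannot assist with that request."))  # expected True
print(is_refusal("The capital of France is Paris."))                              # expected False
```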