Implementation: ProtectAI LLM Guard NoRefusal
| Knowledge Sources | |
|---|---|
| Domains | NLP, Quality_Assurance, Text_Classification |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Concrete tool for detecting LLM refusal patterns in output text using a fine-tuned DistilRoBERTa classification model, provided by the LLM Guard library.
Description
The NoRefusal class is an output scanner that detects when an LLM has refused to answer a query. It uses the ProtectAI/distilroberta-base-rejection-v1 model for binary classification (REJECTION vs. non-rejection), supports full-text and sentence-level matching, and can optionally run on the ONNX runtime. A lightweight alternative, NoRefusalLight, uses substring matching against 27 known refusal phrases instead of a model.
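The substring-matching approach of the lightweight variant can be sketched in pure Python. This is a minimal illustration, not the library's implementation: the phrase list below is a hypothetical sample, not the actual 27 phrases shipped with NoRefusalLight.

```python
# Illustrative sketch of NoRefusalLight-style substring matching.
# These phrases are examples only; the library ships its own list of 27.
REFUSAL_PHRASES = [
    "i'm sorry, but i cannot",
    "i cannot assist with",
    "as an ai language model, i can't",
]

def is_refusal(output: str) -> bool:
    """Return True if any known refusal phrase occurs in the output."""
    lowered = output.lower()
    return any(phrase in lowered for phrase in REFUSAL_PHRASES)

print(is_refusal("I'm sorry, but I cannot help with that request."))  # True
print(is_refusal("Here is a simple recipe for chocolate cake..."))    # False
```

Substring matching trades recall for speed: it misses novel refusal phrasings but needs no model inference.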
Usage
Import this scanner to detect refusal patterns in LLM outputs. Place it in the output scanner pipeline to flag non-useful responses.
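How a pipeline invokes scanners like this one can be sketched with stub implementations. The Scanner type and run_pipeline helper below are simplified stand-ins for illustration, not the library's API; only the (output, is_valid, risk_score) return shape is taken from the source.

```python
from typing import Callable

# Simplified stand-in for the scanner interface: each scanner maps
# (prompt, output) to (output, is_valid, risk_score).
Scanner = Callable[[str, str], tuple[str, bool, float]]

def stub_no_refusal(prompt: str, output: str) -> tuple[str, bool, float]:
    """Toy refusal check used here in place of the real model-backed scanner."""
    refused = "cannot" in output.lower()
    return output, not refused, 1.0 if refused else 0.0

def run_pipeline(scanners: list[Scanner], prompt: str, output: str) -> tuple[str, bool]:
    """Run each scanner in order; flag the output if any scanner rejects it."""
    all_valid = True
    for scan in scanners:
        output, is_valid, _ = scan(prompt, output)
        all_valid = all_valid and is_valid
    return output, all_valid

_, ok = run_pipeline([stub_no_refusal], "Q?", "I cannot help with that.")
print(ok)  # False
```

Flagged outputs can then be retried, rewritten, or replaced with a fallback message by the calling application.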
Code Reference
Source Location
- Repository: llm-guard
- File: llm_guard/output_scanners/no_refusal.py
- Lines: L39-107
Signature
```python
class NoRefusal(Scanner):
    def __init__(
        self,
        *,
        model: Model | None = None,
        threshold: float = 0.75,
        match_type: MatchType | str = MatchType.FULL,
        use_onnx: bool = False,
    ) -> None:
        """
        Args:
            model: HuggingFace model for classification. Default: distilroberta-base-rejection-v1.
            threshold: Rejection score threshold. Default: 0.75.
            match_type: FULL or SENTENCE level matching. Default: FULL.
            use_onnx: Use ONNX runtime. Default: False.
        """

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]:
        """
        Detect refusal patterns in output.

        Returns:
            - Original output (unmodified)
            - False if refusal detected, True otherwise
            - Risk score normalized against threshold
        """
```
Import
```python
from llm_guard.output_scanners import NoRefusal
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | Model | No | HuggingFace model (default: distilroberta-base-rejection-v1) |
| threshold | float | No | Rejection score threshold (default: 0.75) |
| match_type | MatchType or str | No | FULL or SENTENCE (default: FULL) |
| use_onnx | bool | No | Use ONNX runtime (default: False) |
| prompt | str | Yes (scan) | Original prompt |
| output | str | Yes (scan) | LLM output to check |
Outputs
| Name | Type | Description |
|---|---|---|
| output | str | Original output (unmodified) |
| is_valid | bool | False if refusal detected above threshold |
| risk_score | float | Normalized rejection confidence score |
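The exact normalization formula is internal to the library, but the idea of scaling a raw rejection confidence against the configured threshold can be sketched as follows. The normalized_risk function and its two-band scheme are a hypothetical illustration; the library's actual formula may differ.

```python
def normalized_risk(score: float, threshold: float) -> float:
    """Hypothetical normalization: raw scores at or below the threshold map
    into a benign band (0.0-0.5); scores above it scale toward 1.0.
    Illustrative only; not the library's actual formula."""
    if score <= threshold:
        return round(score / threshold * 0.5, 2)
    return round(0.5 + (score - threshold) / (1 - threshold) * 0.5, 2)

print(normalized_risk(0.90, 0.75))  # 0.8
print(normalized_risk(0.30, 0.75))  # 0.2
```

Normalizing against the threshold lets callers compare risk scores across scanners configured with different thresholds.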
Usage Examples
Basic Refusal Detection
```python
from llm_guard.output_scanners import NoRefusal

scanner = NoRefusal(threshold=0.75)

prompt = "How do I bake a cake?"
output = "Here is a simple recipe for chocolate cake..."
_, is_valid, _ = scanner.scan(prompt, output)
# is_valid: True

refusal_output = "I'm sorry, but I cannot help with that request."
_, is_valid, score = scanner.scan(prompt, refusal_output)
# is_valid: False
```
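With match_type set to SENTENCE, the scanner scores individual sentences rather than the whole text, which catches partial refusals buried in an otherwise useful answer. The sketch below illustrates that idea with a regex sentence splitter and a stub classifier; both are assumptions for illustration, since the library's actual splitting and aggregation logic may differ.

```python
import re

def sentence_rejection_score(output: str, classify) -> float:
    """Sentence-level matching sketch: split the output into sentences,
    score each, and take the maximum rejection score."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s]
    return max((classify(s) for s in sentences), default=0.0)

def stub_classify(sentence: str) -> float:
    """Stand-in for the DistilRoBERTa rejection classifier."""
    return 0.95 if "cannot" in sentence.lower() else 0.05

mixed = "Here is part of the answer. However, I cannot discuss the rest."
print(sentence_rejection_score(mixed, stub_classify))  # 0.95
```

Full-text matching would average the refusal signal out over the whole response; taking the per-sentence maximum keeps such mixed outputs above the threshold.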