Implementation:Protectai Llm guard NoRefusal

From Leeroopedia
Knowledge Sources
Domains NLP, Quality_Assurance, Text_Classification
Last Updated 2026-02-14 12:00 GMT

Overview

A concrete tool for detecting LLM refusal patterns in output text using a fine-tuned DistilRoBERTa classification model, provided by the LLM Guard library.

Description

The NoRefusal class is an output scanner that detects when an LLM has refused to answer a query. It uses the ProtectAI/distilroberta-base-rejection-v1 model for binary classification (REJECTION vs. non-rejection), supports both full-text and sentence-level matching, and can optionally run on the ONNX runtime. A lightweight alternative, NoRefusalLight, uses substring matching against 27 known refusal phrases instead of a model.

Usage

Import this scanner to detect refusal patterns in LLM outputs. Place it in the output scanner pipeline to flag non-useful responses.
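To illustrate where the scanner sits, here is a minimal, self-contained sketch of an output-scanner pipeline. The stub scanner and pipeline function below are illustrative stand-ins (not part of the LLM Guard API); they only mimic the `scan(prompt, output) -> (output, is_valid, risk_score)` contract that NoRefusal follows.

```python
# Illustrative sketch of an output-scanner pipeline. StubRefusalScanner
# stands in for NoRefusal; both return the same (output, is_valid, risk) triple.

class StubRefusalScanner:
    """Hypothetical stand-in for NoRefusal, using a single marker phrase."""

    REFUSAL_MARKER = "i cannot help"

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]:
        refused = self.REFUSAL_MARKER in output.lower()
        # The output is returned unmodified; risk is 1.0 on a detected refusal.
        return output, not refused, 1.0 if refused else 0.0


def run_output_pipeline(scanners, prompt, output):
    """Run every scanner in order; flag the response if any scanner rejects it."""
    valid, max_risk = True, 0.0
    for scanner in scanners:
        output, is_valid, risk = scanner.scan(prompt, output)
        valid = valid and is_valid
        max_risk = max(max_risk, risk)
    return output, valid, max_risk


pipeline = [StubRefusalScanner()]
_, ok, risk = run_output_pipeline(pipeline, "Bake a cake?", "Sure! Preheat the oven...")
print(ok, risk)   # True 0.0
_, ok, risk = run_output_pipeline(pipeline, "Bake a cake?", "I cannot help with that.")
print(ok, risk)   # False 1.0
```

In practice, a flagged response (is_valid False) would be dropped, regenerated, or replaced with a fallback message before reaching the user.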

Code Reference

Source Location

  • Repository: llm-guard
  • File: llm_guard/output_scanners/no_refusal.py
  • Lines: L39-107

Signature

class NoRefusal(Scanner):
    def __init__(
        self,
        *,
        model: Model | None = None,
        threshold: float = 0.75,
        match_type: MatchType | str = MatchType.FULL,
        use_onnx: bool = False,
    ) -> None:
        """
        Args:
            model: HuggingFace model for classification. Default: distilroberta-base-rejection-v1.
            threshold: Rejection score threshold. Default: 0.75.
            match_type: FULL or SENTENCE level matching. Default: FULL.
            use_onnx: Use ONNX runtime. Default: False.
        """

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]:
        """
        Detect refusal patterns in output.

        Returns:
            - Original output (unmodified)
            - False if refusal detected, True otherwise
            - Risk score normalized against threshold
        """

Import

from llm_guard.output_scanners import NoRefusal

I/O Contract

Inputs

Name | Type | Required | Description
model | Model | No | HuggingFace model (default: distilroberta-base-rejection-v1)
threshold | float | No | Rejection score threshold (default: 0.75)
match_type | MatchType or str | No | FULL or SENTENCE (default: FULL)
use_onnx | bool | No | Use ONNX runtime (default: False)
prompt | str | Yes (scan) | Original prompt
output | str | Yes (scan) | LLM output to check

Outputs

Name | Type | Description
output | str | Original output (unmodified)
is_valid | bool | False if refusal detected above threshold
risk_score | float | Normalized rejection confidence score
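The decision rule in the table above can be sketched as a pure function. This is a simplified illustration of the contract, not LLM Guard's implementation; the exact normalization LLM Guard applies to the risk score may differ.

```python
def interpret(rejection_score: float, threshold: float = 0.75) -> tuple[bool, float]:
    """Map a classifier rejection score to (is_valid, risk_score).

    Simplified illustration of the I/O contract: the output is flagged
    (is_valid=False) when the rejection score exceeds the threshold.
    Assumption: the risk score is reported on a 0-1 scale, here taken
    as the rounded raw score; LLM Guard's normalization may differ.
    """
    is_valid = rejection_score <= threshold
    risk_score = round(rejection_score, 2)
    return is_valid, risk_score


print(interpret(0.92))  # (False, 0.92) -- refusal detected
print(interpret(0.10))  # (True, 0.1)  -- response accepted
```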

Usage Examples

Basic Refusal Detection

from llm_guard.output_scanners import NoRefusal

scanner = NoRefusal(threshold=0.75)

prompt = "How do I bake a cake?"
output = "Here is a simple recipe for chocolate cake..."
_, is_valid, _ = scanner.scan(prompt, output)
# is_valid: True

refusal_output = "I'm sorry, but I cannot help with that request."
_, is_valid, score = scanner.scan(prompt, refusal_output)
# is_valid: False
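The lightweight NoRefusalLight variant mentioned above avoids the model entirely and relies on plain substring matching. A self-contained sketch of that approach follows; the phrase list here is a small illustrative sample, not the actual list of 27 refusal phrases shipped with LLM Guard.

```python
# Illustrative substring-based refusal check, mimicking the approach of
# NoRefusalLight. The phrase list is a small demonstration sample; the real
# scanner ships its own curated list of 27 known refusal phrases.

SAMPLE_REFUSAL_PHRASES = [
    "i'm sorry",
    "i cannot",
    "i can't assist",
    "as an ai",
]


def is_refusal_light(output: str) -> bool:
    """Return True if the output contains any known refusal phrase."""
    lowered = output.lower()
    return any(phrase in lowered for phrase in SAMPLE_REFUSAL_PHRASES)


print(is_refusal_light("Here is a simple recipe for chocolate cake..."))   # False
print(is_refusal_light("I'm sorry, but I cannot help with that request."))  # True
```

The trade-off is the usual one: substring matching is fast and dependency-free but misses paraphrased refusals that a classifier would catch.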

Related Pages

Implements Principle

Requires Environment
