Implementation:Protectai Llm guard NoRefusal

From Leeroopedia
Knowledge Sources
Domains NLP, Quality_Assurance, Text_Classification
Last Updated 2026-02-14 12:00 GMT

Overview

A concrete tool for detecting LLM refusal patterns in output text using a fine-tuned DistilRoBERTa classification model, provided by the LLM Guard library.

Description

The NoRefusal class is an output scanner that detects when an LLM has refused to answer a query. It uses the ProtectAI/distilroberta-base-rejection-v1 model for binary classification (REJECTION vs. non-rejection), supports both full-text and sentence-level matching, and can optionally run on the ONNX runtime. A lightweight alternative, NoRefusalLight, uses substring matching against 27 known refusal phrases instead of a model.

Usage

Import this scanner to detect refusal patterns in LLM outputs. Place it in the output scanner pipeline to flag non-useful responses.
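To illustrate where the scanner sits, here is a minimal, self-contained sketch of an output-scanner pipeline. The stub scanner and pipeline function below are illustrative stand-ins (not part of the LLM Guard API); they only mimic the `scan(prompt, output) -> (output, is_valid, risk_score)` contract that NoRefusal follows.

```python
# Illustrative sketch of an output-scanner pipeline. StubRefusalScanner
# stands in for NoRefusal; both return the same (output, is_valid, risk) triple.

class StubRefusalScanner:
    """Hypothetical stand-in for NoRefusal, using a single marker phrase."""

    REFUSAL_MARKER = "i cannot help"

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]:
        refused = self.REFUSAL_MARKER in output.lower()
        # The output is returned unmodified; risk is 1.0 on a detected refusal.
        return output, not refused, 1.0 if refused else 0.0


def run_output_pipeline(scanners, prompt, output):
    """Run every scanner in order; flag the response if any scanner rejects it."""
    valid, max_risk = True, 0.0
    for scanner in scanners:
        output, is_valid, risk = scanner.scan(prompt, output)
        valid = valid and is_valid
        max_risk = max(max_risk, risk)
    return output, valid, max_risk


pipeline = [StubRefusalScanner()]
_, ok, risk = run_output_pipeline(pipeline, "Bake a cake?", "Sure! Preheat the oven...")
print(ok, risk)   # True 0.0
_, ok, risk = run_output_pipeline(pipeline, "Bake a cake?", "I cannot help with that.")
print(ok, risk)   # False 1.0
```

In practice, a flagged response (is_valid False) would be dropped, regenerated, or replaced with a fallback message before reaching the user.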

Code Reference

Source Location

  • Repository: llm-guard
  • File: llm_guard/output_scanners/no_refusal.py
  • Lines: L39-107

Signature

class NoRefusal(Scanner):
    def __init__(
        self,
        *,
        model: Model | None = None,
        threshold: float = 0.75,
        match_type: MatchType | str = MatchType.FULL,
        use_onnx: bool = False,
    ) -> None:
        """
        Args:
            model: HuggingFace model for classification. Default: distilroberta-base-rejection-v1.
            threshold: Rejection score threshold. Default: 0.75.
            match_type: FULL or SENTENCE level matching. Default: FULL.
            use_onnx: Use ONNX runtime. Default: False.
        """

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]:
        """
        Detect refusal patterns in output.

        Returns:
            - Original output (unmodified)
            - False if refusal detected, True otherwise
            - Risk score normalized against threshold
        """

Import

from llm_guard.output_scanners import NoRefusal

I/O Contract

Inputs

Name | Type | Required | Description
model | Model | No | HuggingFace model (default: distilroberta-base-rejection-v1)
threshold | float | No | Rejection score threshold (default: 0.75)
match_type | MatchType or str | No | FULL or SENTENCE (default: FULL)
use_onnx | bool | No | Use ONNX runtime (default: False)
prompt | str | Yes (scan) | Original prompt
output | str | Yes (scan) | LLM output to check

Outputs

Name | Type | Description
output | str | Original output (unmodified)
is_valid | bool | False if refusal detected above threshold
risk_score | float | Normalized rejection confidence score
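The decision rule in the table above can be sketched as a pure function. This is a simplified illustration of the contract, not LLM Guard's implementation; the exact normalization LLM Guard applies to the risk score may differ.

```python
def interpret(rejection_score: float, threshold: float = 0.75) -> tuple[bool, float]:
    """Map a classifier rejection score to (is_valid, risk_score).

    Simplified illustration of the I/O contract: the output is flagged
    (is_valid=False) when the rejection score exceeds the threshold.
    Assumption: the risk score is reported on a 0-1 scale, here taken
    as the rounded raw score; LLM Guard's normalization may differ.
    """
    is_valid = rejection_score <= threshold
    risk_score = round(rejection_score, 2)
    return is_valid, risk_score


print(interpret(0.92))  # (False, 0.92) -- refusal detected
print(interpret(0.10))  # (True, 0.1)  -- response accepted
```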

Usage Examples

Basic Refusal Detection

from llm_guard.output_scanners import NoRefusal

scanner = NoRefusal(threshold=0.75)

prompt = "How do I bake a cake?"
output = "Here is a simple recipe for chocolate cake..."
_, is_valid, _ = scanner.scan(prompt, output)
# is_valid: True

refusal_output = "I'm sorry, but I cannot help with that request."
_, is_valid, score = scanner.scan(prompt, refusal_output)
# is_valid: False
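The lightweight NoRefusalLight variant mentioned above avoids the model entirely and relies on plain substring matching. A self-contained sketch of that approach follows; the phrase list here is a small illustrative sample, not the actual list of 27 refusal phrases shipped with LLM Guard.

```python
# Illustrative substring-based refusal check, mimicking the approach of
# NoRefusalLight. The phrase list is a small demonstration sample; the real
# scanner ships its own curated list of 27 known refusal phrases.

SAMPLE_REFUSAL_PHRASES = [
    "i'm sorry",
    "i cannot",
    "i can't assist",
    "as an ai",
]


def is_refusal_light(output: str) -> bool:
    """Return True if the output contains any known refusal phrase."""
    lowered = output.lower()
    return any(phrase in lowered for phrase in SAMPLE_REFUSAL_PHRASES)


print(is_refusal_light("Here is a simple recipe for chocolate cake..."))   # False
print(is_refusal_light("I'm sorry, but I cannot help with that request."))  # True
```

The trade-off is the usual one: substring matching is fast and dependency-free but misses paraphrased refusals that a classifier would catch.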

Related Pages

Implements Principle

Requires Environment
