

Implementation:Protectai Llm guard Output MaliciousURLs

From Leeroopedia
Knowledge Sources
Domains: Security, URL_Classification
Last Updated: 2026-02-14 12:00 GMT

Overview

MaliciousURLs is an output scanner that detects and classifies URLs in LLM responses as benign or malicious using a CodeBERT-based model.

Description

The MaliciousURLs output scanner has its own standalone implementation rather than being a thin wrapper around another scanner. It extracts the URLs from the LLM output and classifies each one with a text-classification pipeline, using the DunnBC22/codebert-base-Malicious_URLs model by default. The model assigns URLs to categories including benign, defacement, phishing, and malware. If, for any extracted URL, the score of a malicious label (defacement, phishing, or malware) exceeds the threshold, the output is flagged as invalid. Because each URL is classified individually, a single malicious link is enough to flag an output that otherwise contains only benign links.
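The flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not llm_guard's source: the regex-based `extract_urls_naive` and the injected `classify` callable are hypothetical stand-ins for the library's URL extractor and its text-classification pipeline. `classify` is assumed to return, for each URL, a list of `{"label": ..., "score": ...}` dicts, and the third returned value is the raw highest malicious score, not llm_guard's risk score.

```python
import re
from typing import Callable

# Labels treated as malicious, as listed in the description above.
MALICIOUS_LABELS = {"defacement", "phishing", "malware"}

# Hypothetical, deliberately simple pattern; the real scanner's extractor may differ.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

# Assumed classifier shape: one list of label/score dicts per input URL.
Classifier = Callable[[list[str]], list[list[dict]]]


def extract_urls_naive(text: str) -> list[str]:
    """Return http(s) URLs found in text, stripping trailing sentence punctuation."""
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]


def scan_urls_sketch(output: str, classify: Classifier, threshold: float = 0.5) -> tuple[str, bool, float]:
    """Flag the output if any URL's highest malicious-label score exceeds threshold."""
    urls = extract_urls_naive(output)
    if not urls:
        return output, True, 0.0  # nothing to classify
    highest = 0.0
    for url, label_scores in zip(urls, classify(urls)):
        # Highest score among the malicious labels for this URL.
        malicious = max(
            (item["score"] for item in label_scores if item["label"] in MALICIOUS_LABELS),
            default=0.0,
        )
        highest = max(highest, malicious)
        if malicious > threshold:
            return output, False, malicious  # one malicious URL is enough
    return output, True, highest


def fake_classify(urls: list[str]) -> list[list[dict]]:
    """Hypothetical stub: treats any URL containing 'phish' as likely phishing."""
    results = []
    for url in urls:
        p = 0.9 if "phish" in url else 0.05
        results.append([{"label": "benign", "score": 1.0 - p}, {"label": "phishing", "score": p}])
    return results


print(scan_urls_sketch("See https://docs.example.com.", fake_classify))  # valid, score 0.05
print(scan_urls_sketch("Login at https://phish.example.net/login", fake_classify))  # invalid, score 0.9
```

Injecting the classifier keeps the decision rule testable without downloading a model; with the real scanner, the CodeBERT pipeline plays the role of `fake_classify`.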

Usage

Use this scanner to prevent LLM outputs from containing malicious URLs. This is critical for chatbots and assistants that may generate or repeat URLs from training data or user-provided context. The scanner protects end users from phishing links, malware distribution URLs, and defaced websites that might appear in LLM responses.

Code Reference

Source Location

Signature

class MaliciousURLs(Scanner):
    def __init__(
        self,
        *,
        model: Model | None = None,
        threshold: float = 0.5,
        use_onnx: bool = False,
    ) -> None: ...

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]: ...

Import

from llm_guard.output_scanners import MaliciousURLs

I/O Contract

Inputs

Name | Type | Required | Description
prompt | str | Yes | The input prompt
output | str | Yes | The LLM output to scan for malicious URLs

Constructor Parameters

Name | Type | Required | Default | Description
model | Model or None | No | None | Custom URL classification model (defaults to DunnBC22/codebert-base-Malicious_URLs)
threshold | float | No | 0.5 | Score a malicious label must exceed for a URL to be flagged
use_onnx | bool | No | False | Whether to use the ONNX runtime for inference

Outputs

Name | Type | Description
sanitized_output | str | The output (potentially modified)
is_valid | bool | Whether the output passed the scan (True if no malicious URLs were found)
risk_score | float | Risk score (-1.0 to 1.0)
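One way to act on this output contract is to gate the response before it reaches the user. The sketch below is hypothetical: `guard_response`, `FALLBACK_MESSAGE`, and the `StubScanner` test double are illustrative names, and any object with a `scan(prompt, output)` method returning `(sanitized_output, is_valid, risk_score)`, such as `MaliciousURLs`, is assumed to fit.

```python
from typing import Protocol


class OutputScanner(Protocol):
    """Any object exposing the scan() contract from the table above."""

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]: ...


# Hypothetical message shown instead of a response that failed the scan.
FALLBACK_MESSAGE = "The response was withheld because it contained a potentially malicious link."


def guard_response(scanner: OutputScanner, prompt: str, output: str) -> tuple[str, float]:
    """Return the text to show the user and the scanner's risk score."""
    sanitized_output, is_valid, risk_score = scanner.scan(prompt, output)
    if not is_valid:
        return FALLBACK_MESSAGE, risk_score
    # Pass along the scanner's sanitized text rather than the raw output.
    return sanitized_output, risk_score


class StubScanner:
    """Hypothetical test double that fails any output mentioning 'bad.example'."""

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]:
        # Scores here are arbitrary placeholders, not real model output.
        if "bad.example" in output:
            return output, False, 1.0
        return output, True, 0.0


print(guard_response(StubScanner(), "links?", "Try https://bad.example/x"))  # fallback message, 1.0
```

With llm_guard installed, `MaliciousURLs(threshold=0.5)` would take the place of `StubScanner()`.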

Usage Examples

Basic Usage

from llm_guard.output_scanners import MaliciousURLs

scanner = MaliciousURLs(threshold=0.5)

prompt = "Give me some useful resources"
output = "Check out https://example.com for documentation and https://legitimate-site.org for tutorials."

sanitized_output, is_valid, risk_score = scanner.scan(prompt, output)

if not is_valid:
    print(f"Malicious URL detected (risk: {risk_score})")
else:
    print("All URLs appear safe")

Strict Threshold

from llm_guard.output_scanners import MaliciousURLs

# Use a lower threshold for stricter detection
scanner = MaliciousURLs(threshold=0.3)

prompt = "Where can I download the software?"
output = "You can download it from https://suspicious-download-site.xyz/free-software"

sanitized_output, is_valid, risk_score = scanner.scan(prompt, output)
print(f"Safe: {is_valid}, Risk: {risk_score}")
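Lowering the threshold changes only the decision rule, not the model's scores. The sketch below uses made-up per-label scores (not real output of DunnBC22/codebert-base-Malicious_URLs) and a hypothetical `flagged_urls` helper that applies the rule described above: a URL is flagged when its highest score among the defacement, phishing, and malware labels exceeds the threshold.

```python
MALICIOUS_LABELS = ("defacement", "phishing", "malware")

# Made-up label scores, for illustration only.
scores_by_url = {
    "https://example.com": {"benign": 0.97, "phishing": 0.02, "malware": 0.01},
    "https://suspicious-download-site.xyz/free-software": {"benign": 0.58, "malware": 0.35, "phishing": 0.07},
}


def flagged_urls(scores: dict[str, dict[str, float]], threshold: float) -> list[str]:
    """Return URLs whose highest malicious-label score exceeds threshold."""
    flagged = []
    for url, label_scores in scores.items():
        # Labels missing from a URL's scores count as 0.0.
        highest = max(label_scores.get(label, 0.0) for label in MALICIOUS_LABELS)
        if highest > threshold:
            flagged.append(url)
    return flagged


print(flagged_urls(scores_by_url, 0.5))  # []
print(flagged_urls(scores_by_url, 0.3))  # ['https://suspicious-download-site.xyz/free-software']
```

A malware score of 0.35 passes at the default 0.5 but is flagged at 0.3, which is the trade-off the stricter setting makes: more detections at the cost of more false positives on benign links.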
