Implementation:Protectai Llm guard Output MaliciousURLs

Knowledge Sources	Protectai_Llm_guard
Domains	Security, URL_Classification
Last Updated	2026-02-14 12:00 GMT

Overview

MaliciousURLs is an output scanner that detects and classifies URLs in LLM responses as benign or malicious using a CodeBERT-based model.

Description

The MaliciousURLs output scanner has its own standalone implementation rather than wrapping another scanner. It extracts URLs from the LLM output text and classifies each one individually with a text classification pipeline, using the DunnBC22/codebert-base-Malicious_URLs model by default. The model assigns each URL to one of several categories, including benign, defacement, phishing, and malware. If any URL receives a malicious label (defacement, phishing, or malware) with a confidence score above the threshold, the output is flagged as invalid.
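The flow described above (extract URLs, score each one, flag the output if any malicious label exceeds the threshold) can be sketched in plain Python. This is only an illustration of the decision rule, not the library's code: `find_urls`, `check_output`, the regular expression, and the `classify` callable are hypothetical stand-ins, and the real scanner's URL extraction and score handling may differ.

```python
import re
from typing import Callable

# Labels the description treats as malicious.
MALICIOUS_LABELS = {"defacement", "phishing", "malware"}

# Assumption: a deliberately simple URL pattern, for illustration only.
URL_PATTERN = re.compile(r"https?://[^\s<>()]+")


def find_urls(text: str) -> list[str]:
    """Return URLs found in text, with trailing sentence punctuation removed."""
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]


def check_output(
    text: str,
    classify: Callable[[str], dict[str, float]],
    threshold: float = 0.5,
) -> tuple[bool, float]:
    """Return (is_valid, highest malicious score) for the given text.

    `classify` is a hypothetical stand-in for the model: it maps a URL to a
    dict of label -> confidence score.
    """
    worst = 0.0
    for url in find_urls(text):
        scores = classify(url)
        malicious = max(
            (s for label, s in scores.items() if label in MALICIOUS_LABELS),
            default=0.0,
        )
        worst = max(worst, malicious)
    # The output is invalid only when a malicious score is above the threshold.
    return worst <= threshold, worst
```

Passing the classifier in as a callable keeps the decision rule testable without downloading a model; lowering `threshold` makes the same scores more likely to flag the output.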

Usage

Use this scanner to prevent LLM outputs from containing malicious URLs. This is critical for chatbots and assistants that may generate or repeat URLs from training data or user-provided context. The scanner protects end users from phishing links, malware distribution URLs, and defaced websites that might appear in LLM responses.
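One way a chatbot might act on the scan result is to withhold a flagged reply and return a fallback message instead. The sketch below assumes only the `scan(prompt, output)` signature documented on this page; `guard_reply`, `OutputScanner`, and the fallback text are hypothetical names introduced for illustration.

```python
from typing import Protocol


class OutputScanner(Protocol):
    """Anything exposing the scan() signature documented for MaliciousURLs."""

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]: ...


# Hypothetical fallback text; choose wording that suits the application.
FALLBACK = "I can't share that link because it may be unsafe."


def guard_reply(
    scanner: OutputScanner,
    prompt: str,
    output: str,
    fallback: str = FALLBACK,
) -> str:
    """Return the scanned output if it passed, otherwise the fallback message."""
    sanitized_output, is_valid, _risk_score = scanner.scan(prompt, output)
    return sanitized_output if is_valid else fallback
```

In an application, a `MaliciousURLs` instance would be passed as `scanner`; typing against a protocol rather than the concrete class lets the wrapper be exercised with a stub in tests.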

Code Reference

Source Location

Repository: Protectai_Llm_guard
File: llm_guard/output_scanners/malicious_urls.py
Lines: 1-112

Signature

class MaliciousURLs(Scanner):
    def __init__(
        self,
        *,
        model: Model | None = None,
        threshold: float = 0.5,
        use_onnx: bool = False,
    ) -> None: ...

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]: ...

Import

from llm_guard.output_scanners import MaliciousURLs

I/O Contract

Inputs

Name	Type	Required	Description
prompt	str	Yes	The input prompt
output	str	Yes	The LLM output to scan for malicious URLs

Constructor Parameters

Name	Type	Required	Default	Description
model	Model | None	No	None	Custom URL classification model (defaults to DunnBC22/codebert-base-Malicious_URLs)
threshold	float	No	0.5	Minimum confidence score to classify a URL as malicious
use_onnx	bool	No	False	Whether to use ONNX runtime for inference

Outputs

Name	Type	Description
sanitized_output	str	The output (potentially modified)
is_valid	bool	Whether the output passed the scan (True if no malicious URLs found)
risk_score	float	Risk score (-1.0 to 1.0)

Usage Examples

Basic Usage

from llm_guard.output_scanners import MaliciousURLs

scanner = MaliciousURLs(threshold=0.5)

prompt = "Give me some useful resources"
output = "Check out https://example.com for documentation and https://legitimate-site.org for tutorials."

sanitized_output, is_valid, risk_score = scanner.scan(prompt, output)

if not is_valid:
    print(f"Malicious URL detected (risk: {risk_score})")
else:
    print("All URLs appear safe")

Strict Threshold

from llm_guard.output_scanners import MaliciousURLs

# Use a lower threshold for stricter detection
scanner = MaliciousURLs(threshold=0.3)

prompt = "Where can I download the software?"
output = "You can download it from https://suspicious-download-site.xyz/free-software"

sanitized_output, is_valid, risk_score = scanner.scan(prompt, output)
print(f"Safe: {is_valid}, Risk: {risk_score}")

Related Pages

Principle:Protectai_Llm_guard_Malicious_URL_Detection
