Overview
MaliciousURLs is an output scanner that detects and classifies URLs in LLM responses as benign or malicious using a CodeBERT-based model.
Description
The MaliciousURLs output scanner is a standalone implementation rather than a thin wrapper around another scanner. It extracts URLs from the LLM output text and classifies each one, by default with the DunnBC22/codebert-base-Malicious_URLs model. The model assigns each URL to one of four categories: benign, defacement, phishing, or malware. If any URL in the output is classified under one of the malicious labels (defacement, phishing, or malware) with a confidence score above the threshold, the output is flagged as invalid. The scanner runs a text classification pipeline over every extracted URL individually, so a single malicious link is enough to flag an output that otherwise contains only benign URLs.
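The decision rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the scanner's actual code: `extract_urls`, `flag_output`, `toy_classifier`, and the URL regex are hypothetical stand-ins, and the toy classifier replaces the real model so the example runs without downloading anything. Only the four labels and the "above the threshold" rule come from the description.

```python
import re
from typing import Callable

# Labels treated as malicious, as listed in the description above.
MALICIOUS_LABELS = {"defacement", "phishing", "malware"}

# Simplified URL pattern for illustration only (assumption);
# the scanner's real extraction logic may differ.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

Scores = dict[str, float]


def extract_urls(text: str) -> list[str]:
    """Find http(s) URLs and strip trailing sentence punctuation."""
    return [match.rstrip(".,;:!?") for match in URL_PATTERN.findall(text)]


def flag_output(
    text: str,
    classify: Callable[[str], Scores],
    threshold: float = 0.5,
) -> tuple[bool, float]:
    """Return (is_valid, highest malicious score across all URLs)."""
    worst = 0.0
    for url in extract_urls(text):
        scores = classify(url)
        # Take the strongest malicious label for this URL.
        malicious = max(scores.get(label, 0.0) for label in MALICIOUS_LABELS)
        worst = max(worst, malicious)
    # The output is invalid as soon as one URL scores above the threshold.
    return worst <= threshold, worst


def toy_classifier(url: str) -> Scores:
    """Hypothetical stand-in for the model: flags URLs containing 'evil'."""
    if "evil" in url:
        return {"benign": 0.1, "defacement": 0.05, "phishing": 0.8, "malware": 0.05}
    return {"benign": 0.97, "defacement": 0.01, "phishing": 0.01, "malware": 0.01}


print(flag_output("See https://example.com and https://evil.test/login.", toy_classifier))
```

Because the rule takes the maximum over all URLs, adding more benign links to an output never lowers the score contributed by a malicious one.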
Usage
Use this scanner to prevent LLM outputs from containing malicious URLs. This is critical for chatbots and assistants that may generate or repeat URLs from training data or user-provided context. The scanner protects end users from phishing links, malware distribution URLs, and defaced websites that might appear in LLM responses.
Code Reference
Source Location
Signature
class MaliciousURLs(Scanner):
    def __init__(
        self,
        *,
        model: Model | None = None,
        threshold: float = 0.5,
        use_onnx: bool = False,
    ) -> None: ...

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]: ...
Import
from llm_guard.output_scanners import MaliciousURLs
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | The input prompt |
| output | str | Yes | The LLM output to scan for malicious URLs |
Constructor Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| model | Model \| None | No | None | Custom URL classification model (defaults to DunnBC22/codebert-base-Malicious_URLs) |
| threshold | float | No | 0.5 | Minimum confidence score to classify a URL as malicious |
| use_onnx | bool | No | False | Whether to use ONNX runtime for inference |
Outputs
| Name | Type | Description |
|---|---|---|
| sanitized_output | str | The output (potentially modified) |
| is_valid | bool | Whether the output passed the scan (True if no malicious URLs found) |
| risk_score | float | Risk score (-1.0 to 1.0) |
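One way to consume the three-element return value is to replace a flagged response with a fallback message. The sketch below is a hedged illustration: `guard_response`, `FALLBACK`, and `StubScanner` are hypothetical names, and the stub stands in for a real scanner so the example runs without the model. Its risk values are arbitrary picks from the documented range.

```python
from typing import Protocol


class OutputScanner(Protocol):
    """Anything exposing the scan() contract documented above."""

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]: ...


# Hypothetical fallback text shown to the end user.
FALLBACK = "The response was withheld because it contained a potentially unsafe link."


def guard_response(
    scanner: OutputScanner,
    prompt: str,
    output: str,
    fallback: str = FALLBACK,
) -> str:
    """Return the sanitized output if it passed the scan, else the fallback."""
    sanitized_output, is_valid, risk_score = scanner.scan(prompt, output)
    return sanitized_output if is_valid else fallback


class StubScanner:
    """Stand-in for MaliciousURLs (assumption): flags outputs containing 'evil'."""

    def scan(self, prompt: str, output: str) -> tuple[str, bool, float]:
        if "evil" in output:
            return output, False, 1.0
        return output, True, -1.0


print(guard_response(StubScanner(), "Any links?", "Try https://evil.test/login"))
```

Since the helper depends only on the `scan` signature, a real `MaliciousURLs` instance could be passed in place of the stub.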
Usage Examples
Basic Usage
from llm_guard.output_scanners import MaliciousURLs

scanner = MaliciousURLs(threshold=0.5)

prompt = "Give me some useful resources"
output = "Check out https://example.com for documentation and https://legitimate-site.org for tutorials."

sanitized_output, is_valid, risk_score = scanner.scan(prompt, output)

if not is_valid:
    print(f"Malicious URL detected (risk: {risk_score})")
else:
    print("All URLs appear safe")
Strict Threshold
from llm_guard.output_scanners import MaliciousURLs
# Use a lower threshold for stricter detection
scanner = MaliciousURLs(threshold=0.3)
prompt = "Where can I download the software?"
output = "You can download it from https://suspicious-download-site.xyz/free-software"
sanitized_output, is_valid, risk_score = scanner.scan(prompt, output)
print(f"Safe: {is_valid}, Risk: {risk_score}")
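To see why a lower threshold is stricter, consider a URL whose malicious confidence falls between the two settings. The snippet below is a minimal sketch with a hypothetical score of 0.4 and a hypothetical helper `is_flagged`; it applies only the "above the threshold" rule from the description and does not call the model.

```python
def is_flagged(malicious_score: float, threshold: float) -> bool:
    """A URL is flagged when its malicious confidence exceeds the threshold."""
    return malicious_score > threshold


# Hypothetical model confidence for a borderline URL.
score = 0.4

for threshold in (0.5, 0.3):
    verdict = "flagged" if is_flagged(score, threshold) else "passed"
    print(f"threshold={threshold}: {verdict}")
```

The same URL passes at the default threshold of 0.5 but is flagged at 0.3, so lowering the threshold trades more false positives for fewer missed malicious links.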