Principle:Protectai Llm guard Malicious URL Detection

Knowledge Sources	Protectai_Llm_guard
Domains	Security, URL_Classification
Last Updated	2026-02-14 12:00 GMT

Overview

Classifies URLs extracted from LLM output text as benign or malicious using a transformer-based classifier.

Description

Large language models can generate text containing URLs that point to malicious destinations -- including phishing sites, malware distribution points, and defacement pages. This principle provides a security guardrail that extracts, analyzes, and classifies every URL found in the generated output.

Detection begins with regex-based extraction of every URL embedded in the output text. Each extracted URL is then passed individually to a CodeBERT-based classifier that has been fine-tuned on a dataset of labeled URLs covering four classes, one benign and three malicious:

Benign -- safe, legitimate URLs.
Phishing -- URLs designed to impersonate trusted sites and steal credentials.
Malware -- URLs that distribute malicious software.
Defacement -- URLs associated with website defacement attacks.

The classifier produces a probability score for each category. If the score for any malicious category exceeds the configured threshold, the URL is flagged as dangerous and the overall output is marked as unsafe.
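
The scoring and threshold decision described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the label order, the example logits, the default threshold of 0.5, and the `flag_url` helper are all assumptions.

```python
import math

# Assumed label order; the real model's label mapping may differ.
LABELS = ["benign", "phishing", "malware", "defacement"]


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_url(logits, threshold=0.5):
    """Return (is_malicious, top_threat_label, top_threat_score).

    Hypothetical helper: only the malicious categories are compared
    against the threshold, so a high benign score never flags a URL.
    """
    probs = dict(zip(LABELS, softmax(logits)))
    threats = {k: v for k, v in probs.items() if k != "benign"}
    label = max(threats, key=threats.get)
    score = threats[label]
    return score > threshold, label, score


# Hypothetical logits for a single URL.
print(flag_url([0.2, 3.1, 0.4, -1.0]))
```

Comparing only the highest malicious probability against the threshold means a single dominant threat category is enough to flag the URL, regardless of how the remaining probability mass is distributed.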

Usage

Apply this principle when LLM outputs may contain URLs that users could visit:

Chatbots or assistants that provide web links in their responses.
Content generation systems where URLs are embedded in the output.
Code generation scenarios where generated code references external URLs.
Any application where clicking a malicious link could compromise user security.

Theoretical Basis

The malicious URL detection pipeline operates as follows:

1. Apply a regex pattern to extract all URLs from the output text.
2. For each extracted URL:
   a. Tokenize the URL string using the CodeBERT tokenizer, treating the URL
      as a sequence of subword tokens.
   b. Pass the tokenized URL through the fine-tuned classification model.
   c. Apply softmax to obtain category probabilities:
      P(benign), P(phishing), P(malware), P(defacement)
   d. Compute the maximum malicious score:
      max_threat = max(P(phishing), P(malware), P(defacement))
   e. If max_threat > threshold, flag the URL as malicious.
3. If any URL in the output is flagged as malicious, mark the entire output as unsafe.
4. Return the list of flagged URLs along with their threat categories and scores.
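
The steps above can be sketched end to end as follows. This is a hedged illustration under stated assumptions: `URL_PATTERN` is a simplified regex, `scan_output` is a hypothetical function name, and `stub_classifier` is a keyword-based stand-in for the fine-tuned CodeBERT model so that the example runs without downloading any weights. None of these names come from the llm-guard source.

```python
import re
from typing import Callable, Dict, List, Tuple

# Simplified illustrative pattern; a production extractor would be more thorough.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

# A classifier maps a URL string to per-category probabilities.
Classifier = Callable[[str], Dict[str, float]]


def scan_output(
    text: str, classify: Classifier, threshold: float = 0.5
) -> Tuple[bool, List[Tuple[str, str, float]]]:
    """Return (is_safe, flagged), where flagged holds (url, category, score)."""
    flagged = []
    for url in URL_PATTERN.findall(text):                 # step 1: extract URLs
        probs = classify(url)                             # steps 2a-2c: classify
        threats = {k: v for k, v in probs.items() if k != "benign"}
        category = max(threats, key=threats.get)          # step 2d: max threat
        if threats[category] > threshold:                 # step 2e: threshold
            flagged.append((url, category, threats[category]))
    return not flagged, flagged                           # steps 3 and 4


def stub_classifier(url: str) -> Dict[str, float]:
    """Hypothetical stand-in for the fine-tuned model (keyword-based)."""
    if "login-verify" in url:
        return {"benign": 0.05, "phishing": 0.90, "malware": 0.03, "defacement": 0.02}
    return {"benign": 0.97, "phishing": 0.01, "malware": 0.01, "defacement": 0.01}


text = "Docs: https://example.com/guide and http://login-verify.example.net/reset"
is_safe, flagged = scan_output(text, stub_classifier)
print(is_safe, flagged)
```

Passing the classifier in as a callable keeps the extraction and aggregation logic testable on its own; a real model wrapper could be substituted for the stub without changing `scan_output`.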

Related Pages

Implementation:Protectai_Llm_guard_Output_MaliciousURLs
