
Implementation:Online ml River Datasets MaliciousURL

From Leeroopedia
Revision as of 16:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Online_ml_River_Datasets_MaliciousURL.md)


Knowledge Sources
Domains Online_Learning, Datasets, Binary_Classification, Security, Sparse_Data
Last Updated 2026-02-08 16:00 GMT

Overview

A concrete binary classification dataset with sparse features, provided by the River library.

Description

Malicious URLs dataset. Each sample describes a URL that is labeled as malicious or not. The dataset contains 2,396,130 samples and 3,231,961 sparse features for binary classification, stored in LibSVM format across 150 daily files.

Usage

This dataset is useful for:

  • Large-scale online learning with sparse features
  • URL security and malicious website detection
  • Evaluating algorithms on high-dimensional sparse data
  • Cybersecurity and web safety applications
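As an illustration of the first use case, the following is a minimal sketch of test-then-train (progressive) evaluation with a logistic regression that stores weights in a dict, so each update only touches the features present in a sample. It uses only the standard library rather than River's own models; the function names, learning rate, and toy stream are assumptions made for this example and are not part of the dataset.

```python
import math


def predict_proba(weights, x):
    """Probability of the positive class for a sparse sample x (index -> value)."""
    z = sum(weights.get(i, 0.0) * v for i, v in x.items())
    # Clamp the score so math.exp cannot overflow on extreme values.
    z = max(-35.0, min(35.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def learn_one(weights, x, y, lr=0.1):
    """One SGD step on the log loss; only weights of features in x are updated."""
    error = predict_proba(weights, x) - float(y)
    for i, v in x.items():
        weights[i] = weights.get(i, 0.0) - lr * error * v


def progressive_accuracy(stream, lr=0.1):
    """Predict each sample before learning from it and track the accuracy."""
    weights = {}
    correct = 0
    total = 0
    for x, y in stream:
        correct += (predict_proba(weights, x) >= 0.5) == y
        total += 1
        learn_one(weights, x, y, lr)
    return (correct / total if total else 0.0), weights


# Hypothetical toy stream: feature 1 appears in positive samples, feature 2 in
# negative ones, and feature 7 in both.
toy_stream = [({1: 1.0, 7: 1.0}, True), ({2: 1.0, 7: 1.0}, False)] * 50
accuracy, weights = progressive_accuracy(toy_stream)
```

Because the weights live in a dict, memory grows with the number of distinct features actually seen, not with the full 3,231,961-dimensional feature space.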

Code Reference

Source Location

Signature

import os

from river.datasets import base


class MaliciousURL(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=2_396_130,
            n_features=3_231_961,
            task=base.BINARY_CLF,
            url="http://www.sysnet.ucsd.edu/projects/url/url_svmlight.tar.gz",
            filename="url_svmlight",
            size=2_210_273_352,
            sparse=True,
        )

    def _iter(self):
        # Collect the per-day files and order them by day number, not alphabetically
        files = list(self.path.glob("Day*.svm"))
        files.sort(key=lambda x: int(os.path.basename(x).split(".")[0][3:]))

        def parse_libsvm_feature(pair):
            # Each feature is encoded as "index:value"
            k, v = pair.split(":")
            return int(k), float(v)

        # There are 150 files with each one corresponding to a day
        for file in files:
            with open(file) as f:
                for line in f:
                    elements = line.rstrip().split(" ")
                    # The first element is the label: "+1" maps to True, anything else to False
                    y = elements.pop(0) == "+1"
                    x = dict(parse_libsvm_feature(pair) for pair in elements)
                    yield x, y
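To make the per-line parsing in `_iter` concrete, here is a small standalone sketch that converts one LibSVM-formatted line into a `(features, label)` pair. The helper name `parse_libsvm_line` and the sample line are hypothetical; unlike the method above, it splits on any whitespace, so repeated spaces are tolerated.

```python
def parse_libsvm_line(line):
    """Split one LibSVM-formatted line into (features, label)."""
    label, *pairs = line.split()
    x = {}
    for pair in pairs:
        index, value = pair.split(":")
        x[int(index)] = float(value)
    # Same convention as the method above: "+1" is the positive class.
    return x, label == "+1"


# Hypothetical line; real lines follow the same "label index:value ..." layout.
x, y = parse_libsvm_line("+1 4:0.0788382 5:0.124138 11:1\n")
```

Only the indices listed on the line end up in the dict, which is what keeps each sample small despite the very large feature space.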

Import

from river import datasets
dataset = datasets.MaliciousURL()

I/O Contract

Inputs

Name    Type  Required  Description
(none)  -     -         No parameters needed

Outputs

Name           Type                           Description
iter(dataset)  tuple[dict[int, float], bool]  Yields (features_dict, target) pairs; features_dict is sparse and maps integer feature indices to float values

Dataset Properties

Property            Value
Number of samples   2,396,130
Number of features  3,231,961
Task                Binary classification
Format              LibSVM (150 daily files)
Size                2,210,273,352 bytes (~2.2 GB, or ~2.06 GiB)
Sparse              Yes
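The size figure can be read in decimal or binary units, which differ by about 7%. The short calculation below converts the byte count listed above and divides it by the sample count; the variable names are chosen for this example, and the per-sample figure is only a rough average of the listed size, not a measured record length.

```python
size_bytes = 2_210_273_352
n_samples = 2_396_130

size_gb = size_bytes / 10**9   # decimal gigabytes (1 GB = 10^9 bytes)
size_gib = size_bytes / 2**30  # binary gibibytes (1 GiB = 2^30 bytes)
bytes_per_sample = size_bytes / n_samples

print(f"{size_gb:.2f} GB, {size_gib:.2f} GiB, {bytes_per_sample:.0f} bytes per sample")
```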

Features

  • Each sample is a sparse mapping from integer feature indices to numeric values, extracted from a URL
  • The dataset spans 150 days of collected URL data
  • Each day's data is stored in a separate LibSVM file matching Day*.svm, and the files are read in day order
  • Features represent various URL characteristics and patterns
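Because each day lives in its own file, the order in which files are read determines the order of the stream. The sketch below shows why the sort key in `_iter` extracts the day number as an integer: a plain string sort would place day 10 before day 2. The file names listed here are hypothetical examples of the `Day*.svm` pattern.

```python
import os

names = ["Day10.svm", "Day2.svm", "Day1.svm"]  # hypothetical file names


def day_number(path):
    """Extract the integer that follows the 'Day' prefix in the file name."""
    return int(os.path.basename(path).split(".")[0][3:])


lexicographic = sorted(names)                  # string order
chronological = sorted(names, key=day_number)  # numeric day order
```

With the numeric key, the samples are yielded in the order the days were collected, which matters when evaluating online learners on a time-ordered stream.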

Usage Examples

from river import datasets

dataset = datasets.MaliciousURL()
for x, y in dataset:
    print(x, y)
    break
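With roughly 2.4 million samples, it is often convenient to look at only a prefix of the stream. The following sketch caps iteration with `itertools.islice` and collects a few simple statistics. The `summarize` helper and the toy list are hypothetical; the list stands in for the dataset so the example runs without downloading anything, and the same function should accept any iterable of `(x, y)` pairs.

```python
from itertools import islice


def summarize(stream, n):
    """Consume at most n (x, y) pairs and report simple statistics."""
    n_seen = n_positive = 0
    indices = set()
    for x, y in islice(stream, n):
        n_seen += 1
        n_positive += y    # True counts as 1, False as 0
        indices.update(x)  # adds the feature indices (the dict keys)
    return {"seen": n_seen, "positive": n_positive, "distinct_features": len(indices)}


# Hypothetical stand-in for the dataset: four sparse samples with boolean targets.
toy = [({1: 1.0, 4: 0.5}, True), ({4: 0.2}, False), ({9: 1.0}, True), ({2: 1.0}, False)]
stats = summarize(iter(toy), 3)
```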

References

  • Detecting Malicious URLs
  • Ma, J., Saul, L.K., Savage, S. and Voelker, G.M., 2009. Identifying Suspicious URLs: An Application of Large-Scale Online Learning. In ICML.
