Implementation:Online ml River Datasets MaliciousURL
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Datasets, Binary_Classification, Security, Sparse_Data |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
A concrete binary classification dataset with sparse features, provided by the River library.
Description
The Malicious URLs dataset contains features extracted from URLs, each labeled as malicious or not. It comprises 2,396,130 samples and 3,231,961 sparse features for binary classification, stored in LibSVM format across 150 daily files.
Usage
This dataset is useful for:
- Large-scale online learning with sparse features
- URL security and malicious website detection
- Evaluating algorithms on high-dimensional sparse data
- Cybersecurity and web safety applications
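To make the first use case concrete, here is a minimal pure-Python sketch of online learning over sparse dict features, evaluated in a test-then-train loop. It is an illustration only, not River's implementation: `SparseLogisticRegression`, `progressive_accuracy`, and `toy_stream` are hypothetical names, and the toy stream merely stands in for the `(x, y)` pairs the dataset yields.

```python
import math


class SparseLogisticRegression:
    """Hypothetical online logistic regression with dict-backed weights."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}  # feature index -> weight, grows only as features appear
        self.bias = 0.0

    def predict_proba_one(self, x):
        # Only the non-zero features of the sample contribute to the score.
        z = self.bias + sum(self.weights.get(i, 0.0) * v for i, v in x.items())
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # One SGD step on the log loss; touches only the features present in x.
        error = self.predict_proba_one(x) - float(y)
        self.bias -= self.lr * error
        for i, v in x.items():
            self.weights[i] = self.weights.get(i, 0.0) - self.lr * error * v


def progressive_accuracy(stream, model):
    """Predict each sample first, then learn from it (test-then-train)."""
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_proba_one(x) >= 0.5
        correct += int(y_pred == y)
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


def toy_stream(n):
    """Assumed stand-in for the dataset: feature 1 -> True, feature 2 -> False."""
    for t in range(n):
        yield ({1: 1.0}, True) if t % 2 == 0 else ({2: 1.0}, False)


model = SparseLogisticRegression()
accuracy = progressive_accuracy(toy_stream(200), model)
```

Because the weights live in a dict, memory grows with the number of distinct features actually seen rather than with the nominal 3,231,961 dimensions.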
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/datasets/malicious_url.py
Signature
import os

from . import base


class MaliciousURL(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=2_396_130,
            n_features=3_231_961,
            task=base.BINARY_CLF,
            url="http://www.sysnet.ucsd.edu/projects/url/url_svmlight.tar.gz",
            filename="url_svmlight",
            size=2_210_273_352,
            sparse=True,
        )

    def _iter(self):
        files = list(self.path.glob("Day*.svm"))
        files.sort(key=lambda x: int(os.path.basename(x).split(".")[0][3:]))

        def parse_libsvm_feature(f):
            k, v = f.split(":")
            return int(k), float(v)

        # There are 150 files with each one corresponding to a day
        for file in files:
            with open(file) as f:
                for line in f:
                    elements = line.rstrip().split(" ")
                    y = elements.pop(0) == "+1"
                    x = dict(parse_libsvm_feature(f) for f in elements)
                    yield x, y
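The per-line parsing inside `_iter` can be tried in isolation. The sketch below mirrors that logic in a standalone function; `parse_libsvm_line` is a hypothetical helper, not part of River, and it splits on any whitespace rather than on a single space as the source does.

```python
def parse_libsvm_line(line):
    """Turn one LibSVM line ("<label> <index>:<value> ...") into (x, y)."""
    label, *tokens = line.split()
    y = label == "+1"  # same label test as in _iter
    x = {}
    for token in tokens:
        index, value = token.split(":")
        x[int(index)] = float(value)  # integer key, float value
    return x, y


# Made-up lines in LibSVM format, for illustration only.
positive = parse_libsvm_line("+1 4:0.5 10:1 3231961:1\n")
negative = parse_libsvm_line("-1 7:1\n")
```

Only the features listed on a line end up in the dict, which is what keeps each sample small despite the very large feature space.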
Import
from river import datasets
dataset = datasets.MaliciousURL()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | — | — | No parameters needed |
Outputs
| Name | Type | Description |
|---|---|---|
| `__iter__()` | tuple(dict, bool) | Yields (x, y) pairs: x is a sparse dict mapping integer feature index to float value, and y is True when the sample's LibSVM label is "+1" |
Dataset Properties
| Property | Value |
|---|---|
| Number of samples | 2,396,130 |
| Number of features | 3,231,961 |
| Task | Binary classification |
| Format | LibSVM (150 daily files) |
| Size | 2,210,273,352 bytes (~2.2 GB, or ~2.06 GiB) |
| Sparse | Yes |
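A quick back-of-the-envelope calculation shows why the sparse representation matters. The sketch below uses only the figures from the table; the 8 bytes per value (a float64 dense matrix) is an assumption made for illustration, and `storage_summary` is a hypothetical helper.

```python
N_SAMPLES = 2_396_130
N_FEATURES = 3_231_961
SIZE_BYTES = 2_210_273_352  # size reported in the dataset metadata


def storage_summary(bytes_per_value=8):
    """Compare the reported size with a hypothetical dense float64 matrix."""
    dense_bytes = N_SAMPLES * N_FEATURES * bytes_per_value
    return {
        "size_gb": SIZE_BYTES / 10**9,    # decimal gigabytes
        "size_gib": SIZE_BYTES / 2**30,   # binary gibibytes
        "dense_tb": dense_bytes / 10**12,  # decimal terabytes
        "dense_to_reported_ratio": dense_bytes / SIZE_BYTES,
    }


summary = storage_summary()
print({name: round(value, 2) for name, value in summary.items()})
```

Under that assumption a dense matrix would need tens of terabytes, several orders of magnitude more than the reported size.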
Features
- Each sample is a sparse mapping from integer feature index to float value; features absent from a sample are omitted
- The dataset spans 150 days of collected URL data
- Each day's data is stored in a separate LibSVM file matching the pattern Day*.svm, and the files are read in ascending day order
- Features describe characteristics of the URLs, both lexical and host-based (Ma et al., 2009)
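The daily files have to be ordered by day number rather than by name, which is why `_iter` sorts with a numeric key. The sketch below shows the difference on a few file names that are illustrative only; `day_number` is a hypothetical helper that mirrors the source's sort key.

```python
def day_number(filename):
    """Extract the integer after the "Day" prefix, e.g. "Day10.svm" -> 10."""
    stem = filename.split(".")[0]
    return int(stem[3:])


# Illustrative names following the Day*.svm pattern.
names = ["Day10.svm", "Day2.svm", "Day1.svm", "Day100.svm"]

lexicographic = sorted(names)                  # plain string order
chronological = sorted(names, key=day_number)  # numeric day order
```

With plain string sorting, `Day10.svm` and `Day100.svm` come before `Day2.svm`, which would scramble the temporal order that an online learner sees.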
Usage Examples
from river import datasets
dataset = datasets.MaliciousURL()
for x, y in dataset:
    print(x, y)
    break
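With roughly 2.4 million samples, it is often practical to inspect only a prefix of the stream. The sketch below uses `itertools.islice` for that; `summarize_prefix` and `toy_stream` are hypothetical, and the toy generator stands in for `datasets.MaliciousURL()` so that the example runs without downloading the data.

```python
from itertools import islice


def summarize_prefix(stream, n):
    """Summarize the first n (x, y) pairs without consuming the rest."""
    n_seen = n_positive = n_nonzero = 0
    for x, y in islice(stream, n):
        n_seen += 1
        n_positive += bool(y)
        n_nonzero += len(x)  # number of features stored for this sample
    return {
        "n_seen": n_seen,
        "positive_rate": n_positive / n_seen if n_seen else 0.0,
        "mean_nonzero": n_nonzero / n_seen if n_seen else 0.0,
    }


def toy_stream():
    """Assumed stand-in for the dataset: an endless stream of sparse samples."""
    while True:
        yield {1: 1.0, 5: 0.5}, True
        yield {2: 1.0}, False


stats = summarize_prefix(toy_stream(), 4)
```

Passing the real dataset object instead of `toy_stream()` should work the same way, since both are iterables of `(x, y)` pairs.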
References
- Detecting Malicious URLs
- Ma, J., Saul, L.K., Savage, S. and Voelker, G.M., 2009. Identifying Suspicious URLs: An Application of Large-Scale Online Learning. In ICML.