Implementation:Online ml River Datasets WebTraffic

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Datasets, Multi_Output_Regression, Time_Series, Anomaly_Detection
Last Updated	2026-02-08 16:00 GMT

Overview

Concrete dataset for multi-output regression provided by the River library.

Description

Web session data from an events company based in South Africa. The goal is to predict the number of web sessions in 4 different regions of South Africa.

The data consists of traffic values at 15-minute intervals between '2023-06-16 00:00:00' and '2023-09-15 23:45:00' for each region. Two types of sessions are captured: sessionsA and sessionsB. The isMissing flag is 1 if any of the servers failed to capture sessions and 0 if all servers functioned properly.

This dataset contains 44,160 samples with 3 features and 2 output targets for multi-output regression tasks.
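
As a sanity check on the sample count, the arithmetic can be sketched in a few lines. This assumes every region has exactly one row per 15-minute slot over the stated period, which the page does not state outright. Under that assumption, the 44,160 rows correspond to five region series, which would be the four regions to predict plus the backup region R5.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2023-06-16 00:00:00", FMT)
end = datetime.strptime("2023-09-15 23:45:00", FMT)

# Number of 15-minute slots in the period, counting both endpoints
slots_per_series = (end - start) // timedelta(minutes=15) + 1

# Assumption: one row per slot per region, so rows / slots = number of region series
n_samples = 44_160
n_series, remainder = divmod(n_samples, slots_per_series)

print(slots_per_series, n_series, remainder)
```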

Usage

This dataset is useful for:

Multi-output time series forecasting
Anomaly detection in time series data
Handling missing values in streaming data
Multi-region prediction problems
Building robust models that handle anomalous events

Code Reference

Source Location

Repository: Online_ml_River
File: river/datasets/web_traffic.py

Signature

class WebTraffic(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            url="https://maxhalford.github.io/files/datasets/web-traffic.csv.zip",
            filename="web-traffic.csv",
            task=base.MO_REG,
            n_features=3,
            n_outputs=2,
            n_samples=44_160,
            size=2_769_905,
        )

    def _iter(self):
        return stream.iter_csv(
            self.path,
            dialect=PipeCSVDialect,
            target=["sessionsA", "sessionsB"],
            converters={
                "region": str,
                "isMissing": lambda x: x == "1.0",
                "sessionsA": lambda x: float(x) if x and x != "0.0" else None,
                "sessionsB": lambda x: float(x) if x and x != "0.0" else None,
            },
            parse_dates={"dateTime": "%Y-%m-%d %H:%M:%S"},
        )
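
To make the converter behaviour concrete, here is a standard-library sketch that applies the same parsing rules to a few rows held in memory. It does not call River. The two sample rows and their column order are invented for illustration, and the pipe delimiter is an assumption based on the PipeCSVDialect name and the pipe-delimited format listed for this dataset.

```python
import csv
import io
from datetime import datetime

# Invented rows for illustration only; not taken from the real file
raw = (
    "dateTime|region|isMissing|sessionsA|sessionsB\n"
    "2023-06-16 00:00:00|R1|0.0|12.0|3.0\n"
    "2023-06-16 00:15:00|R1|1.0|0.0|\n"
)


def to_target(value):
    # Mirrors the target converters: empty strings and "0.0" both become None
    return float(value) if value and value != "0.0" else None


records = []
for row in csv.DictReader(io.StringIO(raw), delimiter="|"):
    x = {
        "dateTime": datetime.strptime(row["dateTime"], "%Y-%m-%d %H:%M:%S"),
        "region": str(row["region"]),
        "isMissing": row["isMissing"] == "1.0",
    }
    y = {
        "sessionsA": to_target(row["sessionsA"]),
        "sessionsB": to_target(row["sessionsB"]),
    }
    records.append((x, y))

for x, y in records:
    print(x, y)
```

One consequence of these rules is that a genuine count of zero cannot be told apart from a missing value once the row has been converted.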

Import

from river import datasets
dataset = datasets.WebTraffic()

I/O Contract

Inputs

Name	Type	Required	Description
(none)	—	—	No parameters needed

Outputs

Name	Type	Description
Iteration (for x, y in dataset)	tuple(dict, dict)	Yields (features_dict, targets_dict) pairs, where targets_dict holds sessionsA and sessionsB

Dataset Properties

Property	Value
Number of samples	44,160
Number of features	3
Number of outputs	2
Task	Multi-output regression
Format	CSV (pipe-delimited, compressed)
Size	2,769,905 bytes
Time period	June 16 - Sept 15, 2023 (15-min intervals)
Regions	4 to predict, plus backup region R5 (the sample count is consistent with 5 region series)

Features

The dataset includes the following features:

dateTime: Timestamp of the observation (datetime, 15-minute intervals)
region: Region identifier (string)
isMissing: Flag indicating if servers failed to capture sessions (boolean)
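
Because most learners need numeric inputs, these raw features usually have to be encoded first. One possible encoding, shown here as a sketch and not as something River applies itself, maps the time of day onto a circle and one-hot encodes the region. The helper name encode_features and its output keys are hypothetical.

```python
import math
from datetime import datetime

SLOTS_PER_DAY = 96  # 24 hours x 4 quarter-hour slots


def encode_features(x):
    """Turn a raw record into a dict of numeric features (hypothetical helper)."""
    ts = x["dateTime"]
    slot = ts.hour * 4 + ts.minute // 15  # index of the 15-minute slot, 0..95
    angle = 2 * math.pi * slot / SLOTS_PER_DAY
    return {
        "tod_sin": math.sin(angle),  # cyclical time of day, so 23:45 sits next to 00:00
        "tod_cos": math.cos(angle),
        "weekday": float(ts.weekday()),  # Monday = 0
        f"region_{x['region']}": 1.0,  # one-hot: only the active region key is present
        "isMissing": float(x["isMissing"]),
    }


sample = {"dateTime": datetime(2023, 6, 16, 6, 0), "region": "R1", "isMissing": False}
features = encode_features(sample)
print(features)
```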

Target Outputs

Two simultaneous regression targets:

sessionsA: Number of type A sessions (float, or None when the raw value is empty or 0.0)
sessionsB: Number of type B sessions (float, or None when the raw value is empty or 0.0)

Key Considerations

Region R5 captures sessions in backup mode and may not be necessary to predict
Missing values are explicitly marked with the isMissing flag
The dataset includes anomalous events where servers failed
Can be used for both forecasting and anomaly detection
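
The first two considerations can be sketched as a small filter placed in front of a model. The function name usable_records and its parameters are hypothetical. It accepts any iterable of (features, targets) pairs shaped like the ones this dataset yields, so the demonstration uses a few invented records in place of the real stream.

```python
def usable_records(records, skip_regions=("R5",), require_all_targets=True):
    """Yield only the records worth training on (hypothetical helper)."""
    for x, y in records:
        if x["region"] in skip_regions:
            continue  # backup-mode region, optional to predict
        if x["isMissing"]:
            continue  # at least one server failed to capture sessions
        missing = [value is None for value in y.values()]
        if all(missing) or (require_all_targets and any(missing)):
            continue  # no usable target, or a partial one when all are required
        yield x, y


# Invented records for illustration only
demo = [
    ({"region": "R1", "isMissing": False}, {"sessionsA": 10.0, "sessionsB": 2.0}),
    ({"region": "R5", "isMissing": False}, {"sessionsA": 4.0, "sessionsB": 1.0}),
    ({"region": "R2", "isMissing": True}, {"sessionsA": None, "sessionsB": None}),
    ({"region": "R3", "isMissing": False}, {"sessionsA": 7.0, "sessionsB": None}),
]

kept = [x["region"] for x, _ in usable_records(demo)]
kept_partial = [x["region"] for x, _ in usable_records(demo, require_all_targets=False)]
print(kept, kept_partial)
```

For anomaly detection the flagged rows are the interesting ones, so in that setting the isMissing check would be dropped and the flag used as a label.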

Usage Examples

from river import datasets

dataset = datasets.WebTraffic()
for x, y in dataset:
    print(f"Features: {x}")
    print(f"Targets: {y}")
    break

Example with Multi-Output Regression

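from river import compose, datasets, linear_model, preprocessing


def extract_features(x):
    # The raw datetime, string and boolean fields are not numeric, so derive numeric features
    return {"hour": x["dateTime"].hour, "weekday": x["dateTime"].weekday()}


dataset = datasets.WebTraffic()

# LinearRegression is single-target, so fit one independent pipeline per target
models = {
    target: (
        compose.FuncTransformer(extract_features) |
        preprocessing.StandardScaler() |
        linear_model.LinearRegression()
    )
    for target in ("sessionsA", "sessionsB")
}

for x, y in dataset:
    for target, model in models.items():
        if y[target] is not None:  # skip missing targets
            y_pred = model.predict_one(x)  # predict before learning
            model.learn_one(x, y[target])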
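
The predict-then-learn order used in the loop above is what keeps the error estimate honest: each prediction is scored before the model sees the true value. The sketch below illustrates that pattern with a per-target running-mean baseline and a running mean absolute error, using only the standard library. The RunningMean class and the invented target records are hypothetical stand-ins, not River components.

```python
class RunningMean:
    """Baseline that predicts the mean of the targets seen so far (hypothetical)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self):
        return self.mean

    def learn_one(self, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n


targets = ("sessionsA", "sessionsB")
models = {t: RunningMean() for t in targets}
abs_err = {t: 0.0 for t in targets}
counts = {t: 0 for t in targets}

# Invented target dicts for illustration only
stream = [
    {"sessionsA": 10.0, "sessionsB": 2.0},
    {"sessionsA": 14.0, "sessionsB": None},
    {"sessionsA": 12.0, "sessionsB": 4.0},
]

for y in stream:
    for t in targets:
        if y[t] is None:
            continue  # nothing to score or learn from
        y_pred = models[t].predict_one()  # 1. predict first
        abs_err[t] += abs(y[t] - y_pred)  # 2. score the prediction
        counts[t] += 1
        models[t].learn_one(y[t])  # 3. only then update the model

mae = {t: abs_err[t] / counts[t] for t in targets}
print(mae)
```

A baseline like this gives a reference error that any real model on the dataset should beat.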

Related Pages

Environment:Online_ml_River_Python_Runtime_Environment
