Implementation:Online ml River Datasets WebTraffic

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Datasets, Multi_Output_Regression, Time_Series, Anomaly_Detection
Last Updated: 2026-02-08 16:00 GMT

Overview

Concrete dataset for multi-output regression provided by the River library.

Description

Web sessions information from an events company based in South Africa. The goal is to predict the number of web sessions in 4 different regions in South Africa.

The data consists of traffic values at 15-minute intervals between '2023-06-16 00:00:00' and '2023-09-15 23:45:00' for each region. Two types of sessions are captured: sessionsA and sessionsB. The isMissing flag is 1 if any of the servers failed to capture sessions and 0 if all servers functioned properly.

This dataset contains 44,160 samples with 3 features and 2 output targets for multi-output regression tasks.
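
As a sanity check, the sample count can be reproduced from the stated time span. The sketch below assumes every region has a complete series with no dropped timestamps, which the page does not state outright. Under that assumption, 44,160 rows correspond to five region series, which would match four regions to predict plus the backup region R5.

```python
from datetime import datetime, timedelta

start = datetime(2023, 6, 16, 0, 0, 0)
end = datetime(2023, 9, 15, 23, 45, 0)
step = timedelta(minutes=15)

# Number of 15-minute timestamps in the closed interval [start, end]
intervals_per_region = (end - start) // step + 1

n_samples = 44_160  # from the dataset metadata
# Assumption: each region contributes exactly one row per timestamp
implied_regions = n_samples / intervals_per_region

print(intervals_per_region)  # 8832
print(implied_regions)       # 5.0
```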

Usage

This dataset is useful for:

  • Multi-output time series forecasting
  • Anomaly detection in time series data
  • Handling missing values in streaming data
  • Multi-region prediction problems
  • Building robust models that handle anomalous events

Code Reference

Source Location

Signature

class WebTraffic(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            url="https://maxhalford.github.io/files/datasets/web-traffic.csv.zip",
            filename="web-traffic.csv",
            task=base.MO_REG,
            n_features=3,
            n_outputs=2,
            n_samples=44_160,
            size=2_769_905,
        )

    def _iter(self):
        return stream.iter_csv(
            self.path,
            dialect=PipeCSVDialect,
            target=["sessionsA", "sessionsB"],
            converters={
                "region": str,
                "isMissing": lambda x: x == "1.0",
                "sessionsA": lambda x: float(x) if x and x != "0.0" else None,
                "sessionsB": lambda x: float(x) if x and x != "0.0" else None,
            },
            parse_dates={"dateTime": "%Y-%m-%d %H:%M:%S"},
        )

Import

from river import datasets
dataset = datasets.WebTraffic()

I/O Contract

Inputs

Name | Type | Required | Description
(none) | - | - | No parameters needed

Outputs

Name | Type | Description
__iter__() | tuple(dict, dict) | Yields (features_dict, targets_dict) pairs, where targets_dict contains sessionsA and sessionsB

Dataset Properties

Property | Value
Number of samples | 44,160
Number of features | 3
Number of outputs | 2
Task | Multi-output regression
Format | CSV (pipe-delimited, compressed)
Size | 2,769,905 bytes
Time period | June 16 - Sept 15, 2023 (15-min intervals)
Regions | 4 to predict, plus backup region R5

Features

The dataset includes the following features:

  • dateTime: Timestamp of the observation (datetime, 15-minute intervals)
  • region: Region identifier (string)
  • isMissing: Flag indicating if servers failed to capture sessions (boolean)

Target Outputs

Two simultaneous regression targets:

  • sessionsA: Number of type A sessions (float, or None if the raw value is empty or "0.0")
  • sessionsB: Number of type B sessions (float, or None if the raw value is empty or "0.0")
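
To make the None convention concrete, the converters from the signature above can be applied to a few raw field values. This is a standalone sketch: the function bodies mirror the lambdas in `_iter`, the names `to_flag` and `to_sessions` are hypothetical, and the input strings are made-up illustrations rather than rows from the dataset.

```python
from typing import Optional


def to_flag(raw: str) -> bool:
    # Mirrors the isMissing converter: only the literal "1.0" is True
    return raw == "1.0"


def to_sessions(raw: str) -> Optional[float]:
    # Mirrors the sessionsA/sessionsB converters:
    # an empty field or a literal "0.0" becomes None
    return float(raw) if raw and raw != "0.0" else None


# Hypothetical raw field values
print(to_sessions("153.0"))  # 153.0
print(to_sessions("0.0"))    # None
print(to_sessions(""))       # None
print(to_flag("1.0"))        # True
print(to_flag("0.0"))        # False
```

Note that a recorded zero and an absent value are indistinguishable after conversion, so a target of None does not by itself say which of the two occurred.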

Key Considerations

  • Region R5 captures sessions in backup mode, so predicting it may not be necessary
  • Missing values are explicitly marked with the isMissing flag
  • The dataset includes anomalous events where servers failed
  • Can be used for both forecasting and anomaly detection
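
One way to act on these considerations is to filter the stream before training. The helper below is a minimal sketch: `keep_sample` is a hypothetical name, the region labels other than R5 are assumed, and the records are synthetic stand-ins shaped like the (features, targets) pairs the dataset yields. Whether to drop R5 or flagged records at all is a modeling choice, not something the dataset prescribes.

```python
from datetime import datetime


def keep_sample(x, y, skip_regions=("R5",)):
    """Return True if a (features, targets) pair is usable for training."""
    if x["region"] in skip_regions:
        return False  # backup region, optional to predict
    if x["isMissing"]:
        return False  # at least one server failed to capture sessions
    # Targets are None when the raw value was empty or "0.0"
    return all(value is not None for value in y.values())


# Synthetic records for illustration only
samples = [
    ({"dateTime": datetime(2023, 6, 16, 0, 0), "region": "R1", "isMissing": False},
     {"sessionsA": 120.0, "sessionsB": 30.0}),
    ({"dateTime": datetime(2023, 6, 16, 0, 0), "region": "R5", "isMissing": False},
     {"sessionsA": 80.0, "sessionsB": 12.0}),
    ({"dateTime": datetime(2023, 6, 16, 0, 15), "region": "R2", "isMissing": True},
     {"sessionsA": None, "sessionsB": None}),
]

usable = [(x, y) for x, y in samples if keep_sample(x, y)]
print(len(usable))  # 1
```

With the real dataset, the same predicate could be applied inside a `for x, y in datasets.WebTraffic():` loop.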

Usage Examples

from river import datasets

dataset = datasets.WebTraffic()
for x, y in dataset:
    print(f"Features: {x}")
    print(f"Targets: {y}")
    break

Example with Multi-Output Regression

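from river import compose, datasets, linear_model, preprocessing

def time_features(x):
    # Derive numeric features from the raw timestamp
    return {"hour": x["dateTime"].hour, "weekday": x["dateTime"].weekday()}

def make_model():
    # One-hot encode the region; scale the numeric time features
    features = (
        (compose.Select("region") | preprocessing.OneHotEncoder()) +
        (compose.FuncTransformer(time_features) | preprocessing.StandardScaler())
    )
    return features | linear_model.LinearRegression()

dataset = datasets.WebTraffic()

# LinearRegression predicts a single target, so fit one pipeline per output
models = {"sessionsA": make_model(), "sessionsB": make_model()}

for x, y in dataset:
    if y['sessionsA'] is not None and y['sessionsB'] is not None:
        y_pred = {name: m.predict_one(x) for name, m in models.items()}
        for name, m in models.items():
            m.learn_one(x, y[name])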
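
The loop above computes predictions without scoring them. As a minimal sketch of per-target evaluation, the class below keeps a running mean absolute error for each output and ignores targets that are None. `RunningMAE` is a hypothetical helper written for illustration, and the values fed to it are made up. River ships its own `metrics` module, which would normally be preferred.

```python
from collections import defaultdict


class RunningMAE:
    """Running mean absolute error, tracked separately per target name."""

    def __init__(self):
        self.abs_error_sum = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, y_true, y_pred):
        for name, truth in y_true.items():
            if truth is None or y_pred.get(name) is None:
                continue  # skip missing targets or predictions
            self.abs_error_sum[name] += abs(truth - y_pred[name])
            self.count[name] += 1

    def get(self):
        return {name: self.abs_error_sum[name] / self.count[name] for name in self.count}


# Made-up values for illustration only
mae = RunningMAE()
mae.update({"sessionsA": 100.0, "sessionsB": 20.0}, {"sessionsA": 90.0, "sessionsB": 25.0})
mae.update({"sessionsA": 110.0, "sessionsB": None}, {"sessionsA": 120.0, "sessionsB": 22.0})
print(mae.get())  # {'sessionsA': 10.0, 'sessionsB': 5.0}
```

In a training loop, `update` would be called with the true targets and the dictionary of predictions for each sample, after predicting and before learning, so that every prediction is scored on data the model has not yet seen.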
