Implementation:Online ml River Datasets WebTraffic
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Datasets, Multi_Output_Regression, Time_Series, Anomaly_Detection |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Concrete dataset for multi-output regression provided by the River library.
Description
Web session data from an events company based in South Africa. The goal is to predict the number of web sessions in 4 different regions of South Africa.
The data consists of traffic values recorded at 15-minute intervals between '2023-06-16 00:00:00' and '2023-09-15 23:45:00' for each region. Two types of sessions are captured: sessionsA and sessionsB. The isMissing flag is 1 if any server failed to capture sessions and 0 if all servers functioned properly.
This dataset contains 44,160 samples with 3 features and 2 output targets for multi-output regression tasks.
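The sample count can be sanity-checked against the stated date range and sampling interval. The sketch below uses only the standard library. The conclusion that the file holds five per-region series is an inference from this arithmetic, assuming every region has a row for every timestamp; the dataset description does not state it.

```python
from datetime import datetime, timedelta

# Date range and sampling interval quoted in the description.
start = datetime(2023, 6, 16, 0, 0, 0)
end = datetime(2023, 9, 15, 23, 45, 0)
step = timedelta(minutes=15)

# Number of 15-minute timestamps in the inclusive range.
intervals_per_series = (end - start) // step + 1

# Total rows reported for the dataset.
n_samples = 44_160

# Assumption: every region has exactly one row per timestamp.
series_count = n_samples / intervals_per_series

print(intervals_per_series)  # 8832
print(series_count)          # 5.0
```

Under that assumption, the row count matches five series: the four regions to predict plus the backup region R5.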
Usage
This dataset is useful for:
- Multi-output time series forecasting
- Anomaly detection in time series data
- Handling missing values in streaming data
- Multi-region prediction problems
- Building robust models that handle anomalous events
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/datasets/web_traffic.py
Signature
class WebTraffic(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            url="https://maxhalford.github.io/files/datasets/web-traffic.csv.zip",
            filename="web-traffic.csv",
            task=base.MO_REG,
            n_features=3,
            n_outputs=2,
            n_samples=44_160,
            size=2_769_905,
        )

    def _iter(self):
        return stream.iter_csv(
            self.path,
            dialect=PipeCSVDialect,
            target=["sessionsA", "sessionsB"],
            converters={
                "region": str,
                "isMissing": lambda x: x == "1.0",
                "sessionsA": lambda x: float(x) if x and x != "0.0" else None,
                "sessionsB": lambda x: float(x) if x and x != "0.0" else None,
            },
            parse_dates={"dateTime": "%Y-%m-%d %H:%M:%S"},
        )
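To make the converter behavior concrete, the sketch below re-implements the conversion rules from `_iter` as standalone functions and applies them to raw strings. The function names and the sample strings are hypothetical and chosen for illustration; only the conversion logic and the date format are taken from the signature above.

```python
from datetime import datetime

def parse_is_missing(raw: str) -> bool:
    # Same rule as the isMissing converter: only the exact string "1.0" is True.
    return raw == "1.0"

def parse_sessions(raw: str):
    # Same rule as the sessionsA/sessionsB converters:
    # empty strings and the exact string "0.0" become None.
    return float(raw) if raw and raw != "0.0" else None

def parse_datetime(raw: str) -> datetime:
    # Same format string as the parse_dates argument.
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

print(parse_is_missing("1.0"))   # True
print(parse_is_missing("0.0"))   # False
print(parse_sessions("12.0"))    # 12.0
print(parse_sessions("0.0"))     # None
print(parse_sessions(""))        # None
print(parse_datetime("2023-06-16 00:15:00"))  # 2023-06-16 00:15:00
```

One consequence of these rules is that a raw value of "0.0" is loaded as None, so a zero session count and a missing value look the same after loading.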
Import
from river import datasets
dataset = datasets.WebTraffic()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (none) | — | — | No parameters needed |
Outputs
| Name | Type | Description |
|---|---|---|
| __iter__() | tuple(dict, dict) | Iterating over the dataset yields (features_dict, targets_dict) pairs, where targets_dict holds sessionsA and sessionsB |
Dataset Properties
| Property | Value |
|---|---|
| Number of samples | 44,160 |
| Number of features | 3 |
| Number of outputs | 2 |
| Task | Multi-output regression |
| Format | CSV (pipe-delimited, compressed) |
| Size | 2,769,905 bytes |
| Time period | June 16 - Sept 15, 2023 (15-min intervals) |
| Regions | 4 to predict, plus backup region R5 (44,160 rows over 8,832 timestamps implies 5 series) |
Features
The dataset includes the following features:
- dateTime: Timestamp of the observation (datetime, 15-minute intervals)
- region: Region identifier (string)
- isMissing: Flag indicating if servers failed to capture sessions (boolean)
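None of the three raw features is a plain number, so most numeric models need them encoded first. The sketch below shows one possible encoding, not the one River applies: time of day as a sine/cosine pair, the weekday as a number, and the region as a one-hot key. The function name `encode_features` and the region label "R1" are hypothetical; only R5 is named in the dataset description.

```python
import math
from datetime import datetime

def encode_features(x: dict) -> dict:
    """Turn one raw feature dict into a purely numeric dict."""
    ts = x["dateTime"]
    minutes = ts.hour * 60 + ts.minute
    # Map the time of day onto a circle so 23:45 and 00:00 end up close together.
    angle = 2 * math.pi * minutes / (24 * 60)
    return {
        "tod_sin": math.sin(angle),
        "tod_cos": math.cos(angle),
        "weekday": float(ts.weekday()),   # Monday = 0 ... Sunday = 6
        f"region_{x['region']}": 1.0,     # one-hot key for the region
        "isMissing": float(x["isMissing"]),
    }

# Hypothetical record shaped like the feature dicts the dataset yields.
sample = {"dateTime": datetime(2023, 6, 16, 6, 0, 0), "region": "R1", "isMissing": False}
encoded = encode_features(sample)
print(encoded)
```

Because the region becomes a sparse key, unseen regions simply add new keys, which fits dict-based streaming models.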
Target Outputs
Two simultaneous regression targets:
- sessionsA: Number of type A sessions (float or None if missing)
- sessionsB: Number of type B sessions (float or None if missing)
Key Considerations
- Region R5 captures sessions in backup mode and may not be necessary to predict
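- Missing values are explicitly marked with the isMissing flag; the loader also converts empty or "0.0" session values to None, so a zero count cannot be told apart from a missing one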
- The dataset includes anomalous events where servers failed
- Can be used for both forecasting and anomaly detection
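The considerations above can be turned into a small filtering step placed in front of a model. The sketch below is one way to do it under stated assumptions: the helper name `usable_samples` is hypothetical, the toy records are invented for illustration, and region labels other than R5 are made up.

```python
def usable_samples(stream, skip_regions=("R5",)):
    """Yield only (x, y) pairs that are safe to train on."""
    for x, y in stream:
        # Drop the backup region, which may not need to be predicted.
        if x["region"] in skip_regions:
            continue
        # Drop rows flagged as missing or with any target loaded as None.
        if x["isMissing"] or any(v is None for v in y.values()):
            continue
        yield x, y

# Invented records shaped like the (features, targets) pairs the dataset yields.
toy_stream = [
    ({"region": "R1", "isMissing": False}, {"sessionsA": 10.0, "sessionsB": 4.0}),
    ({"region": "R5", "isMissing": False}, {"sessionsA": 3.0, "sessionsB": 1.0}),
    ({"region": "R2", "isMissing": True}, {"sessionsA": None, "sessionsB": 7.0}),
    ({"region": "R2", "isMissing": False}, {"sessionsA": 8.0, "sessionsB": None}),
]

kept = list(usable_samples(toy_stream))
print(len(kept))  # 1
```

For anomaly detection the flagged rows are the interesting ones, so in that setting the second condition would be inverted or removed rather than used as a filter.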
Usage Examples
from river import datasets
dataset = datasets.WebTraffic()
# Print the first (features, targets) pair, then stop
for x, y in dataset:
    print(f"Features: {x}")
    print(f"Targets: {y}")
    break
Example with Multi-Output Regression
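from river import compose, datasets, linear_model, multioutput, preprocessing

def to_numeric(x):
    # The raw record holds a datetime and a region string, which StandardScaler cannot scale
    ts = x["dateTime"]
    return {"hour": ts.hour, "weekday": ts.weekday(), "isMissing": float(x["isMissing"])}

dataset = datasets.WebTraffic()
# LinearRegression is single-output, so RegressorChain fits one copy per target
model = (
    compose.FuncTransformer(to_numeric) |
    preprocessing.StandardScaler() |
    multioutput.RegressorChain(linear_model.LinearRegression(), order=["sessionsA", "sessionsB"])
)
for x, y in dataset:
    # Skip rows where either target is missing
    if y['sessionsA'] is not None and y['sessionsB'] is not None:
        y_pred = model.predict_one(x)  # dict with one prediction per target
        model.learn_one(x, y)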