Implementation:Online ml River Datasets Taxis

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Datasets, Regression
Last Updated	2026-02-08 16:00 GMT

Overview

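Taxis is a concrete regression dataset provided by the River library. It is a remote dataset: the data is downloaded on demand rather than bundled with the package.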

Description

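Taxi ride durations in New York City. The goal is to predict the duration of each ride, in seconds, from its pickup and dropoff locations, its pickup timestamp, and other trip attributes. The dropoff timestamp is dropped when the file is read, since together with the pickup timestamp it would reveal the target.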

This dataset contains 1,458,644 samples with 8 features for regression tasks.

Usage

This dataset is useful for:

Time series regression and duration prediction
Geospatial feature engineering
Transportation and urban mobility analysis
Large-scale regression problems
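
As an illustration of the geospatial feature engineering mentioned above, the pickup and dropoff coordinates can be combined into a single trip-distance feature. The following is a minimal sketch using only the standard library; the helper names haversine_km and add_distance, the distance_km key, and the sample coordinates are assumptions made for this example and are not part of River or of the dataset.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def add_distance(x):
    """Return a copy of a feature dict with a hypothetical `distance_km` key."""
    out = dict(x)
    out["distance_km"] = haversine_km(
        x["pickup_latitude"], x["pickup_longitude"],
        x["dropoff_latitude"], x["dropoff_longitude"],
    )
    return out


# Made-up coordinates for illustration (not taken from the dataset).
sample = {
    "pickup_longitude": -73.98, "pickup_latitude": 40.75,
    "dropoff_longitude": -73.95, "dropoff_latitude": 40.78,
}
enriched = add_distance(sample)
```

Because the function takes and returns a plain dict, it could be applied to each x yielded by the dataset before the features reach a model.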

Code Reference

Source Location

Repository: Online_ml_River
File: river/datasets/taxis.py

Signature

from river import stream
from river.datasets import base


class Taxis(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=1_458_644,
            n_features=8,
            task=base.REG,
            url="https://maxhalford.github.io/files/datasets/nyc_taxis.zip",
            size=195_271_696,
            filename="train.csv",
        )

    def _iter(self):
        return stream.iter_csv(
            self.path,
            target="trip_duration",
            converters={
                "passenger_count": int,
                "pickup_longitude": float,
                "pickup_latitude": float,
                "dropoff_longitude": float,
                "dropoff_latitude": float,
                "trip_duration": int,
            },
            parse_dates={"pickup_datetime": "%Y-%m-%d %H:%M:%S"},
            drop=["dropoff_datetime", "id"],
        )
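
To make the effect of the converters, parse_dates, drop, and target arguments concrete, the sketch below imitates that processing on a single CSV row with the standard library. It is an approximation written for this page, not River's implementation of stream.iter_csv; the iter_rows helper and the row values are made up, and only the column names that appear in the signature are used.

```python
import csv
import datetime as dt
import io

# One made-up row; column names follow the signature above.
RAW = (
    "id,pickup_datetime,dropoff_datetime,passenger_count,"
    "pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,"
    "trip_duration\n"
    "row0,2016-01-01 00:00:17,2016-01-01 00:14:26,1,"
    "-73.98,40.75,-73.95,40.78,849\n"
)

CONVERTERS = {
    "passenger_count": int,
    "pickup_longitude": float,
    "pickup_latitude": float,
    "dropoff_longitude": float,
    "dropoff_latitude": float,
    "trip_duration": int,
}
PARSE_DATES = {"pickup_datetime": "%Y-%m-%d %H:%M:%S"}
DROP = ["dropoff_datetime", "id"]
TARGET = "trip_duration"


def iter_rows(buffer):
    """Yield (features, target) pairs; a stdlib stand-in for illustration."""
    for row in csv.DictReader(buffer):
        for name in DROP:
            row.pop(name, None)  # discard unused columns
        for name, fmt in PARSE_DATES.items():
            row[name] = dt.datetime.strptime(row[name], fmt)
        for name, convert in CONVERTERS.items():
            row[name] = convert(row[name])  # str -> int / float
        y = row.pop(TARGET)  # separate the target from the features
        yield row, y


x, y = next(iter_rows(io.StringIO(RAW)))
```

Columns without a converter would remain strings under this scheme, which is why every numeric column is listed explicitly.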

Import

from river import datasets
dataset = datasets.Taxis()

I/O Contract

Inputs

Name	Type	Required	Description
(none)	—	—	No parameters needed

Outputs

Name	Type	Description
Iteration (for x, y in dataset)	tuple(dict, int)	Yields (features_dict, target) pairs where target is the trip duration in seconds

Dataset Properties

Property	Value
Number of samples	1,458,644
Number of features	8
Task	Regression
Format	CSV (train.csv, distributed as a zip archive)
Size	195,271,696 bytes (~186 MB)

Features

The dataset includes the following features:

passenger_count: Number of passengers (integer)
pickup_longitude: Longitude of pickup location (float)
pickup_latitude: Latitude of pickup location (float)
dropoff_longitude: Longitude of dropoff location (float)
dropoff_latitude: Latitude of dropoff location (float)
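pickup_datetime: Timestamp of pickup (datetime, parsed with the format %Y-%m-%d %H:%M:%S)
vendor_id: Identifier of the provider associated with the trip record (no converter is applied, so it is left as a string)
store_and_fwd_flag: Whether the trip record was held in the vehicle's memory before being sent to the vendor (no converter is applied, so it is left as a string)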
trip_duration: Duration of the trip in seconds (target variable, integer)
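
Since pickup_datetime arrives as a datetime object, simple calendar features can be derived from it before modelling. The sketch below is one possible approach; the add_time_features helper and the hour, weekday, and is_weekend keys are hypothetical names chosen for this example, and the sample values are made up.

```python
import datetime as dt


def add_time_features(x):
    """Replace `pickup_datetime` with a few numeric calendar features."""
    out = dict(x)
    ts = out.pop("pickup_datetime")
    out["hour"] = ts.hour
    out["weekday"] = ts.weekday()  # Monday is 0, Sunday is 6
    out["is_weekend"] = ts.weekday() >= 5
    return out


# Made-up sample for illustration (not taken from the dataset).
sample = {
    "pickup_datetime": dt.datetime(2016, 1, 1, 0, 0, 17),
    "passenger_count": 1,
}
features = add_time_features(sample)
```

Removing the raw datetime from the dict keeps every remaining value numeric, which is convenient for models that do not accept datetime objects.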

Usage Examples

from river import datasets

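dataset = datasets.Taxis()

# The first iteration downloads the data if it is not already available locally.
for x, y in dataset:
    # x is a dict of features, y is the trip duration in seconds
    print(x, y)
    break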
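
Because the dataset yields one (x, y) pair at a time, it suits a predict-then-learn evaluation loop, in which each sample is first used to score the model and only then to update it. The sketch below shows that loop with a running-mean baseline in plain Python. It is a stand-in written for this page, not a River model or metric; progressive_mae and toy_stream are hypothetical names, and the three targets are made up.

```python
def progressive_mae(stream):
    """Mean absolute error of a running-mean baseline, scored before each update."""
    total, count, abs_error = 0.0, 0, 0.0
    for _, y in stream:
        # Predict with what has been seen so far (0.0 before any sample).
        y_pred = total / count if count else 0.0
        abs_error += abs(y - y_pred)
        # Only now let the baseline learn from the true target.
        total += y
        count += 1
    return abs_error / count if count else 0.0


# Made-up (features, duration in seconds) pairs for illustration.
toy_stream = [({}, 600), ({}, 900), ({}, 750)]
mae = progressive_mae(toy_stream)
```

The same function accepts any iterable of (x, y) pairs, so a Taxis instance could be passed in place of toy_stream, at the cost of downloading and reading the full file.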

References

New York City Taxi Trip Duration competition on Kaggle

Related Pages

Environment:Online_ml_River_Python_Runtime_Environment
