Implementation:Online ml River Datasets Taxis

From Leeroopedia
Revision as of 16:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Online_ml_River_Datasets_Taxis.md)

Knowledge Sources
Domains: Online_Learning, Datasets, Regression
Last Updated: 2026-02-08 16:00 GMT

Overview

Taxis is a concrete regression dataset provided by the River library, available as river.datasets.Taxis.

Description

The dataset records taxi rides in New York City. The goal is to predict the duration of each ride from its pickup and dropoff coordinates, its pickup timestamp, and a few other fields.

It contains 1,458,644 samples. Each sample has 8 features and an integer target, the trip duration in seconds.

Usage

This dataset is useful for:

  • Time series regression and duration prediction
  • Geospatial feature engineering
  • Transportation and urban mobility analysis
  • Large-scale regression problems
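
As an illustration of the geospatial feature engineering mentioned above, the following is a minimal sketch that derives a great-circle distance from the pickup and dropoff coordinates. The helper names haversine_km and add_trip_distance, the distance_km key, and the sample coordinates are assumptions made for this example; none of them are part of River.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def add_trip_distance(x):
    """Return a copy of the feature dict with a hypothetical distance_km key."""
    out = dict(x)
    out["distance_km"] = haversine_km(
        x["pickup_longitude"], x["pickup_latitude"],
        x["dropoff_longitude"], x["dropoff_latitude"],
    )
    return out


# Made-up coordinates for illustration, not a real row from the dataset.
sample = {
    "pickup_longitude": -73.98, "pickup_latitude": 40.75,
    "dropoff_longitude": -73.96, "dropoff_latitude": 40.77,
}
enriched = add_trip_distance(sample)
```

A function like add_trip_distance could be applied to each x yielded by the dataset before it reaches a model.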

Code Reference

Signature

class Taxis(base.RemoteDataset):
    def __init__(self):
        super().__init__(
            n_samples=1_458_644,
            n_features=8,
            task=base.REG,
            url="https://maxhalford.github.io/files/datasets/nyc_taxis.zip",
            size=195_271_696,
            filename="train.csv",
        )

    def _iter(self):
        return stream.iter_csv(
            self.path,
            target="trip_duration",
            converters={
                "passenger_count": int,
                "pickup_longitude": float,
                "pickup_latitude": float,
                "dropoff_longitude": float,
                "dropoff_latitude": float,
                "trip_duration": int,
            },
            parse_dates={"pickup_datetime": "%Y-%m-%d %H:%M:%S"},
            drop=["dropoff_datetime", "id"],
        )
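
To make the effect of the converters, parse_dates, drop and target arguments concrete, here is a standard-library sketch that applies the same options to a small in-memory CSV. It is an approximation for illustration, not River's implementation of stream.iter_csv. The two rows are made up, and the header only includes columns named in the signature above.

```python
import csv
import io
from datetime import datetime

# Two made-up rows; only columns named in the signature are included.
RAW = """id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,trip_duration
a1,2016-01-01 00:00:17,2016-01-01 00:14:26,1,-73.98,40.75,-73.96,40.77,849
a2,2016-01-01 00:01:00,2016-01-01 00:05:00,2,-73.99,40.74,-73.98,40.73,240
"""

CONVERTERS = {
    "passenger_count": int,
    "pickup_longitude": float,
    "pickup_latitude": float,
    "dropoff_longitude": float,
    "dropoff_latitude": float,
    "trip_duration": int,
}
PARSE_DATES = {"pickup_datetime": "%Y-%m-%d %H:%M:%S"}
DROP = ["dropoff_datetime", "id"]
TARGET = "trip_duration"


def iter_rows(buffer):
    """Yield (features, target) pairs, mimicking the options passed to iter_csv."""
    for row in csv.DictReader(buffer):
        for name in DROP:
            row.pop(name, None)  # discard columns the model should not see
        for name, convert in CONVERTERS.items():
            row[name] = convert(row[name])  # CSV values start out as strings
        for name, fmt in PARSE_DATES.items():
            row[name] = datetime.strptime(row[name], fmt)
        y = row.pop(TARGET)  # the target is separated from the features
        yield row, y


pairs = list(iter_rows(io.StringIO(RAW)))
```

Any column without a converter or a date format would stay a string, which is the default behaviour of csv.DictReader.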

Import

from river import datasets
dataset = datasets.Taxis()

I/O Contract

Inputs

Taxis() takes no parameters.

Outputs

Name | Type | Description
iter(dataset) | tuple(dict, int) | Iterating over the dataset yields (features_dict, target) pairs, where target is the trip duration in seconds

Dataset Properties

Property | Value
Number of samples | 1,458,644
Number of features | 8
Task | Regression
Format | CSV (train.csv, distributed in a zip archive)
Size | 195,271,696 bytes (~195 MB, or ~186 MiB)

Features

The dataset includes the following features:

  • passenger_count: Number of passengers (integer)
  • pickup_longitude: Longitude of pickup location (float)
  • pickup_latitude: Latitude of pickup location (float)
  • dropoff_longitude: Longitude of dropoff location (float)
  • dropoff_latitude: Latitude of dropoff location (float)
  • pickup_datetime: Timestamp of pickup (datetime)
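  • Two further fields that have no converter in _iter and are therefore read as strings (in the NYC taxi CSV these are vendor_id and store_and_fwd_flag)
  • trip_duration: Duration of the trip in seconds (integer). This is the target variable and is not counted among the 8 features.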
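
Because pickup_datetime arrives as a datetime object, it usually needs to be turned into numeric features before a model can use it. The sketch below shows one possible way to do that. The function add_time_features and the keys hour, weekday and is_weekend are hypothetical names chosen for this example, and the sample dict is made up.

```python
from datetime import datetime


def add_time_features(x):
    """Return a copy of x with pickup_datetime replaced by simple calendar features."""
    out = dict(x)
    ts = out.pop("pickup_datetime")
    out["hour"] = ts.hour
    out["weekday"] = ts.weekday()  # Monday is 0, Sunday is 6
    out["is_weekend"] = ts.weekday() >= 5
    return out


# Made-up sample in the shape described by the feature list above.
sample = {"passenger_count": 1, "pickup_datetime": datetime(2016, 1, 1, 0, 0, 17)}
features = add_time_features(sample)
```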

Usage Examples

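from river import datasets

# The first iteration downloads and unpacks the archive if it is not cached yet.
dataset = datasets.Taxis()
for x, y in dataset:
    print(x, y)  # x: dict of features, y: trip duration in seconds
    break  # stop after the first sample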
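
Since the dataset is consumed one (x, y) pair at a time, a common way to use it is a predict-then-learn loop, where each sample is first used for evaluation and then for training. The following is a minimal standard-library sketch of that pattern with a running-mean baseline. RunningMean, progressive_mae and toy_stream are hypothetical names and data for this example; the method names predict_one and learn_one are chosen to resemble River's estimator interface, but no River code is used here.

```python
class RunningMean:
    """Baseline regressor that always predicts the mean of the targets seen so far."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def predict_one(self, x):
        return self.mean

    def learn_one(self, x, y):
        self.n += 1
        self.mean += (y - self.mean) / self.n  # incremental mean update


def progressive_mae(stream, model):
    """Mean absolute error where each sample is predicted before it is learned."""
    total, count = 0.0, 0
    for x, y in stream:
        y_pred = model.predict_one(x)  # evaluate first
        total += abs(y - y_pred)
        count += 1
        model.learn_one(x, y)  # then update the model
    return total / count if count else 0.0


# Made-up (features, duration in seconds) pairs standing in for the real stream.
toy_stream = [
    ({"passenger_count": 1}, 600),
    ({"passenger_count": 2}, 900),
    ({"passenger_count": 1}, 750),
]
mae = progressive_mae(toy_stream, RunningMean())
```

Under the assumption that the dataset yields pairs of the same shape, passing datasets.Taxis() in place of toy_stream would run the same loop over the real data.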
